Scientific Reports. 2025 Jul 10;15:24830. doi: 10.1038/s41598-025-05958-2

An ODE based neural network approach for PM2.5 forecasting

Md Khalid Hossen 1,2,3, Yan-Tsung Peng 2, Asher Shao 3, Meng Chang Chen 3
PMCID: PMC12246124  PMID: 40640232

Abstract

Predicting time-series data is inherently complex, spurring the development of advanced neural network approaches. Monitoring and predicting PM2.5 levels is especially challenging due to the interplay of diverse natural and anthropogenic factors influencing its dispersion, making accurate predictions both costly and intricate. A key challenge in predicting PM2.5 concentrations lies in its variability, as the data distribution fluctuates significantly over time. Meanwhile, neural networks provide a cost-effective and highly accurate solution for managing such complexities. Deep learning models like Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) have been widely applied to PM2.5 prediction tasks. However, prediction errors increase as the forecasting window expands from 1 to 72 hours, underscoring the rising uncertainty in longer-term predictions. Recurrent Neural Networks (RNNs) with continuous-time hidden states are well-suited for modeling irregularly sampled time series but struggle with long-term dependencies due to gradient vanishing or exploding, as revealed by the ordinary differential equation (ODE) based hidden state dynamics–regardless of the ODE solver used. Continuous-time neural processes, defined by differential equations, are limited by numerical solvers, restricting scalability and hindering the modeling of complex phenomena like neural dynamics–ideally addressed via closed-form solutions. In contrast to ODE-based continuous models, closed-form networks demonstrate superior scalability over traditional deep-learning approaches. As continuous-time neural networks, Neural ODEs excel in modeling the intricate dynamics of time-series data, presenting a robust alternative to traditional LSTM models. We propose two ODE-based models: a transformer-based ODE model and a closed-form ODE model. Empirical evaluations show these models significantly enhance prediction accuracy, with improvements ranging from 2.91% to 14.15% for 1-hour to 8-hour predictions when compared to LSTM-based models. Moreover, after conducting the paired t-test, the RMSE values of the proposed model (CCCFC) were found to be significantly different from those of BiLSTM, LSTM, GRU, ODE-LSTM, PCNN, and CNN-LSTM. This implies that CCCFC demonstrates a distinct performance advantage, reinforcing its effectiveness in hourly PM2.5 forecasting.

Subject terms: Environmental sciences, Environmental impact

Introduction

Air pollution, caused by the emission of various harmful pollutants, poses significant threats to human health. According to the World Health Organization, air pollution is responsible for 7 million deaths per year. Among these pollutants, fine particulate matter (PM2.5), with particles smaller than 2.5 micrometers, is a leading contributor to air quality issues, especially in developing countries, where it burdens environmental and social systems1,2. Understanding PM2.5 levels is critical as it directly influences public health and policy decisions. PM2.5 particles are small enough to penetrate the respiratory system and circulatory barriers, causing severe health complications such as respiratory and cardiovascular diseases2,3. To address this challenge, many nations have established air quality monitoring stations to collect pollutant data. However, for policymakers and the public, predicting air quality for the next hour or longer is crucial. Accurate and timely predictions enable governments to issue warnings, manage outdoor activities, and implement emergency measures, particularly in cities facing severe air pollution4,5. These forecasts support actions such as traffic control during pollution spikes and help residents optimize travel plans, thus playing a pivotal role in mitigating air quality-related risks.

Predicting PM2.5 levels is particularly challenging due to the highly dynamic and fluctuating nature of the data. Figure 1 illustrates the monthly variations in our study dataset from 2014 to 2019, highlighting the significant changes over time: PM2.5 values generally decrease from January to June and then increase from June to December, and the data vary substantially from month to month. This work is motivated by several critical factors.

  • Public safety: Accurate PM2.5 predictions are essential to advise people about the safety of outdoor activities and gatherings, especially in crowded areas.

  • Pollution Control: Reliable forecasts help governments identify and address regions with severe air pollution, enabling timely interventions.

  • Operational Challenges: Air quality monitoring stations, commonly deployed in cities around the world, require substantial operational costs and manpower to maintain.

  • Data Drifting: Abrupt changes in pollutant levels complicate prediction efforts, making it difficult for machine learning models to provide accurate forecasts6.

  • Wildfire Smoke Prediction: In wildfire-prone regions like California, PM2.5 forecasting can predict the spread of smoke plumes, aiding emergency services in implementing evacuation plans and optimizing resource allocation.

  • Agricultural Applications: Forecasting PM2.5 can help assess air quality impacts on crop yields and ecosystems, enabling better farm management and sustainability practices.

  • Tourism Management: Tourism boards can utilize hourly PM2.5 predictions to plan events and provide advisories, ensuring safer experiences for tourists and outdoor activities.

Accurately forecasting PM2.5 concentrations is essential for effective air quality management and public health guidance, allowing individuals to make well-informed travel choices, reduce exposure to PM2.5-related health hazards, and protect themselves from severe weather conditions7,8. The prediction of PM2.5 levels primarily constitutes a time-series forecasting problem, where past and present observations are utilized to estimate future concentrations9. The complexity of this task stems from the nonlinear and nonstationary nature of PM2.5 data, influenced by numerous external variables9–11. Research approaches encompass conventional time series analysis, machine learning algorithms, and deep learning techniques. An efficient system for monitoring and forecasting air pollution in advance is crucial for protecting public health and supporting informed governmental policy-making. However, the formation and behavior of PM2.5 are highly intricate, owing to the complexity of its characteristics–such as nonlinear variations across time and space12–which significantly influence prediction accuracy. Therefore, this challenge warrants thorough investigation and careful consideration13. Previous research has predominantly relied on time series-based models, such as the Autoregressive Moving Average (ARMA) and the Autoregressive Integrated Moving Average (ARIMA), for forecasting PM2.5 concentrations14. However, these approaches often struggle to effectively capture the nonlinear nature of the data, resulting in suboptimal prediction performance. Sharma et al.15 proposed a novel prediction framework that combines a multi-phase feature selection strategy–with optimal feature and lag window identification–with a Long Short-Term Memory (LSTM)-based autoencoder and a temporal convolutional autoencoder to enhance forecasting accuracy. Ting et al.16 proposed a water quality forecasting model by developing a Spatial-Temporal Graph Convolutional Network (STGCN) to enhance the understanding of spatial-temporal dependencies. To further capture the underlying characteristics of water quality data, they employed Pearson correlation analysis and seasonal decomposition techniques, which contributed to improved prediction accuracy. Sharma et al.17 proposed a novel stacked deep learning framework that integrates Bi-directional Long Short-Term Memory (Bi-LSTM) networks with 1D convolutional and max-pooling layers within each stack, enhanced by an improved Adaptive Moment Estimation algorithm and an error correction mechanism to boost accuracy and convergence speed, and evaluated its performance on four real-world datasets, including PM2.5 datasets. Table 1 lists the mean, standard deviation, minimum, and maximum values for the years 2014 to 2019. The mean value decreased slowly from 2014 to 2016, with the lowest mean recorded in 2019. However, existing methods often struggle to capture the complex temporal dynamics and non-linear relationships inherent in PM2.5 data. Recent advancements in machine learning, particularly Neural Ordinary Differential Equations (ODEs), offer a promising solution by modeling continuous-time processes. Neural ODE-based networks can potentially improve hourly forecasting accuracy by capturing smooth transitions and intricate dependencies in environmental data. The key contributions of this study are as follows. Firstly, this study combines the pollutant components, meteorological data, and data from adjacent stations over different time periods into the input variables. Secondly, the data are preprocessed by filling in missing values, encoding, and normalizing, which increases the accuracy of the hourly PM2.5 predictions.

Fig. 1. The average PM2.5 concentrations in 2014–2019.

Table 1.

The mean, standard deviation, minimum, and maximum values of PM2.5 for the years 2014–2019.

Statistic Y2014 Y2015 Y2016 Y2017 Y2018 Y2019
Mean 20.94 18.08 16.92 16.4 14.89 13.37
Std 13.88 13.23 12.09 11.35 9.942 9.06
Min 1 0 −3 1 1 −2
Max 167 131 130 122 109 112

Moreover, the objective is to propose ODE-based models, one combined with a convolutional neural network and one with the self-attention mechanism of Transformers, to effectively learn long-term dependencies and interactions among meteorological and air quality features for hourly PM2.5 prediction. By comparing the performance of different popular deep-learning methods for time series data prediction, we validated the practicality and feasibility of our proposed model in forecasting PM2.5 concentrations based on evaluation metrics. The comparative analysis confirms the effectiveness of our approach in accurately predicting hourly PM2.5 levels across forecasting horizons ranging from 1 to 72 hours.

Previous work

Neural networks for PM2.5 time series prediction

Kristiani et al.18 proposed a PM2.5 forecasting model using a sequence-to-sequence approach of LSTM. To identify important features, they employed various statistical methods, including correlation analysis, XGBoost, and chemical analysis methods. For their study, they selected data from Taichung City, spanning 2014-2018. Using sequence-to-sequence prediction in the LSTM model, the authors focused on feature selection to improve neural network predictions, finding that it reduced overfitting, improved model accuracy, and decreased training times. The model was divided into five categories, each using distinct parameters for prediction. They also observed that feature selection improved the RMSE compared to using all features indiscriminately. In19, the authors enhance stock market prediction through image encoding, pattern recognition, and ensemble learning with custom error-correction techniques, representing a novel approach in financial forecasting. Ravi et al.19 convert financial data into images to leverage image-processing methods; however, this encoding combined with machine learning requires significant computational resources, expertise, and time, which may limit its practicality for prediction. Hasnain et al.20 introduced a novel time-series ensemble approach incorporating both linear and nonlinear models, namely nonparametric autoregressive and neural network autoregressive models. Additionally, three ensemble models were developed, each utilizing a distinct weighting strategy, and all models were used for future PM2.5 prediction. However, ensemble models often require more computational resources and time than single models, which can make them less efficient for real-time forecasting if not properly optimized. Moreover, the authors of20 did not focus on hourly PM2.5 prediction; ensemble methods require long training times and large datasets, and combining multiple models can also introduce overfitting issues.

Tong et al.21 introduced a Transformer-based model for air quality prediction, demonstrating strong performance in comparison with several baseline models, including LSTM, Bi-LSTM, Linear Regression, and heuristic approaches. Their study emphasized the importance of high-resolution spatial and temporal data for accurate PM2.5 forecasting. While integrating Transformer architectures with advanced techniques such as graph neural networks and autoencoders can further enhance predictive accuracy, it also increases model complexity and demands careful design and integration. The Transformer model has recently emerged as a powerful architecture in both natural language processing (NLP) and broader machine learning tasks22. Initially designed for sequence modeling in NLP, it has since been adapted to a wide range of domains. Notable applications include language understanding with models such as BERT23, computer vision tasks via Vision Transformers (ViT)24, and object detection. Moreover, Transformers have demonstrated strong performance in decision-making for reinforcement learning25, as well as in multimodal data processing involving text and speech26. These advances highlight the model’s versatility and its capacity to handle complex, high-dimensional data across diverse fields. The authors of27 discussed a CNN-based neural network for disease prediction.

The authors of28 proposed a composite neural network model, based on a hybrid CNN-LSTM architecture, that incorporates aerosol optical depth (AOD), satellite-derived weather data, and interpolated ocean wind features. Recently, Hossen et al.6 discussed a possible solution for PM2.5 data drift by proposing a CNN-based attention model and a specialized loss function for hourly PM2.5 prediction.

Liquid time constant in time series prediction

Ordinary differential equations (ODEs) are mathematical equations that describe how a quantity changes with respect to an independent variable such as time. Recently, recurrent neural networks integrated with ODEs have been applied to time-series prediction. In29, the hidden state of the ODE, x(t), is defined as the solution of dx(t)/dt = f(x(t), t; θ), where f is a neural network parameterized by θ. ODEs facilitate RNNs by computing continuous hidden states, allowing f to depend on the hidden state x(t) and dynamically adjusting the time constant τ. The time constant characterizes the speed and responsiveness of the ODE, which is critical in learning systems. The liquid time constant (LTC) formulation can be applied within the ODE framework with an arbitrary solver. Grathwohl and Chen29 suggest that neural ODEs maintain a constant memory cost during training for each network layer by using the adjoint sensitivity method for automatic differentiation.

Continuous-time neural networks

Continuous-time neural networks represent a class of machine learning systems designed for decision making tasks, comprising differential equations that capture the dynamics of neurons and synapses, fundamental components of both biological and artificial neural networks. This includes Liquid Time Constant (LTC) networks and models that use ordinary differential equations (ODEs), which are highly expressive and capable of modeling time as a continuous vector field, transforming the time dimension of the recurrent neural network into a continuous process3032. These continuous neural networks can handle irregularly sampled data by using continuous-depth models.

Although ODE-based neural networks with gradient propagation perform competitively with advanced discretized recurrent models on relatively small datasets, their training is slow due to the computational demands of differential-equation solvers33. Training these networks, particularly for applications in time series prediction such as medical data processing and autonomous driving, can become cumbersome. However, closed-form continuous models avoid the need for intermediate numerical solvers, enhancing efficiency. Compared with other deep learning approaches, continuous models based on ODEs often demonstrate superior performance, particularly for time-series prediction. The fundamental building block of LTC networks is described by the following equations, in which s(t) represents the synaptic current and a postsynaptic neuron receives the stimulus I(t) through a nonlinear conductance-based model.

$$\frac{dx(t)}{dt} = -\frac{x(t)}{\tau} + s(t) \tag{1}$$
$$s(t) = f\big(x(t), I(t)\big)\,\big(A - x(t)\big) \tag{2}$$
$$\frac{dx(t)}{dt} = -\left[\frac{1}{\tau} + f\big(x(t), I(t)\big)\right] x(t) + f\big(x(t), I(t)\big)\,A \tag{3}$$

Here, τ is the time constant of the postsynaptic neuron, x(t) represents the postsynaptic neuron potential, A is the synaptic reversal potential, and f(·) denotes the nonlinearity of synaptic release.
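To make the dynamics in Eqs. (1)–(3) concrete, the following minimal NumPy sketch integrates the LTC state with an explicit Euler solver. The single tanh layer standing in for f, the toy dimensions, and the constant τ and A are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def f(x, I, W, b):
    # Illustrative stand-in for the synaptic nonlinearity f(x(t), I(t)): one tanh layer.
    return np.tanh(W @ np.concatenate([x, I]) + b)

def ltc_euler_step(x, I, tau, A, W, b, dt=0.1):
    # One explicit-Euler step of Eq. (3):
    # dx/dt = -[1/tau + f(x, I)] * x + f(x, I) * A
    s = f(x, I, W, b)
    return x + dt * (-(1.0 / tau + s) * x + s * A)

rng = np.random.default_rng(0)
x = np.zeros(4)                              # 4 hidden neurons (toy size)
W, b = 0.1 * rng.normal(size=(4, 7)), np.zeros(4)
for _ in range(72):                          # e.g. one 72-hour input window
    I_t = rng.normal(size=3)                 # placeholder meteorological features
    x = ltc_euler_step(x, I_t, tau=1.0, A=1.0, W=W, b=b)
```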

Novelty of the approach for the TRCFC and CCCFC models compared with LSTM

Unlike Transformers, LSTMs do not inherently include mechanisms to focus on specific parts of the input sequence, which can be crucial for multivariate forecasting tasks. Transformers leverage self-attention mechanisms to dynamically assess the importance of different input elements relative to each other22. This capability is particularly advantageous for multivariate time series forecasting, where various variables exert differing influences on PM2.5 levels over time. Unlike LSTMs, which process sequences sequentially, Transformers can process input sequences in parallel. This parallelization significantly speeds up training times, making them more suitable for large datasets. While LSTMs are designed to handle long-term dependencies, they can struggle with very long sequences. Transformers, with their attention mechanisms, can more effectively capture complex patterns across long sequences without the need for sequential processing. Closed-form ODE models can provide smooth interpolations and extrapolations between observed data points, which is particularly useful for predicting PM2.5 levels at unobserved times or locations34. ODE models can be more interpretable than LSTM models because they explicitly describe the dynamics of the system. This interpretability can help in understanding how different factors influence PM2.5 levels.

Modeling continuous-time dynamics is fundamental in machine learning, control theory, and dynamical systems35–38. Continuous-time neural networks (CTNNs) using neural ODEs have expanded these applications by enabling vector field representations that cannot be captured by traditional discrete neural networks33,39. This approach supports flexible density estimation29 and allows efficient modeling of sequential and irregularly sampled data30,33,40. ODE-based networks rely on sophisticated solvers, which can affect efficiency, resilience41, and performance30. Methods for stabilizing gradient propagation have been developed to enhance their effectiveness, particularly for time series tasks42–44.

Existing models

LSTM, BILSTM, GRU, CNN-LSTM and PCNN

BiLSTM is a technique that processes the input in both the forward and backward directions, allowing it to capture both past and future information45. The Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) architecture widely used for sequential data processing. Compared to other architectures, such as LSTM, GRUs are generally more computationally efficient due to their simplified structure, making them a popular choice for applications including time series prediction and natural language processing46,47. In the study conducted by Bekkar et al.13, the authors evaluated the performance differences among various deep learning algorithms–including LSTM48, Bi-LSTM49, GRU, Bi-GRU, CNN, and a hybrid CNN-LSTM model–for the prediction of PM2.5 concentrations.

The Parallel Common Combined CNN (PCNN) model, illustrated in Fig. 2, is designed to integrate the outputs of neural network components for repetitive predictive tasks such as PM2.5 forecasting50. In this model, annual air quality data were used as input to evaluate and compare the performance of the PCNN model against the proposed architectures. As shown in Fig. 2, the input data is first processed through a 1D Convolutional Neural Network (CNN1D) layer, followed by a Bidirectional Long Short-Term Memory (BiLSTM) layer and a Dense layer. In the final stage, the outputs from these layers are concatenated to generate the final predictions. During model training, a total of 64,221 parameters were involved, all of which were trainable, indicating that the entire network contributed to learning without any frozen layers.

Fig. 2. The block diagram of the PCNN model.

ODE-LSTM

An ordinary LSTM cell can be seen in Fig. 3, where the memory cell and hidden state are written as c(t) and h(t). In Fig. 3, the ODE-LSTM (Ordinary Differential Equation LSTM) block denotes an architecture that combines the strengths of the LSTM and continuous-time dynamics, allowing it to effectively model irregularly sampled time series data and overcome some limitations of traditional LSTMs. The initial ODE51 can be written as

$$\frac{dh(t)}{dt} = f_{\theta}\big(h(t), t\big) \tag{4}$$

where h(t) is the hidden state, t is time, and f_θ is the function parameterized by θ that updates the hidden state. After combining the LSTM cell and the ODE solver, the main equations for the ODE-LSTM are as follows:

$$c(t),\; h(t) = \mathrm{LSTM}\big(x(t),\, c(t{-}1),\, h(t{-}1)\big) \tag{5}$$
$$h(t) \leftarrow \mathrm{ODESolve}\big(f_{\theta},\, h(t),\, \Delta t\big) \tag{6}$$

where ODESolve solves the continuous dynamics of the hidden state using the ODE function f_θ, and Δt is the time interval over which the ODE is solved. The output h(t) is regarded as the final output of the ODE-LSTM model. As illustrated in Fig. 3, the input data is first processed through the LSTM layer, which captures temporal dependencies in the sequence. The resulting hidden states are then passed through an ODE solver, which models the continuous dynamics of the system. In the final stage, the processed hidden states are fed into a Dense layer to transform them into the desired output space.
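A minimal NumPy sketch of Eqs. (5)–(6): a standard LSTM cell updates (h, c), and an Euler-discretized ODESolve then evolves the hidden state over the interval Δt. The tanh network used for f_θ, the layer sizes, and the random weights are illustrative assumptions, not the published implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h, c, W, U, b):
    # Standard LSTM cell; gate pre-activations are stacked as [i, f, o, g].
    n = h.size
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2*n]), sigmoid(z[2*n:3*n])
    g = np.tanh(z[3*n:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def ode_solve(f_theta, h, dt, steps=4):
    # Euler-discretized ODESolve(f_theta, h, dt) for the continuous hidden state.
    for _ in range(steps):
        h = h + (dt / steps) * f_theta(h)
    return h

rng = np.random.default_rng(1)
n_in, n_hid = 10, 8                                # 10 input features, toy hidden size
W = 0.1 * rng.normal(size=(4 * n_hid, n_in))
U = 0.1 * rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
A = 0.1 * rng.normal(size=(n_hid, n_hid))
f_theta = lambda h: np.tanh(A @ h)                 # illustrative f_theta

h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(72):                                # one 72-hour window of inputs
    x_t = rng.normal(size=n_in)
    h, c = lstm_cell(x_t, h, c, W, U, b)           # Eq. (5)
    h = ode_solve(f_theta, h, dt=1.0)              # Eq. (6)
```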

Fig. 3. The ODE-LSTM architecture.

Common convolutional closed-form continuous-time neural networks (CCCFC)

The primary differential equation used follows a general form31. For hourly PM2.5 prediction, meteorological and temporal data are used in the experiments to predict future PM2.5 concentrations x(t) at a specific time t. The main differential equation is of the general form31

$$\frac{dx(t)}{dt} = f\big(x(t), t;\, \theta\big) \tag{7}$$

where x(t) represents the hidden state capturing the system dynamics at time t, f is a neural network parameterized by θ, and t indexes the time samples. The initial condition for the hidden state x(t_0) at the starting time is

$$x(t_0) = x_0 \tag{8}$$

To obtain closed-form solutions to the differential equations, we consider the expression:

$$x(t) = x(t_0) + \int_{t_0}^{t} f\big(x(s), s;\, \theta\big)\, ds \tag{9}$$

Thus, while the integration is typically approximated numerically, closed-form solutions for continuous-time neural networks aim to provide direct analytical expressions. To extract features from the input datasets, let A represent the input data, which can be processed by a CNN as CNN(A). The closed-form solution for continuous-time neural networks34 can then be written as:

graphic file with name 41598_2025_5958_Article_Equ10.gif 10

We use a fully connected neural network (FC) as the last layer of CCCFC:

graphic file with name 41598_2025_5958_Article_Equ11.gif 11

Transformer model

The Transformer model architecture, as illustrated in Fig. 4, consists of a stacked encoder-decoder framework that leverages multi-head self-attention and feed-forward layers. In our experiment, we propose a transformer-based model for PM2.5 prediction. The encoder processes input sequences to capture contextual relationships, while the decoder generates outputs by attending to both the encoder outputs and previously generated tokens. The details are described in the sections below.

Fig. 4. The block diagram for the transformer architecture.

Self-attention mechanism

The Transformer self-attention mechanism computes attention weights between every pair of time steps. First, the input sequence is projected into query, key, and value vectors, represented as Q, K, and V, with their corresponding learnable weight matrices W^Q, W^K, and W^V. The attention function is defined as follows22, where d_k is the dimension of K.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \tag{12}$$
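A NumPy sketch of Eq. (12) for a single attention head; the sequence length 72 and feature dimension 10 mirror the example input shape used later, and the random projection matrices are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 10))                                # 72 hourly steps, 10 features
Wq, Wk, Wv = (rng.normal(size=(10, 10)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)   # shape (72, 10)
```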

Multi-head attention

To enable the model to focus on different parts of the sequence, multi-head attention is used:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \tag{13}$$

where each head_i is an attention output; all heads are concatenated and multiplied by a learned weight matrix W^O.

Positional encoding

To account for the Transformer’s permutation-invariance, we introduce positional encodings to preserve the temporal order of the sequence:

$$PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right) \tag{14}$$

Then, the input sequence is modified as follows:

$$X' = X + PE \tag{15}$$
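A NumPy sketch of the sinusoidal positional encoding in Eq. (14) and its addition to the input sequence in Eq. (15); the (72, 10) shape follows the example input shape in the experiment tables.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

X = np.random.randn(72, 10)                    # hourly input window
X_encoded = X + positional_encoding(72, 10)    # Eq. (15)
```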

Prediction layer

The model generates a prediction by passing the hidden state through a regression layer:

$$\hat{y} = W_{o}\, h + b_{o} \tag{16}$$

where W_o is a learnable weight matrix and b_o is a bias term.

Transformer model with CfC (TRCFC)

The hidden state of the CfC, h(t), evolves according to the differential equation from Eq. (10):

graphic file with name 41598_2025_5958_Article_Equ24.gif

Here, h(t) is the initial state. Each transformer block consists of self-attention and feed-forward layers with residual connections. The self-attention output for token j in block i is:

graphic file with name 41598_2025_5958_Article_Equ17.gif 17

The feed-forward update is:

graphic file with name 41598_2025_5958_Article_Equ18.gif 18

where H is the number of attention heads and FFN is the feed-forward network. The neuron state x(t) in a CfC cell is approximated by the equation below22:

$$x(t) = (x_0 - A)\, e^{-\left[\frac{1}{\tau} + f(I(t);\,\theta)\right] t}\, f\big(-I(t);\,\theta\big) + A \tag{19}$$

where x_0 is the initial state, A is a constant offset, τ is a time constant, I(t) is the input at time t, and f is a nonlinear function parameterized by θ. The continuous-time transformer model is described by:

graphic file with name 41598_2025_5958_Article_Equ20.gif 20

where f is composed of transformer blocks as described above.
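To illustrate the closed-form neuron state in Eq. (19), the sketch below evaluates x(t) = (x0 − A) e^{−[1/τ + f(I;θ)]t} f(−I;θ) + A for a constant input, with a simple sigmoid standing in for the learned nonlinearity f; all parameter values are illustrative assumptions rather than trained quantities.

```python
import numpy as np

def f(I, w=0.5, b=0.0):
    # Placeholder for the learned nonlinearity f(.; theta).
    return 1.0 / (1.0 + np.exp(-(w * I + b)))

def cfc_state(t, I, x0=0.0, A=1.0, tau=1.0):
    # Closed-form approximation of the neuron state (Eq. 19): no ODE solver needed.
    return (x0 - A) * np.exp(-(1.0 / tau + f(I)) * t) * f(-I) + A

t = np.linspace(0.0, 10.0, 100)
trajectory = cfc_state(t, I=0.8)   # smooth continuous-time trajectory for a constant input
```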

Proposed CCCFC and TRCFC models

Continuous-time neural network (CfC)

The CfC module34 models the continuous-time dynamics of the input, updating it over a time interval Δt. This process can be expressed mathematically as follows:

$$h(t + \Delta t) = \sigma\big(W_{h}\, h(t) + W_{x}\, x(t)\big) \tag{21}$$

where W_h and W_x are learnable weight matrices, and σ is a non-linear activation function such as tanh.

Common convolutional closed-form continuous-time neural networks (CCCFC) model

The CCCFC model combines the strengths of the CNN and CfC models, allowing the CNN to capture variations in PM2.5 concentrations across different years, while the CfC effectively models the continuous characteristics of the data. The architecture of the CCCFC model is illustrated in Fig. 5. For model training, key features were extracted from the yearly data and concatenated over four years. These data were then input into a CNN1D model, followed by max-pooling with a MaxPooling1D layer, and subsequently passed through a Dense layer. The output of this sequence was fed into the Closed-form Continuous-time (CfC)34 model. As shown in Fig. 5, the CfC model also has a forget gate f, a hidden gate h, and an input gate i. The forget gate controls how much of the previous state should be forgotten or retained, the input gate determines how much new information should be added to the current state, and the hidden gate decides what portion of the current memory is passed on as the output for the next time step or as the final output of the model. The model output was then evaluated using a loss function, with gradient updates applied with convergence checks. The final output was assessed using mean squared error (MSE), a commonly used metric to evaluate model performance by calculating the average squared differences between observed values and predicted values.
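A minimal Keras sketch of the CCCFC pipeline described above, with the layer list taken from Table 3 (Conv1D, MaxPooling1D, BiLSTM, CfC, Dense). The CfC layer is assumed to come from the open-source ncps package; layer sizes are illustrative, and this is a reconstruction under those assumptions rather than the authors' exact code.

```python
import tensorflow as tf
from ncps.tf import CfC   # assumption: Keras-compatible CfC layer from the ncps package

def build_cccfc(timesteps=72, n_features=10):
    inputs = tf.keras.Input(shape=(timesteps, n_features))
    x = tf.keras.layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling1D(pool_size=2)(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = CfC(64)(x)                                 # closed-form continuous-time layer
    outputs = tf.keras.layers.Dense(1, activation="linear")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model

model = build_cccfc()
```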

Fig. 5. The architecture of the CCCFC model.

Transformer-closed form continuous (TRCFC) model

The difference between TRCFC, shown in Fig. 6, and CCCFC is that TRCFC replaces the CNN in CCCFC with a Transformer. The transformer-based model follows the original Transformer architecture22, consisting of encoder, decoder, and attention layers. The encoder is composed of an input layer that projects the time-series data to the model dimension through a fully connected network, followed by a multi-head attention mechanism whose output vector serves as the input to the decoder. The decoder likewise combines an input layer, identical decoder layers, and an output layer. This output layer feeds the closed-form continuous-time network, which is completed with time-distributed layers for hourly time-series predictions. The RMSE values calculated from these models are listed in Tables 7, 8 and 11.
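A minimal Keras sketch of the TRCFC idea: one Transformer encoder block (multi-head self-attention plus a feed-forward network, with residual connections and layer normalization) followed by a CfC layer and a TimeDistributed Dense output. Hyperparameters mirror Table 4; as before, the CfC layer is assumed to come from the ncps package, and the sketch is illustrative rather than the authors' exact architecture.

```python
import tensorflow as tf
from ncps.tf import CfC   # assumption: Keras-compatible CfC layer

def build_trcfc(timesteps=72, n_features=10, num_heads=4, ff_dim=128, cfc_units=64):
    inputs = tf.keras.Input(shape=(timesteps, n_features))
    attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=n_features,
                                              dropout=0.1)(inputs, inputs)
    x = tf.keras.layers.LayerNormalization()(inputs + attn)        # residual + norm
    ffn = tf.keras.layers.Dense(ff_dim, activation="relu")(x)
    ffn = tf.keras.layers.Dense(n_features)(ffn)
    x = tf.keras.layers.LayerNormalization()(x + ffn)              # residual + norm
    x = CfC(cfc_units, return_sequences=True)(x)                   # continuous-time layer
    outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1))(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```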

Fig. 6. The framework of the TRCFC model.

Table 7.

The average RMSE for all models applied to the EPA stations.

Average RMSE for all stations
Model 1h 8h 16h 24h 32h 40h 48h 56h 64h 72h
BILSTM 4.63 6.26 7.37 7.68 8.02 8.24 8.38 8.49 8.24 8.8
LSTM 4.67 7.12 7.79 8.03 8.2 8.9 8.86 8.58 8.62 8.8
GRU 4.63 7.11 7.76 8.03 8.2 8.81 8.8 8.52 8.62 8.76
PCNN 4.39 7.07 7.56 8.26 8.33 8.02 8.19 8.33 8.57 8.8
ODE-LSTM 4.34 7.09 7.3 7.6 7.73 7.87 8.44 8.53 8.57 8.72
CNN-LSTM 4.31 6.31 7.1 7.46 7.77 7.95 8.09 8.27 8.43 8.66
CCCFC 4.56 6.25 6.87 7.34 7.55 7.73 7.76 8.13 8.31 8.55
TRCFC 4.33 6.46 7.19 7.57 7.86 8.02 8.07 8.14 8.27 8.47

Bold values indicate cases where our proposed model performs better than the other models.

Table 8.

RMSE values of the baseline models and the proposed models.

Station name(1): Banqiao
Model 1h 8h 16h 24h 32h 40h 48h 56h 64h 72h
BILSTM 4.64 6.58 7.62 8.34 8.58 8.9 9.11 9.18 9.61 10.18
LSTM 4.61 7.25 8.3 8.71 8.55 9.5 9.58 9.71 9.73 10
GRU 4.59 7.29 8.27 8.75 8.6 9.28 9.51 9.81 9.83 9.8
PCNN 4.31 6.35 7.43 8.06 8.21 8.22 8.76 9.12 9.41 9.35
ODE-LSTM 4.35 7.28 8.34 8.68 8.79 9.03 9.17 9.18 9.18 9.48
CNN-LSTM 3.91 6.55 7.47 8.12 8.49 8.38 8.54 9.17 8.98 10.04
CCCFC 4.08 6.16 7.05 7.79 7.92 8.25 8.17 8.56 8.98 9.3
TRCFC 4.28 6.43 7.67 8.26 8.6 8.63 8.67 8.792 8.78 9.21

Table 11.

The average MAE for all models applied in EPA stations.

Average MAE for all stations
Model 1h 8h 16h 24h 32h 40h 48h 56h 64h 72h
BILSTM 3.29 4.57 5.15 5.52 5.84 6.23 6.48 6.52 6.78 7.14
LSTM 3.36 5.32 5.91 6.08 6.28 6.34 6.39 6.56 6.57 6.59
GRU 3.36 5.36 5.93 6.08 6.29 6.38 6.39 6.56 6.57 6.56
PCNN 3.08 5.35 5.9 6.18 6.43 6.56 6.6 6.68 6.72 6.69
ODE-LSTM 3.81 5.42 5.8 6.46 6.47 6.22 6.41 6.42 7.06 6.81
CNN-LSTM 3.05 4.61 5.29 5.63 5.91 6.09 6.61 6.42 6.55 6.71
CCCFC 3.24 5.05 5.61 5.83 6.05 6.2 6.24 6.47 6.5 6.52
TRCFC 3.07 4.81 5.41 5.78 6.03 6.2 6.36 6.42 6.47 6.59

Experimental design

To enhance model performance, we utilized a systematic approach: Grid Search for tuning discrete parameters such as the number of layers and units, and Random Search for optimizing continuous parameters such as the learning rate.
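A hedged sketch of how such a random search could be wired up with Keras Tuner, using the search space of Table 3 (units, dropout, learning rate, 10 trials, validation-loss objective). The simple LSTM-based build function and the random arrays are placeholders for the actual CCCFC builder and data, not the authors' tuning script.

```python
import numpy as np
import tensorflow as tf
import keras_tuner as kt

def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(72, 10)),
        tf.keras.layers.LSTM(hp.Choice("units", [32, 64, 128])),
        tf.keras.layers.Dropout(hp.Choice("dropout", [0.2, 0.3, 0.5])),
        tf.keras.layers.Dense(1, activation="linear"),
    ])
    lr = hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
    return model

tuner = kt.RandomSearch(build_model, objective="val_loss",
                        max_trials=10, executions_per_trial=1)

X, y = np.random.rand(256, 72, 10), np.random.rand(256, 1)   # dummy data
tuner.search(X, y, validation_split=0.2, epochs=5,
             callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)])
```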

The experimental setup aims to evaluate the effectiveness of a hybrid deep learning model, called the CCCFC model, consisting of Conv1D, BiLSTM, and Closed-form Continuous-time (CfC) layers for PM2.5 prediction. The model was trained and tested on PM2.5 time-series data with a well-defined preprocessing pipeline. Table 3 lists the parameters and the rationale for choosing them to obtain good results with our proposed models.

Table 3.

CCCFC model hyperparameter settings (random search).

Parameter Value
Batch size 190
Number of layers Conv1D (1), BiLSTM (1), CfC (1), Dense (1)
LSTM units [32, 64, 128]
CfC units [32, 64, 128]
Pool size [2, 3]
Dropout rate [0.2, 0.3, 0.5]
Activation functions ReLU (Conv1D, LSTM, Dense), Linear (Output)
Learning rate [1e-2, 1e-3, 1e-4]
Optimizer Adam
Loss function Mean Squared Error (MSE)
Objective metric Validation Loss (val_loss)
Max trials 10
Executions per trial 1
Epochs Up to 100 (with early stopping)
Input shape (72, 10) (example)
Number of parameters 45759

Hyperparameter settings for all baseline models, CCCFC, and TRCFC

In Table 2, the hyperparameter settings for all baseline models–BILSTM, LSTM, GRU, CNN-LSTM, ODE-LSTM, and PCNN–are presented. In all cases, the models are trained with early stopping and run until the best results are obtained, with the number of epochs typically reaching approximately the optimal value before training halts. The CCCFC model hyperparameters, shown in Table 3, are carefully chosen to optimize the model’s performance for hourly PM2.5 prediction. The model consists of multiple layers, including a Conv1D layer for feature extraction, a Bidirectional LSTM to capture temporal dependencies, a CfC layer to model continuous-time patterns, and a Dense layer for the final output. Key hyperparameters include the number of units in the LSTM and CfC layers, the pool size for downsampling, and the dropout rate to prevent overfitting. The learning rate is adjusted during training, and Adam is used as the optimizer for efficient gradient-based updates. The model is evaluated using Mean Squared Error (MSE) as the loss function, with early stopping to prevent excessive training. The hyperparameters are tuned using Random Search, with 10 trials to explore various configurations, aiming to minimize the validation loss. This configuration ensures the model is flexible yet stable, capable of learning complex patterns in sequential data. Table 4 shows the hyperparameters for the TRCFC model, which combines the Transformer with the CfC-based model; detailed descriptions are given there. As discussed before, the CfC component in Table 4 consists of a forget gate f, a hidden gate h, and an input gate i. Training was conducted on an NVIDIA A100 GPU with 16 GB of memory, paired with a 32-core Intel Xeon CPU and 64 GB of system RAM. Each model took approximately 1 to 4 hours to train, depending on its configuration and dataset size. In total, the hyperparameter tuning process consumed approximately 300-400 GPU/CPU hours. Model training and resource usage were monitored using nvidia-smi and logged using TensorBoard. Closed-form Continuous-time (CfC) architectures are, in some implementations, optimized for CPU-only execution owing to their efficient design and the lack of GPU kernel support, and the resource description above should be read with that in mind.

Table 2.

Hyperparameter settings for different models.

Hyperparameter BiLSTM GRU LSTM CNN-LSTM ODE-LSTM PCNN
Units / Filters 64–128 64–128 64–128 Conv: 64–100; LSTM: 64 64–128 64–128
Kernel size N/A N/A N/A 3, 5, 10 N/A 3, 5, 10
Dropout 0.1–0.5 0.1–0.5 0.1–0.5 0.1–0.5 0.1–0.5 0.1–0.5
Recurrent dropout 0.1–0.5 0.1–0.5 0.1–0.5 N/A 0.1–0.5 N/A
Learning rate 1e-4–1e-3 1e-4–1e-3 1e-4–1e-3 1e-4–1e-3 1e-4–1e-3 1e-4–1e-3
Optimizer Adam Adam Adam Adam Adam Adam
Batch size 32, 64 32, 64 32, 64 32, 64 32, 64 32, 64
Epochs 30–100 30–100 30–100 30–100 30–100 30–100
Params. 142593 53889 71279 66601 27357 64221

Table 4.

TRCFC model parameter settings.

Parameter Value
Batch size 190
Number of layers Transformer Encoder (1), CfC (1), TimeDistributed Dense (1)
Number of attention heads 4
Key dimension (key_dim) 10
Feed-forward dimension (ff_dim) 128
Hidden units (CfC Layer) 64
Activation functions ReLU (FFN), Linear (Output)
Dropout rate 0.1
Optimizer Adam
Loss function Mean squared error (MSE)
Input shape (72, 10) (Example)
Number of parameters 47157
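For completeness, the following illustrative snippet shows a training call consistent with the setup described above (Adam optimizer, MSE loss, batch size 190, early stopping on validation loss, TensorBoard logging). The small stand-in LSTM model and the random arrays are placeholders, not the trained CCCFC or TRCFC models.

```python
import numpy as np
import tensorflow as tf

# Dummy stand-ins for the real hourly windows; shapes follow the (72, 10) example input.
X_train, y_train = np.random.rand(512, 72, 10), np.random.rand(512, 1)
X_val, y_val = np.random.rand(128, 72, 10), np.random.rand(128, 1)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(72, 10)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    tf.keras.callbacks.TensorBoard(log_dir="logs/pm25"),
]
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, batch_size=190, callbacks=callbacks)
```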

Algorithm description

The model evolves hidden states continuously using CfC dynamics and refines them with Transformer blocks to capture long-range dependencies. The overall procedure is summarized in Algorithm 1. The input data and its timesteps are initialized in the model.

Algorithm 1. Transformer with closed-form continuous-time network (TRCFC).

Empirical studies of PM2.5 prediction

Datasets and processing description

For the Taipei metropolitan area, we utilized data collected from all 18 Environmental Protection Administration (EPA) air quality monitoring stations6. Information regarding these stations can be found on the EPA website (https://airtw.moenv.gov.tw/ENG/Sitemap.aspx). Our dataset includes the geographical coordinates of these stations, allowing for their spatial representation. For instance, the Xizhi station is located at latitude 320 and longitude 165.

The air quality data, along with meteorological data obtained hourly from the Central Weather Bureau (CWB) website (www.cwa.gov.tw/eng/), form the basis of our analysis. Prior to modeling, the air quality data underwent preprocessing. Outliers in PM2.5 values were managed using the Z-score method, with any value exceeding 300 μg/m³ being capped at this maximum. Missing PM2.5 values were imputed using the average PM2.5 concentration for the corresponding month. For our predictive modeling, we selected ten relevant features from these datasets. The performance of our models was evaluated using the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) metrics. The models were trained to predict PM2.5 levels using both meteorological and historical air quality data.

The training dataset encompasses hourly data from January 1, 2014, to December 31, 2019, while data from January 1, 2020, to December 31, 2020, were reserved for testing. The preprocessing of the historical data involved first identifying records with 25 or fewer features, separating them, and then extracting the relevant features for model training. Finally, the dataset was chronologically sorted for each timestep to serve as input for the models, as depicted in Fig. 7.
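A hedged pandas/NumPy sketch of the preprocessing pipeline described above: capping extreme PM2.5 values at 300, imputing missing values with the monthly mean, normalizing, and slicing the chronologically sorted record into 72-hour input windows with a 1-hour-ahead target. Column names and the z-score normalization detail are assumptions for illustration, not the authors' exact schema.

```python
import numpy as np
import pandas as pd

def make_windows(df: pd.DataFrame, features, window=72, horizon=1):
    df = df.sort_values("datetime").copy()                  # chronological order
    df["PM2.5"] = df["PM2.5"].clip(upper=300)               # cap extreme outliers
    month = df["datetime"].dt.month
    monthly_mean = df.groupby(month)["PM2.5"].transform("mean")
    df["PM2.5"] = df["PM2.5"].fillna(monthly_mean)          # impute with monthly mean
    values = (df[features] - df[features].mean()) / df[features].std()   # normalize
    arr, target = values.to_numpy(), df["PM2.5"].to_numpy()
    X, y = [], []
    for i in range(len(arr) - window - horizon + 1):
        X.append(arr[i:i + window])                         # past 72 hours of features
        y.append(target[i + window + horizon - 1])          # PM2.5 `horizon` hours ahead
    return np.asarray(X), np.asarray(y)
```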

Fig. 7. Data processing steps of PM2.5 prediction.

Evaluation metrics

To evaluate the performance of both the baseline models and the proposed models, we employed the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) metrics. The Root Mean Square Error (RMSE) is calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2}$$

The Mean Absolute Error (MAE) is calculated as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\big|y_i - \hat{y}_i\big|$$

where y_i is the actual value, ŷ_i is the predicted value, and n is the total number of observations.

Mean absolute percentage error (MAPE) is a widely used metric to evaluate forecasting accuracy. It measures the average absolute percentage difference between the predicted and actual values and is defined as:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{22}$$

A lower MAPE indicates better predictive performance. However, it may be unreliable when actual values are close to zero.

The coefficient of determination R² measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It is given by:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2}{\sum_{i=1}^{n}\big(y_i - \bar{y}\big)^2} \tag{23}$$

An R² value of 1 indicates perfect prediction, while a value of 0 implies that the model does no better than the mean of the observed data. Negative values suggest that the model performs worse than a simple mean-based prediction.
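Reference NumPy implementations of the four metrics defined above (RMSE, MAE, MAPE, and R²), included only to make the definitions concrete.

```python
import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

def mape(y, y_hat):
    # Can be unstable when actual values are close to zero.
    return float(100.0 * np.mean(np.abs((y - y_hat) / y)))

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```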

Prediction result analysis

RMSE prediction result analysis

Among the 18 EPA monitoring stations, we selected two representative ones for further analysis: Banqiao and Shilin. Banqiao lies on the outskirts of the city, and Shilin is in a suburban area, each location having unique geographic characteristics. As shown in Fig. 8a, RMSE values increase as the forecast horizon grows, with the lowest RMSE at +1 hour and the highest at +72 hours. At Banqiao station, the proposed models, CCCFC and TRCFC, outperform other models, including LSTM, BiLSTM, PCNN, and ODE-LSTM. Figure 8b shows similar results for the Guting station, where CCCFC and TRCFC also deliver superior performance compared to traditional models. In Fig. 8a,b, the results follow a consistent trend: as the forecast horizon extends from +1 hour to +72 hours, the RMSE values increase correspondingly, while CCCFC and TRCFC remain the stronger performers over time.

Fig. 8. The RMSE performance analysis in different stations.

Figure 9 illustrates the average RMSE of each model across stations at different time intervals. It is evident that our proposed CCCFC model achieves the lowest RMSE compared to the ODE-LSTM, PCNN, BILSTM, LSTM, and CNN-LSTM models. Across all stations, CCCFC and TRCFC consistently outperform baseline models, reinforcing the robustness of our approach. In Table 9, the performance of the BILSTM, LSTM, GRU, PCNN, CNN-LSTM and ODE-LSTM models is compared to CCCFC in terms of RMSE across prediction horizons from 1h to 72h. BILSTM exhibits slight fluctuations, with a maximum of 7.99%. LSTM shows a relatively high initial percentage of 4.74% at 1h, which increases dramatically to 15.14% at 8h, remaining stable thereafter. PCNN has a maximum change of 12.53%, with less variation over an extended time. CNN-LSTM has a maximum change of 8.66% for the +72h prediction. In all cases, the proposed CCCFC and TRCFC models yield more favorable results than the other models used in the study. Tables 5 and 6 show the 95% confidence intervals of the average RMSE values, from which the appropriate values were chosen.

Fig. 9. Average RMSE performance of different models in all stations.

Table 9.

Performance comparison with CCCFC (positive values indicate worse performance and negative values indicate better performance).

Percentage (%) increase compared with the CCCFC model
Model 1h 8h 16h 24h 32h 40h 48h 56h 64h 72h
BILSTM 1.54 0.16 7.28 4.63 6.23 6.6 7.99 4.43 −0.84 2.92
LSTM 2.41 13.92 13.39 9.4 8.61 15.14 14.18 5.54 3.73 2.92
GRU 1.54 13.76 12.95 9.4 8.61 13.97 13.4 4.8 3.73 2.46
PCNN −3.73 13.12 10.04 12.53 10.33 3.75 5.54 2.46 3.13 2.92
ODE-LSTM −4.82 13.44 6.26 3.54 2.38 1.81 8.76 4.92 3.13 1.99
CNN-LSTM −5.48 0.96 3.35 1.63 2.91 2.85 4.25 1.72 1.44 1.29
TRCFC −5.04 3.36 4.66 3.13 4.11 3.75 3.99 0.12 −0.48 −0.94
Table 5.

RMSE with 95% confidence intervals (±0.10) for forecasting horizons 1h to 40h.

Model 1h 8h 16h 24h 32h 40h
BILSTM (4.53, 4.73) ±0.10 (6.16, 6.36) ±0.10 (7.27, 7.47) ±0.10 (7.58, 7.78) ±0.10 (7.92, 8.12) ±0.10 (8.14, 8.34) ±0.10
CCCFC (4.46, 4.66) ±0.10 (6.15, 6.35) ±0.10 (6.77, 6.97) ±0.10 (7.24, 7.44) ±0.10 (7.45, 7.65) ±0.10 (7.63, 7.83) ±0.10
CNN-LSTM (4.21, 4.41) ±0.10 (6.21, 6.41) ±0.10 (7.00, 7.20) ±0.10 (7.36, 7.56) ±0.10 (7.67, 7.87) ±0.10 (7.85, 8.05) ±0.10
GRU (4.53, 4.73) ±0.10 (7.01, 7.21) ±0.10 (7.66, 7.86) ±0.10 (7.93, 8.13) ±0.10 (8.10, 8.30) ±0.10 (8.71, 8.91) ±0.10
LSTM (4.57, 4.77) ±0.10 (7.02, 7.22) ±0.10 (7.69, 7.89) ±0.10 (7.93, 8.13) ±0.10 (8.10, 8.30) ±0.10 (8.80, 9.00) ±0.10
ODE-LSTM (4.29, 4.49) ±0.10 (6.97, 7.17) ±0.10 (7.46, 7.66) ±0.10 (8.16, 8.36) ±0.10 (8.23, 8.43) ±0.10 (7.92, 8.12) ±0.10
PCNN (4.24, 4.44) ±0.10 (6.99, 7.19) ±0.10 (7.20, 7.40) ±0.10 (7.50, 7.70) ±0.10 (7.63, 7.83) ±0.10 (7.77, 7.97) ±0.10
TRCFC (4.23, 4.43) ±0.10 (6.36, 6.56) ±0.10 (7.09, 7.29) ±0.10 (7.47, 7.67) ±0.10 (7.76, 7.96) ±0.10 (7.92, 8.12) ±0.10
Table 6.

RMSE with 95% confidence intervals (±0.10) for forecasting horizons 48h to 72h.

Model 48h 56h 64h 72h
BILSTM (8.28, 8.48) ±0.10 (8.39, 8.59) ±0.10 (8.14, 8.34) ±0.10 (8.70, 8.90) ±0.10
CCCFC (7.66, 7.86) ±0.10 (8.03, 8.23) ±0.10 (8.21, 8.41) ±0.10 (8.45, 8.65) ±0.10
CNN-LSTM (7.99, 8.19) ±0.10 (8.17, 8.37) ±0.10 (8.33, 8.53) ±0.10 (8.56, 8.76) ±0.10
GRU (8.70, 8.90) ±0.10 (8.42, 8.62) ±0.10 (8.52, 8.72) ±0.10 (8.66, 8.86) ±0.10
LSTM (8.76, 8.96) ±0.10 (8.48, 8.68) ±0.10 (8.52, 8.72) ±0.10 (8.70, 8.90) ±0.10
ODE-LSTM (8.09, 8.29) ±0.10 (8.23, 8.43) ±0.10 (8.47, 8.67) ±0.10 (8.70, 8.90) ±0.10
PCNN (8.34, 8.54) ±0.10 (8.43, 8.63) ±0.10 (8.47, 8.67) ±0.10 (8.62, 8.82) ±0.10
TRCFC (7.97, 8.17) ±0.10 (8.04, 8.24) ±0.10 (8.17, 8.37) ±0.10 (8.37, 8.57) ±0.10

Table 7 presents the average RMSE of the different models for all stations. In all cases, our proposed models, CCCFC and TRCFC, demonstrate consistently strong performance. While each model shows some fluctuations across stations, their performance patterns remain generally consistent. Table 8 shows the RMSE results for the Banqiao station. From Table 8, it is clearly observed that our proposed CCCFC and TRCFC models perform well compared with the traditional baseline models, and that the RMSE value increases as the number of forecast hours increases (Table 9).

RMSE value t-test

The t-statistic represents the magnitude and direction of the difference, while the p-value indicates statistical significance: if p < 0.05, the difference is statistically significant; if p ≥ 0.05, it is not. The statistical comparison between the CCCFC model and several baseline models (BiLSTM, LSTM, GRU, ODE-LSTM, PCNN, and CNN-LSTM) was conducted using paired t-tests to determine whether the differences in performance are statistically significant. The results show that the t-statistics for all comparisons are positive and relatively high, indicating that our CCCFC model consistently outperforms the baselines on the evaluated metric, RMSE. More importantly, the p-values for all tests are well below the 0.05 significance threshold, confirming that the performance improvements of CCCFC are statistically significant. For instance, the comparison with LSTM yields a t-statistic of 5.6395 and a p-value of 0.0003, strongly suggesting that CCCFC’s enhancements are not due to random chance. Even the least significant result, CCCFC vs CNN-LSTM (p = 0.0264), still shows a statistically significant advantage. These findings validate the effectiveness of the CCCFC architecture in capturing complex temporal patterns compared to traditional and hybrid sequence modeling approaches. From Table 10, the comparison with TRCFC is not statistically significant (p > 0.05), indicating similar performance. LSTM and GRU show the most significant differences compared to CCCFC, suggesting CCCFC performs better.

Table 10.

Paired t-test results between CCCFC and other models.

Comparison t-statistic p-value Significance (p < 0.05)
CCCFC vs BILSTM 4.1193 0.0026 Yes
CCCFC vs LSTM 5.6395 0.0003 Yes
CCCFC vs GRU 5.4906 0.0004 Yes
CCCFC vs ODE-LSTM 4.0934 0.0027 Yes
CCCFC vs PCNN 3.3484 0.0085 Yes
CCCFC vs CNN-LSTM 2.654 0.026373 Yes
CCCFC vs TRCFC 2.1035 0.0648 No

“Yes” indicates statistical significance (p < 0.05).
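The paired t-tests above can be reproduced from the per-horizon average RMSE values in Table 7 using SciPy; the CCCFC-vs-LSTM comparison below recovers a t-statistic close to the reported 5.6395.

```python
from scipy import stats

# Average RMSE per forecasting horizon (1h-72h) from Table 7.
cccfc = [4.56, 6.25, 6.87, 7.34, 7.55, 7.73, 7.76, 8.13, 8.31, 8.55]
lstm  = [4.67, 7.12, 7.79, 8.03, 8.20, 8.90, 8.86, 8.58, 8.62, 8.80]

t_stat, p_value = stats.ttest_rel(lstm, cccfc)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")   # significant at p < 0.05
```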

The MAE value result evaluation

The average MAE results presented in Table 11 demonstrate a consistent trend across all models: as the forecasting horizon extends from 1 hour to 72 hours, the MAE values increase. This indicates a natural decline in prediction accuracy over longer time intervals. Figure 10 further supports this observation, showing a clear upward trend in MAE as the number of forecasting hours increases. However, the CCCFC model’s performance for the 72-hour prediction is found to be better than that of our baseline models.

Fig. 10. Average MAE performance of different models in all stations.

Among the evaluated models, ODE-LSTM (dark blue line) and CCCFC (green line) exhibit relatively stable and lower MAE values across various forecasting horizons. Traditional models such as BiLSTM, LSTM, and GRU (represented by blue, orange, and gray lines, respectively) follow a similar pattern, with steadily increasing MAE values over time. PCNN (yellow line) also displays a comparable upward trend.

While CNN-LSTM performs better than GRU, LSTM, BiLSTM, and PCNN, the proposed TRCFC and CCCFC models consistently outperform the baseline models, especially at shorter forecasting horizons. However, they exhibit slightly more variation as the forecasting window expands. Figure 11a,b, which represent results for the Banqiao and Xizhi stations respectively, confirm that CCCFC and TRCFC achieve superior performance in terms of MAE when compared to traditional baseline models, including LSTM, BiLSTM, ODE-LSTM, PCNN, and CNN-LSTM.

Fig. 11. The MAE performance comparison in different stations.

The MAPE and R² value result evaluation

Table 12 shows that the proposed models, CCCFC and TRCFC, achieve notably lower MAPE percentages compared to the baseline approaches (Table 13). Among all models, CCCFC delivers the best performance for the 72-hour prediction horizon, recording the lowest MAPE value. As illustrated in Table 14, MAPE values generally increase as the forecasting window extends. This table presents results for models including BILSTM, LSTM, GRU, CNN-LSTM, PCNN, and ODE-LSTM. Both CCCFC and TRCFC consistently outperform these methods. Additionally, the CCCFC model demonstrates the lowest MAPE overall, highlighting its strong predictive accuracy for extended forecast periods. As shown in Table 13, the R² values tend to decrease as the prediction horizon increases, indicating reduced correlation between predicted and actual values over longer timeframes. Despite this trend, the proposed CCCFC model consistently outperforms traditional models such as LSTM, GRU, BILSTM, ODE-LSTM, CNN-LSTM, and PCNN, particularly in the 72-hour prediction. It is evident that CCCFC maintains strong performance across all time horizons–1h, 24h, 48h, and 72h–achieving lower R² values compared to the baseline models. This demonstrates the robustness and effectiveness of our proposed approach in both short- and long-term forecasting scenarios.

Table 12.

MAPE values at Banqiao Station (1h, 24h, 48h, 72h).

Model 1h 24h 48h 72h
BILSTM 29.08 47.89 72.07 79.94
LSTM 30.22 60.76 70.78 79.51
GRU 30.53 60.04 72.56 78.68
ODE-LSTM 25.50 71.10 79.23 80.86
PCNN 25.88 56.84 69.75 78.04
CNN-LSTM 23.39 70.09 69.89 97.05
CCCFC 27.30 65.27 71.68 73.83
TRCFC 39.69 58.27 62.76 74.86

Table 13.

R2 values at Banqiao Stations (1h, 24h, 48h, 72h).

Model 1h 24h 48h 72h
BILSTM 0.75 0.43 0.24 0.22
LSTM 0.74 0.26 0.25 0.22
GRU 0.75 0.39 0.22 0.21
ODE-LSTM 0.80 0.42 0.25 0.17
PCNN 0.81 0.31 0.24 0.18
CNN-LSTM 0.82 0.28 0.22 0.18
CCCFC 0.72 0.18 0.10 0.11
TRCFC 0.67 0.25 0.13 0.08

Table 14.

Model performance over different forecasting for MAPE value.

Model 1h 8h 16h 24h 32h 40h 48h 56h 64h 72h
BILSTM 29.08 41.40 48.76 47.89 56.68 61.21 72.07 74.49 75.95 79.94
LSTM 30.22 52.24 62.64 60.76 66.28 67.60 70.78 75.15 75.62 79.51
GRU 30.53 50.68 64.24 60.04 66.97 70.70 72.56 73.96 75.22 78.68
ODE-LSTM 25.50 48.30 61.76 71.10 73.60 73.95 79.23 81.34 81.10 80.86
PCNN 25.88 44.75 56.31 56.84 63.57 69.42 69.75 75.33 77.27 78.04
CNN-LSTM 23.39 44.40 54.29 70.09 74.28 65.86 69.89 86.14 87.56 97.05
CCCFC 27.30 51.73 63.55 65.27 70.32 71.19 71.68 72.91 73.83 73.83
TRCFC 39.69 52.71 51.70 58.27 56.61 59.72 62.76 68.55 73.86 74.86

Loss curve

Figure 12 illustrates the training and validation loss curves for all models, including GRU, LSTM, BILSTM, ODE-LSTM, PCNN, and CNN-LSTM, shown in Fig. 12a–e. The results indicate that all models were trained effectively. Among them, the proposed TRCFC and CCCFC models exhibit the most favorable performance in terms of both training and validation loss, as further detailed in Fig. 12g,h. While models like PCNN, ODE-LSTM, and TRCFC display slight fluctuations in their validation curves, the CCCFC model consistently outperforms the others, demonstrating superior generalization and stability.

Fig. 12. The training loss and validation loss curve.

Deployment feasibility

For real-time air quality forecasting applications, practical aspects such as hardware requirements, computational costs, and inference latency must be carefully considered. Most of the deep learning models evaluated in this study, including LSTM, GRU, CNN-based, and ODE-based architectures, can be trained effectively on a single modern GPU (e.g., NVIDIA RTX 3080 or A100), with training times ranging from several minutes to a few hours depending on model complexity and input sequence length. Lightweight models such as vanilla LSTM and GRU offer fast training and low inference latency, making them well-suited for edge deployment or integration with mobile and IoT devices. Hybrid architectures like CNN-LSTM and ODE-LSTM provide improved long-range forecasting accuracy but typically incur higher computational costs. Specifically, ODE-based models introduce numerical solvers into the learning process, which increases training time and memory usage due to the need for fine-grained temporal integration steps. However, once trained, these models can maintain competitive inference speeds, typically within hundreds of milliseconds per sample, making them viable for near-real-time applications with proper system optimization.
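As a simple way to check the per-sample inference latency figures discussed above, the snippet below times repeated single-window predictions for a small stand-in Keras model; the architecture and the number of repetitions are illustrative, and actual numbers depend on hardware.

```python
import time
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(72, 10)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),
])
sample = np.random.rand(1, 72, 10).astype("float32")

model.predict(sample, verbose=0)                      # warm-up call
start = time.perf_counter()
for _ in range(100):
    model.predict(sample, verbose=0)
elapsed_ms = (time.perf_counter() - start) / 100 * 1000
print(f"mean inference latency: {elapsed_ms:.1f} ms per sample")
```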

Limitations

Model performance often degrades over longer forecasting horizons due to several interrelated factors. First, as the prediction window increases (e.g., from 1 hour to 72 hours), models rely more heavily on their own previous outputs, which may contain small errors that accumulate and compound over time, leading to significant deviations from the true values. Second, while short-term PM2.5 levels are typically governed by strong temporal correlations, longer horizons are more influenced by unpredictable external factors such as abrupt weather changes, traffic conditions, industrial emissions, or sudden atmospheric disturbances, which introduce additional noise and reduce the signal-to-noise ratio. Third, many models, such as vanilla LSTM or GRU, may not have sufficient capacity or architectural complexity to capture long-range temporal dependencies or multi-scale patterns, especially in the presence of nonlinear or chaotic environmental processes. In contrast, advanced architectures like ODE-LSTM or CNN-LSTM attempt to mitigate this by modeling physical or spatial dynamics, though they still face challenges when external influences dominate. Moreover, datasets often contain fewer clean and labeled samples for longer-term forecasting, reducing the model’s ability to generalize. Feature-target relationships may also shift over time, resulting in feature drift, where input features become less predictive at distant horizons. Finally, PM2.5 levels are inherently volatile due to their dependence on a wide range of partially observed or unmeasured environmental variables, making long-term predictions increasingly uncertain. These factors together contribute to the progressive performance degradation observed in most models as the forecasting horizon increases.

Despite the promising results, this study has several limitations. Notably, the CfC (Closed-form Continuous-time) units used in the model may still encounter challenges related to gradient instability, particularly in deeper architectures or long sequence modeling. Additionally, the current implementation is limited in its evaluation scope and can be further extended to accommodate both univariate and multivariate time series data for broader applicability.

Discussion of the impact of the enhanced CCCFC model performance

The predictive analytics capabilities of the CCCFC model allow authorities to anticipate pollution spikes and act before air quality deteriorates significantly, shifting environmental management from a reactive to a proactive approach. Such a strategy enables targeted interventions, for example issuing early public health advisories, adjusting traffic flows, or temporarily restricting industrial emissions, to mitigate the risk of hazardous air conditions and protect public health. The improved accuracy of the CCCFC model also sharpens estimates of population exposure to air pollutants, which is crucial for epidemiological studies of the relationship between air pollution and health outcomes. More accurate exposure assessments make it easier to identify at-risk populations and to estimate pollution-related health burdens. By preventing pollution spikes and guiding effective interventions, predictive models help reduce pollution-related illnesses and mortality while also lowering the associated healthcare costs.

Conclusion and future directions

In this work, hourly PM2.5 concentrations were predicted for the Taipei area of Taiwan, which is covered by 18 EPA monitoring stations, and we believe the proposed models can be applied to other areas and to other time series datasets. The models' performance can help policymakers regulate traffic where high PM2.5 levels are expected and issue warnings for the areas with the highest PM2.5 values. We compared the CCCFC and TRCFC models against standard baselines, as CfC models had not previously been applied to hourly pollution prediction. The CCCFC model outperforms the other models: across the 8-hour to 72-hour prediction intervals, the percentage error improvements range over 13.92–2.92%, 7.28–2.92%, and 0.16–2.92% relative to the GRU, LSTM, and BiLSTM models, respectively. Compared with PCNN, ODE-LSTM, and TRCFC over the same 8-hour to 72-hour range, the improvements vary from 13.12–2.92%, 13.44–1.99%, and 3.36% to −0.94%, respectively (a sketch of how such relative improvements can be computed from error metrics follows below). In the future, CfC variants with different memory mechanisms, such as CfC-mmRNN and CfC-nogate, may perform even better, and other forms of ODE-based neural networks might prove more effective than closed-form solutions.

To further advance environmental forecasting, future research should prioritize enhancing the adaptability, scalability, and applicability of deep learning models across a wider spectrum of pollutants and environmental contexts. Beyond PM2.5, these models can be extended to predict other critical pollutants such as ozone (O3), nitrogen dioxide (NO2), carbon monoxide (CO), and even emerging threats like microplastics. Applying them in diverse geographical regions, including urban, rural, coastal, and industrial areas, will help assess their robustness under varying environmental conditions, especially when combined with domain adaptation techniques. Incorporating real-time data streams from IoT sensor networks, mobile air quality monitors, and satellite imagery can significantly improve responsiveness to dynamic environmental events such as wildfires, dust storms, and industrial incidents. Techniques such as transfer learning and self-supervised learning offer promising avenues for improving performance in data-scarce regions by leveraging knowledge from data-rich settings. Hybrid modeling strategies that combine physics-based environmental models with deep learning may further enhance both prediction accuracy and interpretability. Expanding these frameworks to forecast other phenomena, such as algal blooms, water pollution, or vector-borne disease outbreaks, could amplify their societal impact. Finally, incorporating uncertainty quantification and explainable AI techniques will be essential for building trust and transparency in model predictions, enabling more actionable insights for policymakers, public health officials, and environmental stakeholders. Realizing these goals will require interdisciplinary collaboration across data science, environmental science, and public policy to translate technological advances into effective, real-world solutions.
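For reference, the percentage improvements quoted above can be reproduced from error metrics in a single expression. The sketch below assumes the improvement is defined as the relative RMSE reduction of the proposed model over a baseline; the numerical values in the example are illustrative, not results from this study.

def improvement_pct(rmse_baseline, rmse_model):
    """Relative RMSE reduction of the proposed model over a baseline, in percent."""
    return (rmse_baseline - rmse_model) / rmse_baseline * 100.0

# Illustrative RMSE values only (not results from this study).
print(f"{improvement_pct(12.40, 11.10):.2f}%")  # -> 10.48%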

The proposed CCCFC and TRCFC models demonstrate superior accuracy in predicting hourly PM2.5 concentrations compared to traditional methods. This improvement provides a reliable foundation for air quality management strategies, contributes to safeguarding public health, and may also save time and costs.

Acknowledgements

We thank Shafiq Khan for discussions on the CfC model and its performance. We also thank the Central Weather Bureau (CWB) website (opendata.cwb.gov.tw), from which the hourly data were collected.

Author contributions

Md Khalid Hossen contributed to text writing, data collection, architecture design, model performance analysis, and model evaluation, and designed Figs. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and 11 and Tables 1, 2, 3, 4, 5, 6 and 7. Yan-Tsung Peng helped with project supervision and guidance. Meng Chang Chen supervised, guided, wrote, and edited the manuscript, and Asher Shao helped with the time series data and model performance issues.

Data availability

Data can be downloaded from the following link: https://www.kaggle.com/datasets/khalid27/taipei

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-05958-2.

References

1. Andrade, M. et al. Air pollution and health - a science-policy initiative. Ann. Glob. Health 85(1), 140 (2019).
2. Wang, B. et al. Estimate hourly PM2.5 concentrations from Himawari-8 TOA reflectance directly using geo-intelligent long short-term memory network. Environ. Pollut. 271, 116327 (2021).
3. Xing, Y.-F., Xu, Y.-H., Shi, M.-H. & Lian, Y.-X. The impact of PM2.5 on the human respiratory system. J. Thorac. Dis. 8, E69 (2016).
4. Yang, M.-C., Wong, G.-W. & Chen, M. C. Sparse grid imputation using unpaired imprecise auxiliary data: Theory and application to PM2.5 estimation. ACM Trans. Knowl. Discov. Data 10.1145/3634751 (2024).
5. Yang, H.-C., Yang, M.-C., Wong, G.-W. & Chen, M. C. Extreme event discovery with self-attention for PM2.5 anomaly prediction. IEEE Intell. Syst. 38, 36–45. 10.1109/MIS.2023.3236561 (2023).
6. Hossen, M. K., Peng, Y.-T. & Chen, M. C. Enhancing PM2.5 prediction by mitigating annual data drift using wrapped loss and neural networks. PLoS ONE 20, 0314327 (2025).
7. Shang, Z., Deng, T., He, J. & Duan, X. A novel model for hourly PM2.5 concentration prediction based on CART and EELM. Sci. Total Environ. 651, 3043–3052 (2019).
8. Ma, J. et al. A Lag-FLSTM deep learning network based on Bayesian optimization for multi-sequential-variant PM2.5 prediction. Sustain. Cities Soc. 60, 102237 (2020).
9. Jia, P.-T., He, H.-C., Liu, L. & Sun, T. Overview of time series data mining. Jisuanji Yingyong Yanjiu/Appl. Res. Comput. 24, 15–18 (2007).
10. Pak, U. et al. Deep learning-based PM2.5 prediction considering the spatiotemporal correlations: A case study of Beijing, China. Sci. Total Environ. 699, 133561 (2020).
11. Box, G. E., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. Time Series Analysis: Forecasting and Control (John Wiley & Sons, Hoboken, 2015).
12. Lu, D., Mao, W., Xiao, W. & Zhang, L. Non-linear response of PM2.5 pollution to land use change in China. Remote Sens. 13, 1612 (2021).
13. Bekkar, A., Hssina, B., Douzi, S. & Douzi, K. Air-pollution prediction in smart city, deep learning approach. J. Big Data 8, 1–21 (2021).
14. Suleiman, A., Tight, M. & Quinn, A. Applying machine learning methods in managing urban concentrations of traffic-related particulate matter (PM10 and PM2.5). Atmos. Pollut. Res. 10, 134–144 (2019).
15. Sharma, D. K., Varshney, R. P., Agarwal, S., Alhussan, A. A. & Abdallah, H. A. Developing a multivariate time series forecasting framework based on stacked autoencoders and multi-phase feature. Heliyon 10, e27860 (2024).
16. Tian, Y. et al. Integrating spatial-temporal features into prediction tasks: A novel method for identifying the potential water pollution area in large river basins. J. Environ. Manage. 373, 123522 (2025).
17. Varshney, R. P. & Sharma, D. K. Optimizing time-series forecasting using stacked deep learning framework with enhanced adaptive moment estimation and error correction. Expert Syst. Appl. 249, 123487 (2024).
18. Kristiani, E. et al. PM2.5 forecasting model using a combination of deep learning and statistical feature selection. IEEE Access 9, 68573–68582 (2021).
19. Varshney, R. P. & Sharma, D. K. Enhancing stock market prediction through image encoding, pattern recognition, and ensemble learning with custom error correction techniques. Int. J. Comput. Vis. Robot. 14, 654–676 (2024).
20. Iftikhar, H., Qureshi, M., Zywiołek, J., López-Gonzales, J. L. & Albalawi, O. Short-term PM2.5 forecasting using a unique ensemble technique for proactive environmental management initiatives. Front. Environ. Sci. 12, 1442644 (2024).
21. Tong, W., Limperis, J., Hamza-Lup, F., Xu, Y. & Li, L. Robust transformer-based model for spatiotemporal PM2.5 prediction in California. Earth Sci. Inf. 17, 315–328 (2024).
22. Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems (2017).
23. Devlin, J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
24. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
25. Chen, L. et al. Decision transformer: Reinforcement learning via sequence modeling. Adv. Neural Inf. Process. Syst. 34, 15084–15097 (2021).
26. Waqas, A., Tripathi, A., Ramachandran, R. P., Stewart, P. & Rasool, G. Multimodal data integration for oncology in the era of deep neural networks: A review. arXiv preprint arXiv:2303.06471 (2023).
27. Inam, S. A., Iqbal, D., Hashim, H. & Khuhro, M. A. An empirical approach towards detection of tuberculosis using deep convolutional neural network. Int. J. Data Min. Model. Manag. 16, 101–112 (2024).
28. Kibirige, G., Huang, C. C., Liu, C. L. & Chen, M. C. Influence of land-sea breeze on PM prediction in central and southern Taiwan using composite neural network. Sci. Rep. 13(1), 3827 (2023).
29. Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I. & Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367 (2018).
30. Hasani, R., Lechner, M., Amini, A., Rus, D. & Grosu, R. Liquid time-constant networks. In Proceedings of the AAAI Conference on Artificial Intelligence 35, 7657–7666 (2021).
31. Dupont, E., Doucet, A. & Teh, Y. W. Augmented neural ODEs. Advances in Neural Information Processing Systems 32 (2019).
32. Yang, G. et al. PointFlow: 3D point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4541–4550 (2019).
33. Rubanova, Y., Chen, R. T. & Duvenaud, D. K. Latent ordinary differential equations for irregularly-sampled time series. Advances in Neural Information Processing Systems 32 (2019).
34. Hasani, R. et al. Closed-form continuous-time neural networks. Nat. Mach. Intell. 4, 992–1003 (2022).
35. Zhang, H., Wang, Z. & Liu, D. A comprehensive review of stability analysis of continuous-time recurrent neural networks. IEEE Trans. Neural Netw. Learn. Syst. 25, 1229–1262 (2014).
36. Weinan, E. A proposal on machine learning via dynamical systems. Commun. Math. Stat. 1, 1–11 (2017).
37. Lu, Z., Pu, H., Wang, F., Hu, Z. & Wang, L. The expressive power of neural networks: A view from the width. Advances in Neural Information Processing Systems 30 (2017).
38. Li, Q., Chen, L., Tai, C. & Weinan, E. Maximum principle based algorithms for deep learning. J. Mach. Learn. Res. 18, 1–29 (2018).
39. Chen, R. T., Rubanova, Y., Bettencourt, J. & Duvenaud, D. K. Neural ordinary differential equations. Advances in Neural Information Processing Systems 31 (2018).
40. Lechner, M. & Hasani, R. Learning long-term dependencies in irregularly-sampled time series. arXiv preprint arXiv:2006.04418 (2020).
41. Massaroli, S. et al. Stable neural flows. arXiv preprint arXiv:2003.08063 (2020).
42. Erichson, N. B., Azencot, O., Queiruga, A., Hodgkinson, L. & Mahoney, M. W. Lipschitz recurrent neural networks. arXiv preprint arXiv:2006.12070 (2020).
43. Li, X., Wong, T.-K. L., Chen, R. T. & Duvenaud, D. Scalable gradients for stochastic differential equations. In International Conference on Artificial Intelligence and Statistics, 3870–3882 (PMLR, 2020).
44. Gleeson, P., Lung, D., Grosu, R., Hasani, R. & Larson, S. D. c302: A multiscale framework for modelling the nervous system of Caenorhabditis elegans. Philos. Trans. R. Soc. B Biol. Sci. 373, 20170379 (2018).
45. Kim, J. & Moon, N. BiLSTM model based on multivariate time series data in multiple field for forecasting trading area. J. Ambient Intell. Humaniz. Comput. 10.1007/s12652-019-01398-9 (2019).
46. Zhai, N., Yao, P. & Zhou, X. Multivariate time series forecast in industrial process based on XGBoost and GRU. In 2020 IEEE 9th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), vol. 9, 1397–1400 (IEEE, 2020).
47. Yamak, P. T., Yujian, L. & Gadosey, P. K. A comparison between ARIMA, LSTM, and GRU for time series forecasting. In Proceedings of the 2019 2nd International Conference on Algorithms, Computing and Artificial Intelligence, 49–55 (2019).
48. Inam, S. A. et al. PR-FCNN: A data-driven hybrid approach for predicting PM2.5 concentration. Discov. Artif. Intell. 4, 75 (2024).
49. Inam, S. A. et al. A novel deep learning approach for investigating liquid fuel injection in combustion system. Discov. Artif. Intell. 5, 32 (2025).
50. Zhu, M. & Xie, J. Investigation of nearby monitoring station for hourly PM2.5 forecasting using parallel multi-input 1D-CNN-biLSTM. Expert Syst. Appl. 211, 118707 (2023).
51. Coelho, C., Costa, M. & Ferrás, L. L. Enhancing continuous time series modelling with a latent ODE-LSTM approach. Appl. Math. Comput. 475, 128727 (2024).
