Abstract
Accurate and interpretable prediction of the Air Quality Index (AQI) is critical for public health decision-making and environmental policy enforcement. This study presents a hybrid forecasting framework that combines the strengths of Random Forest Regression (RFR) and Autoregressive Integrated Moving Average (ARIMA) models to improve AQI prediction accuracy while maintaining model transparency. The RFR captures nonlinear relationships among pollutants, while ARIMA is used to model the temporal patterns in RFR residuals, forming a two-stage learning architecture. The model is trained and evaluated on multi-year AQI data from India and validated using an expanding window cross-validation strategy to maintain temporal integrity. To ensure transparency and interpretability, the study employs SHAP (SHapley Additive Explanations) to uncover the influence of key pollutants such as PM₂.₅, NO₂, and SO₂. Additionally, Ljung-Box diagnostics and uncertainty bands are used to validate model adequacy. Compared to baseline models, the hybrid approach achieves a lower Mean Squared Error (MSE = 508.46) and a higher R² score (0.94), confirming improved generalization. This research contributes a replicable, explainable, and efficient AQI forecasting framework suited for deployment in resource-constrained urban environments. The method comprises:
Residual learning hybrid model: Random Forest for prediction + ARIMA for residual correction
Time-aware validation using expanding window cross-validation
Model interpretability through SHAP analysis
Keywords: Air quality index (AQI), Hybrid machine learning, Random forest regressor (RFR), Autoregressive integrated moving average (ARIMA), Explainable AI (XAI), SHapley additive explanations (SHAP)
Graphical abstract
Specifications table
| Subject area | Computer Science |
|---|---|
| More specific subject area | Air Quality Index forecasting |
| Name of your method | Hybrid Random Forest Regressor with ARIMA residual correction and model explainability |
| Name and reference of original method | None |
| Resource availability | The dataset used can be found here: Dataset. Any other details can be provided on request. |
Background
Air pollution has emerged as one of the most pressing global environmental challenges, primarily driven by accelerated industrialization, urban expansion, and increased anthropogenic activity. The adverse effects of air pollution are multifaceted, impacting public health, environmental ecosystems, and contributing to global climate change. Recognizing its urgency, the United Nations has included air quality as a priority area under its Sustainable Development Goals (SDGs), notably SDG 3 (Good Health and Well-being), SDG 11 (Sustainable Cities and Communities), and SDG 13 (Climate Action) [1]. Among various tools used to monitor pollution levels, the Air Quality Index (AQI) remains the most widely adopted standard for quantifying ambient air pollution. It aggregates concentrations of key pollutants—PM2.5, PM10, NO₂, SO₂, CO, and O₃—into a single metric that reflects overall air quality severity [2]. Long-term exposure to these pollutants is strongly associated with respiratory diseases, cardiovascular complications, and premature mortality [3,4].
Accurate AQI prediction remains a computationally complex task due to the spatiotemporal variability and nonlinear dynamics of air pollutants [5]. Traditional statistical models, such as AR and ARIMA, are suitable for modeling temporal trends but assume linearity and stationarity, limiting their ability to capture nonlinear environmental interactions [6]. To address these limitations, machine learning (ML) techniques like Random Forest Regression (RFR), Support Vector Machines (SVM), and ensemble learners have gained traction due to their flexibility in learning from large, unstructured datasets [7,8]. These models typically focus on prediction performance but lack transparency in decision-making, which has led to the growing interest in hybrid approaches.
Recent studies have explored the fusion of ML and time series models to enhance forecasting precision. For instance, hybrid deep learning models combining CNN-LSTM or GRU architectures have shown improved accuracy [9,10], while XGBoost with feature engineering has also demonstrated notable results [11]. However, these models often lack interpretability, which is crucial for environmental and policy applications. Moreover, explainable AI (XAI) is emerging as a critical component of trustworthy ML, particularly in domains like public health and environmental science. Techniques such as SHAP and LIME help in demystifying model decisions, which is essential for user confidence, policy adoption, and regulatory compliance [[12], [13], [14]].
In contrast to prior works, this study aims to develop a robust, accurate, and interpretable AQI prediction model by combining Random Forest Regressor (RFR) with ARIMA. While our approach builds on established methods - Random Forest and ARIMA, the novelty lies in their purposeful integration, enhanced with explainability to meet the real-world demands of air quality management. Unlike black-box deep learning models that offer predictive power at the expense of transparency, our hybrid framework strikes a balance between accuracy and interpretability. Although Random Forest and ARIMA originate from different modeling paradigms, their design choice is intentional, and the combination is theoretically sound when employed in a residual learning framework. The RF model first captures the complex nonlinear interactions between pollutant features and AQI levels. However, due to its lack of sequential memory, RF may leave temporally correlated errors unmodeled. The ARIMA model is then applied to these residuals to model their autocorrelation, effectively correcting temporal bias. This two-stage approach is not a naive ensemble, but a layered system where each model contributes a distinct and non-redundant strength. Cross-validation and residual diagnostics confirm that this architecture improves forecasting reliability without introducing overfitting, offering a balance between performance and interpretability that is well-suited for environmental forecasting applications.
The remainder of the paper is organized as follows: the Method details section reviews related work on AQI forecasting and hybrid modeling and explains the methodology, including data preprocessing, model architecture, and explainability techniques. The Method validation section presents experimental results and performance comparisons, followed by method limitations and future directions in the Limitations section.
Method details
Various techniques leveraged by researchers in recent years, with notable results and performance, are captured in Table 1. Although there has been notable advancement in AQI forecasting with individual statistical and machine learning models, a research gap remains in the limited evaluation of hybrid models that leverage the strengths of both paradigms. Existing studies have mainly used autoregressive time series models (e.g., ARIMA) or machine learning models (e.g., Random Forest, SVR, LSTM) separately. However, standalone models of these kinds are often insufficient to simultaneously capture the nonlinear interactions between pollutants and the temporal relationships in AQI data. This study bridges this gap by proposing a new hybrid model that integrates Random Forest Regression (RFR) with ARIMA, where the residuals of the RFR forecast are modeled using ARIMA. This dual-model approach allows for more accurate and robust AQI forecasting by leveraging the strengths of both statistical and machine learning techniques. Additionally, while recent research has begun incorporating explainability through methods such as SHAP and LIME, these are primarily applied to single-model architectures like XGBoost or LightGBM. There remains a significant lack of interpretable ensemble or hybrid models in the context of AQI forecasting. Our research not only introduces a hybrid predictive framework but also sets the foundation for integrating the SHAP explainable AI (XAI) technique, thereby contributing to transparent, reliable, and actionable environmental intelligence systems. The key components and steps used in this research are:
Table 1.
Literature survey results.
| Ref | Methodology / Model | Dataset / Location | Key Findings / Performance |
|---|---|---|---|
| [16] | Random Forest Regression (RFR) | AQI and NOx data | High R² and correlation values; effective in pollutant correlation modeling. |
| [17] | Support Vector Regression (SVR) with RBF kernel | EPA data, California | SVR predicted AQI with 94.1 % accuracy across AQI categories. |
| [18] | ARIMA and Linear Regression | Chennai | Both models effective; ARIMA slightly better in time series forecasting. |
| [19] | ANN + Kriging | Mumbai, Navi Mumbai | ANN outperformed regression models; high R values. |
| [20] | Decision Tree, Linear Regression, SVR, RFR | Indian cities | RFR gave best results with R² = 0.99985, MSE = 0.00013. |
| [21] | MLR + AHP MCDM method | India | Hybrid model improved accuracy using weighted features. |
| [22] | Decision Tree, LR, RFR + EDA + SMOTER | Smart cities | Decision Tree outperformed others in MAE and R²; EDA boosted accuracy. |
| [23] | Grey Wolf Optimizer + Decision Tree | Kaggle (Delhi, Chennai, etc.) | Superior performance vs. standard ML; suggested DL for further improvements. |
| [24] | ANN | Ahvaz, Iran | ANN predicted AQI and AQHI with high temporal accuracy. |
| [25] | Ensemble Learning (Bagging & Boosting) + PCA | Lucknow | Ensemble models outperformed SVM; PCA identified key pollutant sources. |
| [26] | Nonlinear AR Neural Network (ARNN) with Exogenous Inputs | London | Accurate AQI forecast using weather + pollutant input; designed for short-term prediction. |
| [27] | SVM, ANN, Decision Trees | Delhi (CPCB data) | SVM performed best in AQI prediction for smart city applications. |
| [28] | K-Means, PFCM | India | Enhanced clustering method gave better AQI estimates; less computation time. |
| [29] | GA, MELM, LSSVM, MLP + PSO | Energy sector dataset | ML algorithms used in environmental variables, adaptable to AQI forecasting. |
| [30] | Hybrid LSTM-GRU | Delhi (PM2.5 AQI data) | MAE = 36.11; R² = 0.84 — hybrid model superior to standalone DL models. |
| [31] | B-WEMA vs WMA, EMA, BDES | Central Jakarta | B-WEMA achieved better AQI forecast accuracy than baseline smoothing techniques. |
| [32] | LSTM + ML Techniques | Dhaka | Single-variable (temperature) LSTM model predicted AQI with high precision. |
| [33] | XGBoost + SHAP (Explainable Boosting Model) | Beijing, China | Integrated SHAP to interpret feature influence; PM2.5 and NO₂ found to be most influential. |
| [34] | LightGBM + LIME/SHAP | 5 Chinese cities (multi-source data) | Achieved RMSE < 18; model interpretability enhanced using SHAP visualizations for stakeholders |
Dataset: The datasets used in this study are from Kaggle [15] and comprise:
• city_day.csv – Daily air quality measurements aggregated by city
• city_hour.csv – Hourly air quality measurements aggregated by city
• station_day.csv – Daily air quality readings from individual monitoring stations
• station_hour.csv – Hourly air quality readings from individual monitoring stations
• stations.csv – Metadata about monitoring stations (e.g., location, city, station name)
The measurement files include pollutant concentrations and the corresponding AQI values.
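As an illustration of how the station-level files relate to one another, the sketch below joins station readings with station metadata using pandas. The column names follow the Kaggle dataset's schema, while the values are synthetic stand-ins for the real CSV contents:

```python
import pandas as pd

# Synthetic stand-ins for station_day.csv and stations.csv (same schema as the Kaggle files).
station_day = pd.DataFrame({
    "StationId": ["AP001", "AP001", "DL001"],
    "Date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-01"]),
    "AQI": [184.0, 197.0, 402.0],
})
stations = pd.DataFrame({
    "StationId": ["AP001", "DL001"],
    "City": ["Amaravati", "Delhi"],
})

# In practice: pd.read_csv("station_day.csv", parse_dates=["Date"]), etc.
# Left-join readings with station metadata so each record carries its city.
merged = station_day.merge(stations, on="StationId", how="left")
print(merged[["City", "Date", "AQI"]])
```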
Data Pre-processing: The acquired datasets are pre-processed to handle missing values and inconsistencies. This includes data cleaning, normalization, and feature engineering to ensure the quality and suitability of the data for model training.
Model Selection: Explores models suitable for time series forecasting, namely the autoregressive integrated moving average (ARIMA), Random Forest Regression (RFR), and a hybrid RFR+ARIMA.
Explainable AI (XAI): To enhance transparency and support accountability in AQI forecasting, we employed an Explainable AI (XAI) method, specifically SHAP (SHapley Additive exPlanations), to unpack the internal behavior of the hybrid RFR+ARIMA model. The tool demonstrates that the hybrid model is not only accurate but also transparent and interpretable—attributes essential for environmental policy alignment and public health advocacy.
Model Training: The chosen model is trained on the pre-processed data with the selected architecture and the best hyperparameters to deliver the best performance. Methods such as cross-validation and grid search are employed for model tuning and to avoid overfitting.
Model Evaluation: Evaluated the model performance on unseen test data to gauge its ability to predict AQI values across different time periods and locations. Accuracy was estimated using metrics such as mean absolute error (MAE), root mean square error (RMSE), and correlation coefficients.
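A minimal sketch of these evaluation metrics with scikit-learn; the observed and predicted AQI arrays below are illustrative stand-ins, not the study's results:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative stand-ins for observed and forecast AQI values.
y_true = np.array([120.0, 150.0, 180.0, 210.0])
y_pred = np.array([115.0, 160.0, 175.0, 205.0])

mae = mean_absolute_error(y_true, y_pred)            # mean absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root mean square error
r2 = r2_score(y_true, y_pred)                        # coefficient of determination
corr = np.corrcoef(y_true, y_pred)[0, 1]             # Pearson correlation coefficient
```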
The procedure is explained in full below.
Machine learning models
Two popular algorithms in Machine Learning that are utilized in predicting Air Quality Index (AQI) values effectively are the Random Forest Regressor and ARIMA.
Random Forest Regressor: For regression problems, the Random Forest Regressor is an ensemble learning algorithm that builds numerous decision trees during training and outputs the average of their predictions. It is resistant to overfitting and handles large datasets with high-dimensional features. In AQI prediction, it is effective at identifying complex, nonlinear interactions among pollutants and between pollutant concentrations and AQI values, and it provides feature-importance estimates.
Autoregressive Integrated Moving Average (ARIMA): ARIMA models are time-series models that combine autoregressive, differencing, and moving average components. They are well suited to representing temporal dependencies and patterns in sequential data, and are therefore appropriate for forecasting applications where past trends hold importance. In AQI forecasting, ARIMA models can efficiently represent the temporal dependencies and seasonality of air quality data, e.g., daily and seasonal patterns of pollutant concentrations, and can forecast future air quality from past AQI data.
By combining the strengths of both models, the hybrid model enhances the overall prediction performance and offers a robust and reliable prediction system for the prediction of AQI values, thus enabling environmental monitoring and public health management activities.
Hybrid model: RFR + ARIMA
To further enhance the predictive ability of the proposed framework, we adopt a hybrid modeling approach in which an ARIMA model is fitted to the residuals of the Random Forest Regressor (RFR). This hybridization combines the nonlinear modeling strength of RFR with the temporal dependency modeling ability of ARIMA.
The RFR model was initially trained on pollutant concentrations and temporal attributes to obtain baseline AQI predictions. We then computed the residuals on the training set as the difference between the observed and predicted AQI values. These residuals, reflecting the temporal patterns not captured by RFR, were used to train an ARIMA model.
The ARIMA model (order = (p, 2, q)) was subsequently trained on these residuals and used to predict residual values in the test period. The final AQI predictions were subsequently obtained by adding these ARIMA-based residual predictions to the original RFR predictions for the test set:
\[ \widehat{AQI}_{\text{final},t} = \widehat{AQI}_{\text{RFR},t} + \hat{e}_{\text{ARIMA},t} \tag{1} \]
Explainable AI (XAI)
SHAP Analysis — SHAP (SHapley Additive Explanations) is a powerful Explainable AI (XAI) technique that quantifies the contribution of each feature in a machine learning model. In AQI prediction, SHAP helps determine how factors like PM2.5, PM10, NO₂, SO₂, CO, and O₃ influence the predicted air quality value. It helps policymakers to make data-driven decisions to optimize pollution mitigation strategies. The SHAP value for a feature xi is computed using the formula:
\[ \phi_i = \sum_{S \subseteq F \setminus \{x_i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{x_i\}) - f(S) \right] \tag{2} \]
where F is the set of all features, S is a subset of features excluding xi, and f(S) represents the model’s prediction based on the subset S. SHAP supports feature-importance analysis by identifying which features contribute most to air quality predictions. It determines the positive or negative impact of a feature, i.e., whether it increases or decreases the predicted AQI. It also reveals interaction effects by showing how multiple features combine to affect AQI outcomes.
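To make Eq. (2) concrete, the sketch below computes exact Shapley values for a toy two-feature model by enumerating all subsets. The prediction values are synthetic and chosen only to keep the arithmetic transparent; in practice, shap.TreeExplainer approximates this efficiently for tree ensembles:

```python
from itertools import combinations
from math import factorial

# Toy "model" f: maps a frozenset of included features to a prediction.
# Values are synthetic stand-ins, not real AQI model outputs.
f = {
    frozenset(): 100.0,                       # baseline prediction (no features)
    frozenset({"PM2.5"}): 160.0,
    frozenset({"NO2"}): 120.0,
    frozenset({"PM2.5", "NO2"}): 190.0,
}
features = ["PM2.5", "NO2"]

def shapley(feature):
    """Exact Shapley value of `feature` via the subset sum in Eq. (2)."""
    n = len(features)
    others = [x for x in features if x != feature]
    phi = 0.0
    for r in range(n):
        for subset in combinations(others, r):
            s = frozenset(subset)
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            phi += weight * (f[s | {feature}] - f[s])
    return phi

phi_pm25 = shapley("PM2.5")
phi_no2 = shapley("NO2")
# Efficiency property: the contributions sum to f(all) - f(empty).
```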
Data preparation and pre-processing
The dataset used here includes missing values, requiring a data cleaning phase. In this phase, the missing values are removed, and the cleaned data is prepared for experimentation. Nevertheless, it's important to highlight that the AQI bucket attribute encompasses six distinct categories. Consequently, following the removal of null values, the distribution of data among these categories may become uneven, particularly for the chosen major cities.
• Pre-processing for Random Forest Regressor: In this research, rigorous data pre-processing is essential to ensure the integrity and reliability of our analysis. To address missing values in the pollutant concentration dataset, a conservative two-step imputation strategy was adopted. First, a 5-day centered rolling mean was used to impute missing entries, preserving local temporal trends while reducing the risk of over-smoothing abrupt but genuine variations. This was followed by limited backfilling, applied only at the sequence edges to avoid temporal drift. To minimize bias, features exhibiting >40 % missingness were excluded from modeling altogether. The imputation procedure was carefully monitored by plotting pollutant time series before and after processing to ensure no distortion of underlying seasonal or daily patterns. Importantly, because the ARIMA layer in our hybrid model is responsible for capturing residual temporal dynamics, it implicitly corrects for minor inconsistencies arising from imputation. This approach ensures the model remains robust and that the integrity of time-sensitive patterns is preserved throughout the forecasting pipeline.
• Pre-processing for ARIMA: In addition to the pre-processing for the Random Forest Regressor, some further data processing steps were employed to make the data compatible with the ARIMA model. We further processed the air quality index (AQI) data from the original dataset for analysis. First, we created a pivot table in which AQI values were grouped by city and by date, allowing us to analyze the AQI trend over time for each city separately. Second, we resampled the data to monthly frequency using the 'MS' rule, computing the average AQI value for each city in each month. This resampling reduced the granularity of the data without losing temporal patterns.
Third, we calculated the average AQI value by the total country by taking the average of the values of all cities by each month. This aggregate value, termed 'India_AQI', is the representative air quality indicator at the national level. By this composite value of AQI, we understood the overall air quality condition in India during the considered time interval. This pre-processing was done in order to convert the individual city-wise values of AQI into one uniform and meaningful dataset to be used in further analysis and modeling.
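The imputation and aggregation steps above can be sketched as follows; the column names mirror the Kaggle schema, and the values are synthetic stand-ins:

```python
import numpy as np
import pandas as pd

# Synthetic daily AQI readings for two cities, with gaps to impute.
df = pd.DataFrame({
    "City": ["Delhi"] * 6 + ["Chennai"] * 6,
    "Date": pd.to_datetime(list(pd.date_range("2019-01-01", periods=6, freq="D")) * 2),
    "AQI": [300, np.nan, 320, 310, np.nan, 305, 90, 95, np.nan, 100, 98, np.nan],
})

# Step 1: 5-day centred rolling mean per city; Step 2: limited backfill at edges.
df["AQI"] = (df.groupby("City")["AQI"]
               .transform(lambda s: s.fillna(s.rolling(5, center=True, min_periods=1).mean()))
               .bfill(limit=1))

# Pivot by date x city, resample to month start ('MS'), then average across cities.
pivot = df.pivot_table(values="AQI", index="Date", columns="City")
monthly = pivot.resample("MS").mean()
india_aqi = monthly.mean(axis=1)   # the 'India_AQI' national composite
```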
• Data splitting and removal of stationarity for ARIMA: The stationarity of the 'India_AQI' time series was tested using the Augmented Dickey-Fuller (ADF) test, a statistical hypothesis test that determines whether a given time series is stationary. Stationarity is a basic assumption in time series analysis, referring to the statistical properties of the data remaining constant over time. The ADF test was applied to the 'India_AQI' series using the adfuller function from the statsmodels library. The parameter autolag='AIC' indicates that the number of lags used by the test is selected automatically based on the Akaike Information Criterion (AIC).
The test results are saved to the dftest variable, which contains the test statistic, the number of lags used, the p-value, and the number of observations, along with critical values for various confidence levels. A pandas Series was then created to sort and present the test results in an organized manner. The test statistic, number of lags employed, p-value, and the 1 %, 5 %, and 10 % critical values, as presented in Table 2, are part of this series.
Table 2.
ADF test statistics before differencing.
| Category | Values |
|---|---|
| Test Statistics | 1.712156 |
| p-value | 0.998158 |
| #Lags used | 11.000000 |
| Number of Observations used | 60.000000 |
| Critical Value (1 %) | −3.544369 |
| Critical Value (5 %) | −2.911073 |
| Critical Value (10 %) | −2.593190 |
Given the obtained p-value (0.998158), we fail to reject the null hypothesis of the ADF test, indicating that the 'India_AQI' time series is likely non-stationary, as seen in Fig. 1. This implies that further data transformations or modeling techniques are necessary to achieve stationarity before conducting time series analysis or forecasting.
Fig. 1.
Graph for data before differencing.
Differencing is performed on the ‘India_AQI’ time series to achieve stationarity. Differencing is a prevalent technique in time series analysis for eliminating trends or seasonality from the data, thereby rendering it stationary. The diff function was applied to the 'India_AQI' series with the parameter periods=2, computing the difference between each observation and the observation two time periods earlier; this removes any trend or seasonality occurring over a two-month period. Following differencing, the dropna operation was employed to eliminate the missing values (NaN) introduced by differencing. This step is essential because differencing produces NaN values for the first two observations, where the difference cannot be computed.
On conducting the ADF test again, the p-value came out to be 2.001116e-08, as shown in Table 3, implying that stationarity was achieved, as shown in Fig. 2. The data was then ready to be split into training and test sets and to fit the model. The split was made by date: the training data contains values from 2015 through the end of 2018, and the test data contains the remaining values (i.e., from 2019 to 2020).
• Time-series cross validation: To evaluate model generalizability over time while preserving the temporal structure of the data, we employed an expanding window cross-validation technique. Unlike conventional K-fold validation, which risks data leakage in time-series contexts, this approach ensures that only past data is used to predict the future, thereby better reflecting real-world forecasting conditions. The implementation involves:
Table 3.
ADF test statistics after differencing.
| Category | Values |
|---|---|
| Test Statistics | −6.400496e+00 |
| p-value | 2.001116e-08 |
| #Lags used | 9.000000e+00 |
| Number of Observations used | 6.000000e+01 |
| Critical Value (1 %) | −3.544369e+00 |
| Critical Value (5 %) | −2.911073e+00 |
| Critical Value (10 %) | −2.593190e+00 |
Fig. 2.
Graph for data after differencing.
Data Splitting: The AQI dataset was chronologically sorted. An initial window (e.g., the first 60 % of the data) was used for training, while the immediate next slice (e.g., 10 %) served as the test set.
Expansion Step: After each iteration, the training window was expanded to include the most recent test segment, and a new test segment followed.
Rolling Forward: This process was repeated until the end of the dataset was reached, simulating a real-time learning and forecasting cycle.
The pseudo-code for this technique is shown in Table 4 as Algorithm 1. The advantages of expanding window cross-validation are that it respects temporal causality (no future data is used for training), captures concept drift or changing patterns in air quality over time, and reduces overfitting risk by simulating out-of-sample testing at multiple time horizons.
Table 4.
Pseudo-code for time series cross validation using expanding window strategy.
| Algorithm 1 |
|---|
| Input: |
| D ← Time-series dataset sorted by time (features + AQI) |
| w_init ← Initial training window size (e.g., 60 % of D) |
| w_test ← Test window size (e.g., 10 % of D) |
| model ← Prediction model (e.g., RFR or Hybrid RFR+ARIMA) |
| Output: |
| CV_Metrics ← Mean MSE and R² across all folds |
| Procedure: |
| 1. Initialize: |
| t ← 0 |
| CV_Metrics ← empty list |
| 2. While (t+w_init + w_test ≤ length(D)): |
| a. Train_Set ← D[t: t+w_init] |
| b. Test_Set ← D[t+w_init: t+w_init + w_test] |
| c. Fit model on Train_Set |
| d. Predict AQI on Test_Set |
| e. Compute MSE_t and R²_t |
| f. Append (MSE_t, R²_t) to CV_Metrics |
| g. Expand training window (the start index t remains 0): |
| w_init ← w_init + w_test |
| 3. Return average(MSE), average(R²) from CV_Metrics |
In each fold, the MSE and R² values were computed and averaged to assess model consistency. This strategy was applied uniformly to both the baseline RFR and the hybrid RFR+ARIMA models, ensuring a fair and robust comparison.
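The expanding-window procedure of Algorithm 1 can be sketched as follows; the data are synthetic stand-ins, and the fold sizes (60 % initial window, 10 % test slices) follow the description above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data: 4 "pollutant" features with a linear target plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.1, 200)

n = len(X)
w_init, w_test = int(0.6 * n), int(0.1 * n)   # initial train window, test slice
metrics = []
while w_init + w_test <= n:
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[:w_init], y[:w_init])          # only past data is used for training
    pred = model.predict(X[w_init:w_init + w_test])
    y_test = y[w_init:w_init + w_test]
    metrics.append((mean_squared_error(y_test, pred), r2_score(y_test, pred)))
    w_init += w_test                           # expand the training window

mean_mse = float(np.mean([m[0] for m in metrics]))
mean_r2 = float(np.mean([m[1] for m in metrics]))
```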
Model deployment
This consists of three phases: training the RFR, computing residuals, and fitting an ARIMA model on these residuals, as explained below.
Train Random Forest Regressor: Random Forest is a predictive modeling technique that amalgamates multiple decision trees. Each tree in the forest is built from an independently sampled random vector, fostering diversity while adhering to the same underlying distribution. The key parameters of the Random Forest are the number of trees and the number of features: the former dictates the total number of trees in the forest, while the latter specifies how many features are randomly selected for consideration at each split.
The train-test split methodology is employed to partition the data into distinct training and testing sets. The test_size parameter, set to 0.2, designates the proportion of the dataset allocated for testing, with the remaining 80 % used for training the model. The random_state parameter ensures reproducibility between executions: by fixing the random seed, it guarantees identical data shuffling before splitting and thus reproducible outcomes. RandomForestRegressor exposes several significant parameters that alter the behavior of the random forest. 'n_estimators' controls the number of trees composing the forest, fixed at 200 in this context. 'min_samples_split' specifies the minimum number of samples required to split an internal node while constructing the tree, set to 2. Similarly, 'min_samples_leaf' specifies the minimum number of samples required at a leaf node, set to 1 so that each leaf contains at least one sample. The 'max_features' parameter regulates how many features are considered when determining the best split; here, the 'sqrt' setting considers the square root of the number of features at each node. 'max_depth' limits the depth of each decision tree, capped at 100 levels to avoid overfitting. Lastly, the 'bootstrap' parameter, set to False, signifies that the entire dataset is utilized for constructing each tree, without bootstrap sampling.
After parameter setup, the model undergoes training using the fit technique on the provided training data (X_train and y_train), encompassing input features (X_train) and corresponding target variables (y_train). Upon completion of training, the model is prepared to generate forecasts on new, unseen test data.
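The RFR configuration described above can be sketched as follows, using synthetic regression data as a stand-in for the pollutant features and AQI target:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pollutant features (X) and AQI target (y).
X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameters as reported in the text above.
rfr = RandomForestRegressor(
    n_estimators=200,        # number of trees in the forest
    min_samples_split=2,     # minimum samples to split an internal node
    min_samples_leaf=1,      # minimum samples per leaf node
    max_features="sqrt",     # sqrt(n_features) candidates at each split
    max_depth=100,           # cap tree depth to limit overfitting
    bootstrap=False,         # each tree sees the full training set
    random_state=42,
)
rfr.fit(X_train, y_train)
score = rfr.score(X_test, y_test)   # R^2 on the held-out 20 %
```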
Compute Residuals: In the second phase of the hybrid modeling approach, residual analysis is performed to capture the information that the machine learning model (Random Forest Regression, RFR) fails to explain. Specifically, the residuals are computed by subtracting the predicted AQI values obtained from the trained RFR model from the actual observed AQI values in the training dataset. Mathematically, this can be represented as:
\[ e_t = AQI_t - \widehat{AQI}_{\text{RFR},t} \tag{6} \]
These residuals essentially represent the unexplained variance or error terms of the RFR model, which may still contain meaningful temporal patterns and autocorrelations not captured by the nonlinear regression framework. Treating these residuals as a new time series, ARIMA is used to model and predict them. This enables the hybrid system to correct the systematic forecasting errors made by RFR, thus improving overall prediction accuracy. This step is important because it allows the hybrid model to combine the temporal learning ability of ARIMA with the feature-based learning power of Random Forest, making the overall framework more robust and holistic.
Fit ARIMA on Residuals: The ARIMA model is one of the most common models used in time series forecasting and analysis. It is written ARIMA(p, d, q), with 'p' representing the number of autoregressive parameters, 'd' the order of differencing needed to bring the series to stationarity, and 'q' the number of moving average parameters. The model is well suited to handling non-stationary data, such as air quality readings, and is widely used in predictive modeling due to its accuracy. The autoregressive (AR) component captures the association between the current value and previous observations, while the moving average (MA) component models the past forecast errors of the model. The 'd' parameter is the order of differencing applied to stabilize the data and render it stationary. Second-order differencing is normally sufficient to stabilize the data and make it smooth and stationary, giving the model the form ARIMA(p, 2, q). The 'p' and 'q' values are obtained using information criteria such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC); in this work we used AIC. After preparing the data to train and run the model, we split it into training and testing sets: the training data covers the period from the start of our data in 2015 through the end of 2018, providing the model with a broad spectrum of previous observations to learn from, while the testing set covers the remaining period, from 2019 to 2020.
By dividing the data this way, we could effectively evaluate how accurately our model predicts across different time frames, ensuring that our assessments were both reliable and robust.
The Hybrid RFR+ARIMA model flow chart is shown in Fig. 3. The overall Hybrid RFR-ARIMA pseudo-algorithm for AQI prediction is given in Table 5 as Algorithm 2 and the SHAP pseudo-code is given in Table 6 as Algorithm 3.
Fig. 3.
Flow-chart of the Hybrid RFR+ARIMA model for AQI Prediction.
Table 5.
Overall Hybrid RFR-ARIMA pseudo-algorithm for AQI prediction.
| Algorithm 2 |
|---|
| Input: Historical air quality dataset D containing pollutant concentrations (e.g., PM2.5, PM10, NO₂, SO₂, CO, O₃), timestamps, and corresponding AQI values. |
| Output: Final AQI predictions (AQI_pred_final) and model evaluation metrics (MSE, R²). |
| Data Preprocessing |
| a. Load dataset D (station_day.csv for finer granularity) |
| b. Handle missing values using imputation techniques (e.g., forward fill, mean). |
| c. Convert date fields into datetime format and extract relevant time-based features. |
| d. Normalize or scale data (optional for RFR). |
| Train–Test Split (80:20) |
| a. Split the dataset into training set D_train and testing set D_test (e.g., split by year). |
| b. Extract features X_train, X_test and targets y_train, y_test. |
| Train Random Forest Regressor (RFR) |
| a. Initialize and train the RFR_model on X_train, y_train. |
| b. Predict AQI values on both sets: |
| • AQI_pred_rfr_train = RFR_model.predict(X_train) |
| • AQI_pred_rfr_test = RFR_model.predict(X_test) |
| Compute Residuals |
| a. Calculate training residuals: |
| • residual_train = y_train − AQI_pred_rfr_train |
| Train ARIMA Model on Residuals |
| a. Treat residual_train as a univariate time series. |
| b. Determine ARIMA order (p, d, q) using AIC/BIC. |
| c. Fit the ARIMA model on residual_train. |
| d. Forecast residuals for the test period: |
| • residual_forecast = ARIMA_model.forecast(steps = len(y_test)) |
| Generate Final Hybrid Prediction |
| a. Combine RFR prediction and ARIMA residuals: |
| • AQI_pred_final = AQI_pred_rfr_test + residual_forecast |
| Model Evaluation |
| a. Evaluate AQI_pred_final against y_test using metrics such as: |
| • Mean Squared Error (MSE) |
| • R² Score |
| b. Visualize actual vs. predicted AQI values. |
| SHAP analysis |
| End |
Table 6.
SHAP Analysis pseudo-code for AQI Prediction Model.
| Algorithm 3 |
|---|
| Input: |
| - Trained Random Forest Regressor model (RFR) |
| - Feature dataset X (e.g., PM2.5, PM10, NO₂, etc.) |
| Output: |
| - SHAP values for each feature |
| - Ranked feature importance |
| Steps: |
| 1. Import SHAP library and initialize explainer: |
| explainer ← shap.TreeExplainer(RFR) |
| 2. Compute SHAP values for all input samples: |
| shap_values ← explainer.shap_values(X) |
| 3. Aggregate SHAP values: |
| For each feature i: |
| mean_SHAP_i ← mean(|shap_values[:, i]|) |
| 4. Rank features by mean_SHAP_i to determine global importance |
| 5. Interpret results: |
| - Identify top contributing features (e.g., PM2.5, PM10, NO₂) |
| End |
Evaluation metrics
Evaluation metrics provide quantitative measures for comparing predicted values with the actual observed values, showing how well the model captures the underlying patterns and generates accurate predictions. The mean of the squared differences between the actual and predicted values is known as the Mean Squared Error (MSE) and indicates the overall error of the model. Lower values indicate better performance, with a value of 0 indicating a perfect fit.
\( \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \)  (6a)
where n is the number of data points, \( y_i \) the observed values, and \( \hat{y}_i \) the predicted values.
The coefficient of determination (R²) measures the proportion of the variance in the observed values that is explained by the model's predictions. It typically ranges from 0 to 1, with 1 indicating a perfect fit and 0 indicating no explanatory relationship between the predicted and actual values; higher R² values mean better predictive performance.
\( R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \)  (7)
where \( \bar{y} \) is the mean of the observed values.
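As a concrete check of Eqs. (6a) and (7), both metrics can be computed by hand and compared against scikit-learn's implementations; the AQI values below are toy numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([120.0, 150.0, 90.0, 200.0, 170.0])  # observed AQI (toy values)
y_pred = np.array([115.0, 155.0, 95.0, 190.0, 175.0])  # predicted AQI

# Eq. (6a): mean of squared differences.
mse = np.mean((y_true - y_pred) ** 2)
# Eq. (7): 1 minus the ratio of residual to total sum of squares.
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(mse, round(float(r2), 4))  # → 40.0 0.9727
```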
Method validation
Exploratory data analysis is carried out on each of the five .csv files that comprise the dataset. Fig. 4 reveals significant variations in air quality across Indian cities from the city_day.csv dataset. From Fig. 4(a), most AQI values ranged between 50 and 200, indicating “Satisfactory” to “Moderate” air conditions, though cities like Delhi frequently surpassed this range. Seasonal trends in Fig. 4(b) were evident, with AQI peaking in winter months due to lower dispersion and festive pollution. Pollutants such as PM2.5 and PM10 were the primary contributors to poor air quality, as shown in Fig. 4(c), especially in industrial and densely populated cities like Delhi and Kolkata. From the city_hour.csv dataset results shown in Fig. 5, hourly trends showed AQI levels peaking during morning (7–10 AM) and evening (7–9 PM) rush hours in Fig. 5(a). City-level comparisons in Fig. 5(b) confirmed Delhi’s consistently high AQI, while cities like Mumbai and Hyderabad maintained better air quality. A correlation analysis in Fig. 5(c) highlighted PM2.5 and PM10 as the most influential pollutants on AQI. Analyzing station_day.csv in Fig. 6(a) reveals AQI values mostly ranging from 50 to 200, with notable outliers indicating severe pollution at some stations. From Fig. 6(b), seasonal AQI trends are prominent, with pollution peaking in late-year months and dropping during the monsoon, and from Fig. 6(c), PM2.5 and PM10 are the strongest contributors to AQI, reaffirming the role of particulate matter in air quality deterioration. From the stations.csv dataset, Fig. 7(a) shows that Maharashtra, UP, and Andhra Pradesh have the highest number of monitoring stations, reflecting broad regional coverage. From Fig. 7(b), Delhi leads in city-level monitoring, followed by other major metros like Mumbai and Hyderabad, and from Fig. 7(c), some inactive stations are noticed, highlighting a potential gap in operational infrastructure, especially if inactive stations are located in high-pollution zones.
Fig. 4.
EDA results from city_day.csv dataset; (a) Distribution of AQI values; (b) AQI trend over time (top 6 cities); and (c) distribution of pollutant concentrations by city.
Fig. 5.
EDA results from city_hour.csv dataset; (a) Average AQI by hour of the day; (b) AQI distribution in top 5 cities (hourly data); and (c) Correlation between pollutants and AQI.
Fig. 6.
EDA results from station_day.csv dataset; (a) Distribution of AQI across monitoring stations; (b) Average AQI trend over time (all stations); and (c) correlation between pollutants and AQI (Station-level).
Fig. 7.
EDA results from stations.csv dataset; (a) Number of monitoring stations by state; (b) Number of monitoring stations by city; and (c) Monitoring station status distribution.
Results from Random Forest Regressor: After evaluation on the test dataset, the RFR model achieved a Mean Squared Error (MSE) of 524.64 as shown in Table 7, indicating that, on average, the model’s AQI predictions deviated moderately from the actual values. Additionally, the model secured a high R² score of 0.917, meaning it successfully explained approximately 91.7 % of the variance in the AQI values. This high R² value demonstrates that the Random Forest Regressor was able to capture the complex, non-linear relationships between the pollutants and the AQI with strong accuracy. However, to further enhance predictive performance and address any remaining patterns not captured by the RFR (i.e., residuals), we model these residuals using an ARIMA time series model, with the RFR serving as the baseline, to improve AQI prediction accuracy.
Results from Hybrid RFR+ARIMA Model: The hybrid modeling approach integrated the RFR and ARIMA techniques. Even with a high-performing model like Random Forest, some patterns in the residuals (the differences between actual and predicted AQI) remained unexplained as shown in Fig. 8. The purple line shows the difference between the actual AQI and the Random Forest Regressor’s predictions. The dashed black line represents the zero-residual baseline (perfect prediction line). The residuals fluctuate around zero, making them ideal candidates for ARIMA modeling to capture and forecast underlying time-dependent patterns. ARIMA specializes in modeling sequential time-dependent structures, thus it effectively captured the autocorrelated patterns left behind by the Random Forest.
Fig. 8.
Residuals graph.
To evaluate the adequacy of the ARIMA(1,0,1) model in capturing temporal dependencies, we performed the Ljung-Box test on the model residuals across 30 lags. The resulting p-values were consistently above the standard significance threshold of 0.05, indicating no statistically significant autocorrelation in the residuals. This confirms that the residuals approximate white noise, and that the ARIMA model has effectively accounted for the temporal structure in the AQI time series. The diagnostic plot (see Fig. 9) supports this conclusion and validates the inclusion of ARIMA as a reliable residual correction layer in the proposed hybrid framework.
Fig. 9.
Ljung-Box test on ARIMA residuals.
While the hybrid RFR+ARIMA model successfully smooths systematic errors and improves overall predictive accuracy, it may inadvertently dampen high-frequency noise that characterizes real-world AQI data. Such noise often corresponds to short-term pollution spikes due to events like traffic congestion, weather anomalies, or localized emissions, which are crucial for alert systems and health advisories. Our current framework prioritizes trend consistency and interpretability, which may slightly compromise the model’s responsiveness to abrupt variations. To balance this, we maintained minimal smoothing during imputation and verified residual variance through diagnostic checks. Fig. 10 depicts the Residual Variance and Uncertainty Bands of ARIMA Predictions. It is observed that the model predictions are generally close to the actual AQI. The uncertainty bands represent natural variability and allow for short-term fluctuation analysis, addressing concerns about overly smooth predictions. The bands widen slightly where residual variance increases, confirming the model’s sensitivity to temporal dynamics.
Fig. 10.
Residual Variance and Uncertainty Bands of ARIMA Predictions.
Performance results of the baseline RFR and Hybrid RFR+ARIMA models are shown in Table 7. By adding the ARIMA forecast of the residuals to the original Random Forest predictions, the hybrid model achieved an improved MSE of 508.46 and an R² of 0.940. This represents a reduction of 16.18 in MSE and a 2.3-point improvement in R², demonstrating that ARIMA successfully corrected subtle time-dependent errors made by the Random Forest. While the hybrid RFR+ARIMA model exhibits a modest numerical improvement over the baseline Random Forest Regressor, its added value lies in modeling residual temporal dependencies that the tree-based RFR model cannot inherently capture. The ARIMA component effectively addresses short-term temporal autocorrelation in residuals, resulting in more stable and reliable forecasts, especially over extended time horizons. Importantly, this hybrid approach introduces only limited complexity compared to deep learning-based ensembles, while remaining fully compatible with interpretability frameworks like SHAP (implemented), LIME, and Partial Dependence Plots. This ensures that the model adheres to principles of explainable and trustworthy AI, which are vital in environmental health domains. Thus, the hybridization is not merely about marginal gains in metrics but about robustness, interpretability, and long-term forecasting reliability, which make it suitable for real-world air quality monitoring and policy support applications.
Table 7.
Performance results of RFR and Hybrid RFR+ARIMA models.
| Model | MSE | R² |
|---|---|---|
| Baseline Random Forest Regressor (RFR) | 524.64 | 0.917 |
| Proposed Hybrid RFR + ARIMA | 508.46 | 0.940 |
The graph in Fig. 11(a) illustrates the comparison between the actual AQI values (India_AQI) and the hybrid model’s predicted AQI values (Predicted Values) over time. The orange curve represents the true observed AQI, while the blue curve shows the AQI predicted by the hybrid Random Forest + ARIMA model. The pink shaded region indicates the prediction uncertainty or confidence interval around the model's forecast. The model was trained on historical data until early 2019, after which the prediction (blue line) extends into the forecast period. The gap between the blue and orange curves, and the width of the pink band, reflect the model's accuracy and confidence. A narrower pink band suggests higher confidence in the prediction, while larger gaps or wider bands indicate increased uncertainty. Overall, the graph highlights the model's capability to closely follow the real AQI trends with reasonable confidence, especially during forecast periods.
Fig. 11.
Graphs showing Predicted vs. original AQI values; (a) till 2019; (b) 2020–2025.
In Graph Fig. 11(b), the orange curve represents the observed AQI data from 2015 to early 2020, while the blue curve extends into future years (2020–2025) to project AQI levels.
The model captures seasonal AQI patterns, evident in the cyclical rise and fall seen in the predictions beyond 2020. The consistent seasonal cycles likely reflect recurring environmental factors such as weather changes, industrial emissions, and traffic patterns. As expected, the uncertainty widens farther into the future (2023–2025), reflecting greater difficulty in long-term forecasting. This graph effectively shows both the historical AQI behavior and the model’s confidence in its long-term AQI forecasts, providing valuable insights for air quality management and policy planning.
To ensure a robust assessment of model performance, we adopted an expanding window cross-validation strategy that respects the temporal ordering of the AQI data. This approach allowed us to evaluate the hybrid model on sequential future intervals while avoiding data leakage, a common risk in random K-fold cross-validation with time series. We benchmarked the hybrid RFR+ARIMA model against the standalone RFR and ARIMA models, as well as against performance reported in related literature for LSTM [35], CNN-LSTM [36], XGBoost with feature engineering [37], and GRU with attention mechanisms [38] as shown in Fig. 12. Despite its lower architectural complexity, the hybrid model outperformed the baselines in both MSE and R² while maintaining interpretability. Additionally, ARIMA residual diagnostics, including autocorrelation plots and Ljung-Box statistics, showed no significant residual autocorrelation, suggesting that the model is well-specified and not overfitted. These findings support the hybrid model’s effectiveness in balancing predictive performance with practical interpretability.
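An expanding window scheme of this kind corresponds to scikit-learn's TimeSeriesSplit, which grows the training window fold by fold and always tests on the interval that follows it, so no future data leaks backward. The snippet below is a sketch on synthetic data, not the study's exact fold configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic pollutant features and AQI-like target (hypothetical).
rng = np.random.default_rng(3)
X = rng.uniform(0, 200, size=(400, 3))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 5, 400)

# Each fold trains on an expanding prefix and tests on the next interval.
scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print("fold R2 scores:", [round(s, 3) for s in scores])
```

Unlike shuffled K-fold, the test indices here always come after the training indices, which is what preserves the temporal integrity emphasized above.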
Fig. 12.
Performance comparison of proposed model with current state-of-the-art models.
To help understand the contribution of each feature in the hybrid model, Table 8 captures results of SHAP analysis and lists the top 3 features impacting AQI prediction. Feature importance analysis revealed that PM2.5, PM10, and NO₂ were the top three contributors to AQI prediction. PM2.5 was the most influential, reflecting its critical role in determining air quality, followed closely by PM10 and NO₂, both known for their strong environmental and health impacts.
Table 8.
Feature importance results using SHAP analysis.
| Rank | Feature | Feature Score and Reason |
|---|---|---|
| 1 | PM2.5 | Score= 0.38. Fine particulate matter that directly affects human health and dominates AQI calculation. |
| 2 | PM10 | Score= 0.28. Larger particulate matter that also significantly impacts AQI, especially in dusty and urban areas. |
| 3 | NO₂ | Score= 0.14. Nitrogen dioxide is a major pollutant from vehicle emissions and industrial activities, heavily influencing air quality. |
While the individual components of our model, Random Forest and ARIMA, are well-established, the novelty of this work lies in the strategic integration and application context. Specifically, we design a two-stage hybrid framework where Random Forest first learns the nonlinear pollutant-to-AQI mapping, and ARIMA then models the residuals to capture remaining temporal dependencies. This residual-corrective structure differs from common ensemble or parallel hybrids by isolating and targeting distinct modeling deficiencies. Moreover, our model emphasizes interpretability, employing SHAP to make feature influence transparent, an essential but often overlooked aspect in the air quality forecasting literature. Lastly, we validate the model using time-respecting cross-validation and benchmark it against deep learning alternatives, thus contributing both methodological clarity and practical relevance to hybrid air quality forecasting systems.
Limitations
Despite its good performance, the hybrid Random Forest Regressor + ARIMA model is subject to some limitations. First, the Random Forest model relies heavily on the availability and quality of historical pollutant concentration data; incomplete or inconsistent measurements may decrease prediction accuracy. The model is, however, extensible: exogenous variables such as meteorological parameters (e.g., temperature, wind speed, humidity), population density, and policy interventions can significantly influence air quality dynamics, and they can be seamlessly integrated in future work, enhancing generalizability without modifying the core architecture. Second, the ARIMA component assumes linear structure in the residuals, which can fail to represent intricate, non-linear dynamics in air quality caused by abrupt environmental events such as festivals, fires, or lockdowns. The model also excludes external factors such as weather, road conditions, or policy measures that have significant impacts on AQI but are not modeled here. Future work may incorporate noise-preserving models or probabilistic approaches (e.g., quantile regression forests or Monte Carlo dropout) to better retain short-term variability without sacrificing explainability or robustness. Long-horizon forecasting also introduces greater uncertainty, as compounded errors become visible in the prediction interval coverage after 2023. Finally, while the hybrid Random Forest-ARIMA model shows improved forecasting accuracy and interpretability, the Random Forest component, though powerful for nonlinear regression, lacks internal temporal awareness, while ARIMA assumes linear stationarity and can struggle with rapidly shifting pollutant dynamics. Furthermore, RF and ARIMA, when not tightly integrated architecturally, risk independent error accumulation.
Our model mitigates these to some extent via residual correction and explainability layers, but we acknowledge that more advanced neural architectures, such as graph convolutional networks (GCNs), attention-enhanced LSTM/GRUs, or transformer-based temporal models could further improve spatiotemporal learning. We present our model as a transparent, computationally efficient alternative, especially for settings where interpretability is prioritized over sheer predictive power.
Ethics statements
None
Related research article
None
Supplementary material and/or additional information
None
CRediT authorship contribution statement
Anuradha Yenkikar: Conceptualization, Resources, Data curation, Software, Validation. Ved Prakash Mishra: Supervision, Project administration. Manish Bali: Conceptualization, Methodology, Formal analysis, Writing – original draft, Writing – review & editing. Tabassum Ara: Software, Validation, Supervision.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Data availability
Data will be made available on request.
References
- 1. United Nations. Transforming Our World: the 2030 Agenda for Sustainable Development; 2015. [Online]. Available: https://sdgs.un.org/2030agenda
- 2. Ministry of Environment, Forest and Climate Change. National Air Quality Index. Central Pollution Control Board (CPCB), India; 2014.
- 3. Liu H., Li Q., Yu D., Gu Y. Air quality index and air pollutant concentration prediction based on machine learning algorithms. Appl Sci. 2019;9(19):4069.
- 4. Castelli M., Clemente F.M., Popovic A., Silva S., Vanneschi L. A machine learning approach to predict air quality in California. Complexity. 2020; Article ID 8049504.
- 5. Mani G., Viswanadhapalli J.K., Stonie A.A. Prediction and forecasting of air quality index in Chennai using regression and ARIMA time series models. J Eng Res. 2021;9.
- 6. Panda P.R., Sameen B.S.L.S., Balasubramanian R. ARIMA modeling of air pollution in major Indian cities. Environ Sci Pollut Res. 2021;28:39250–39263.
- 7. Al-Eidi S., et al. Comparative analysis study for air quality prediction in smart cities using regression techniques. IEEE Access. 2023;11:115140–115149.
- 8. Sharma A., et al. Temporal ensemble models for air quality forecasting in smart cities. Environ Inform Lett. 2024;12(1):45–58. doi: 10.1016/j.envinf.2024.01.005.
- 9. Soundari A.G., Gnana J., Akshaya A.C. Indian air quality prediction and analysis using machine learning. Int J Appl Eng Res. 2019;14:11.
- 10. Natarajan S., et al. Optimized machine learning model for air quality index prediction in major cities in India. Sci Rep. 2024;14:6795. doi: 10.1038/s41598-024-54807-1.
- 11. Maleki H., et al. Air pollution prediction by using an artificial neural network model. Clean Technol Environ Policy. 2019;21(6):1341–1352. doi: 10.1007/s10098-019-01709-w.
- 12. Yenkikar A., Mishra V.P., Bali M., Ara T. An explainable AI-based hybrid machine learning model for interpretability and enhanced crop yield prediction. MethodsX. 2025;103442. doi: 10.1016/j.mex.2025.103442.
- 13. Chen L., Rao Y. SHAP-based explainable framework for time-series environmental data. Appl AI Adv. 2025;7(2):101–113. doi: 10.1016/j.aiaa.2025.02.003.
- 14. Bali M., Mishra V.P., Yenkikar A. Artificial bee colony optimized random forest model for prediction of fly ash concrete compressive strength. MethodsX. 2025;103412. doi: 10.1016/j.mex.2025.103412.
- 15. Dataset: https://www.kaggle.com/datasets/rohanrao/air-quality-data-in-india.
- 16. Liu H., Li Q., Yu D., Gu Y. Air quality index and air pollutant concentration prediction based on machine learning algorithms. Appl Sci. 2019;9:4069.
- 17. Castelli M., Clemente F.M., Popovic A., Silva S., Vanneschi L. A machine learning approach to predict air quality in California. Complexity. 2020; Article ID 8049504, 23 pages.
- 18. Mani G., Viswanadhapalli J.K., Stonie A.A. Prediction and forecasting of air quality index in Chennai using regression and ARIMA time series models. J Eng Res. 2021;9.
- 19. Kottur S.V., Mantha S.S. An integrated model using Artificial Neural Network (ANN) and Kriging for forecasting air pollutants using meteorological data. Int J Adv Res Comput Commun Eng. 2015;4:146–152.
- 20. Halsana S. Air quality prediction model using supervised machine learning algorithms. Int J Sci Res Comput Sci Eng Inf Technol. 2020;8:190–201.
- 21. Soundari A.G., Gnana J., Akshaya A.C. Indian air quality prediction and analysis using machine learning. Int J Appl Eng Res. 2019;14(24):11.
- 22. Al-Eidi S., Amsaad F., Darwish O., Tashtoush Y., Alqahtani A., Niveshitha N. Comparative analysis study for air quality prediction in smart cities using regression techniques. IEEE Access. 2023;11:115140–115149. doi: 10.1109/ACCESS.2023.3323447.
- 23. Natarajan S.K., Shanmurthy P., Arockiam D., et al. Optimized machine learning model for air quality index prediction in major cities in India. Sci Rep. 2024;14:6795. doi: 10.1038/s41598-024-54807-1.
- 24. Maleki H., Sorooshian A., Goudarzi G., Baboli Z., Tahmasebi Birgani Y., Rahmati M. Air pollution prediction by using an artificial neural network model. Clean Technol Environ Policy. 2019;21(6):1341–1352. doi: 10.1007/s10098-019-01709-w.
- 25. Singh K.P., Gupta S., Rai P. Identifying pollution sources and predicting urban air quality using ensemble learning methods. Atmos Environ. 2013;80:426–437.
- 26. Zhou Y., De S., Ewa G., Perera C., Moessner K. Data driven air quality characterization for urban environments: a case study. IEEE Access. 2018;6: Article ID 77996.
- 27. Mahalingam U., Elangovan K., Dobhal H., Valliappa C., Shrestha S., Kedam G. A machine learning model for air quality prediction for smart cities. In: Proceedings of the 2019 International Conference on Wireless Communications Signal Processing and Networking (WiSPNET); Chennai, India; March 2019. pp. 452–457.
- 28. Sivakumar V., Kanagachidambaresan G.R., Dhilip kumar V., Arif M., Jackson C., Arulkumaran G. Energy-efficient Markov-based lifetime enhancement approach for underwater acoustic sensor network. J Sens. 2022; Article ID 3578002, 10 pages.
- 29. Behesht Abad A.R., Mousavi S., Mohamadian N., et al. Hybrid machine learning algorithms to predict condensate viscosity in the near wellbore regions of gas condensate reservoirs. J Nat Gas Sci Eng. 2021;95: Article ID 104210.
- 30. Sarkar N., Gupta R., Keserwani P.K., Govil M.C. Air Quality Index prediction using an effective hybrid deep learning model. Environ Pollut. 2022;315. doi: 10.1016/j.envpol.2022.120404.
- 31. Hansun S., Bonar Kristanda M. AQI measurement and prediction using B-WEMA method. Int J Eng Res Technol. 2019;12:1621–1625.
- 32. Chowdhury A.-S., Uddin M.S., Tanjim M.R., Noor F., Rahman R.M. Application of data mining techniques on air pollution of Dhaka City. In: Proceedings of the 2020 IEEE 10th International Conference on Intelligent Systems (IS); Varna, Bulgaria; August 2020. pp. 562–567.
- 33. Yuan Z., Liu Y., Li X., Tang Z. An explainable machine learning model for PM2.5 prediction using XGBoost and SHAP. Atmos Environ. 2022;285. doi: 10.1016/j.atmosenv.2022.119260.
- 34. Dang N., Zhang J., Lin Y., et al. Interpretable air quality prediction using LightGBM with SHAP and LIME: a case study across multiple Chinese cities. Sci Total Environ. 2023;875. doi: 10.1016/j.scitotenv.2023.162621.
- 35. Li X., et al. Air quality prediction based on LSTM neural networks. Environ Model Softw. 2022;149. doi: 10.1016/j.envsoft.2022.105296.
- 36. Zhang Y., et al. A hybrid CNN-LSTM model for PM2.5 forecasting. Atmos Pollut Res. 2023;14(2). doi: 10.1016/j.apr.2022.101495.
- 37. Wang H., et al. XGBoost-based air pollution forecasting with feature selection. Sustain Cities Soc. 2022;85. doi: 10.1016/j.scs.2022.104065.
- 38. Chen L., et al. GRU-Attention network for short-term air quality prediction. Ecol Indic. 2023;148. doi: 10.1016/j.ecolind.2023.110125.
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
Data will be made available on request.