Abstract
Air quality degradation poses significant risks to human health and ecosystems, particularly in rapidly urbanizing and industrialized arid regions. Meteorological conditions strongly influence the formation, transport, and dispersion of air pollutants, yet their relationships are highly nonlinear and difficult to quantify using conventional statistical approaches. This study investigates the influence of meteorological parameters on key air pollutants in the Eastern Region of Saudi Arabia using machine learning techniques. Five years of observational data (2017–2021), including temperature, humidity, wind speed, wind direction, dew point, and atmospheric pressure, were analyzed alongside concentrations of nitrogen dioxide (NO2), carbon monoxide (CO), and particulate matter (PM10). Four machine learning algorithms, namely Neural Networks (NN), Decision Trees (DT), Random Forests (RF), and Gradient Boosting (GB), were evaluated using standard performance metrics: mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R2). The results indicate that meteorological parameters exert pollutant-specific influences. The GB model achieved the highest predictive accuracy for NO2 (R2 ≈ 0.83), highlighting the dominant role of humidity, dew point, and seasonal variation. Moderate predictive performance was observed for CO (R2 ≈ 0.46), suggesting a combined influence of meteorology and emission-driven processes. In contrast, PM10 exhibited weak correlations with meteorological variables, reflecting the dominance of episodic dust events and non-meteorological factors in arid environments. These findings demonstrate the effectiveness of ensemble machine learning models in capturing nonlinear meteorological-pollutant relationships. The study provides valuable insights for air quality forecasting and supports data-driven environmental management in arid and semi-arid regions.
Keywords: Artificial intelligence, Air quality, Meteorological conditions, Machine learning, Saudi Arabia
Subject terms: Climate sciences, Environmental sciences, Mathematics and computing
Introduction
Air quality represents the state of the atmosphere in relation to the presence of pollutants that affect human health, ecosystems, and overall environmental quality. Poor air quality is associated with respiratory and cardiovascular diseases, reduced visibility, and ecosystem degradation1,2. It is influenced by a combination of natural and anthropogenic factors, including emissions from industrial processes, vehicular traffic, and domestic sources, as well as meteorological conditions, topography, and vegetation cover3,4. Among these factors, meteorological parameters play a fundamental role in determining the transport, transformation, and dispersion of air pollutants. Parameters such as temperature, humidity, wind speed, wind direction, and precipitation influence the formation and removal processes of gaseous and particulate pollutants, thereby governing local and regional air quality dynamics5. Temperature is one of the most significant meteorological factors affecting atmospheric chemistry. Elevated temperatures increase the rate of photochemical reactions that lead to the production of ground-level ozone (O3). This secondary pollutant forms through complex reactions between nitrogen oxides (NOx) and volatile organic compounds (VOCs) in the presence of sunlight6. As a result, ozone levels are typically higher during summer months and in densely populated urban areas. Furthermore, high temperatures may induce thermal inversions, where a layer of warm air traps cooler air near the ground. These inversions prevent vertical mixing, thereby confining pollutants close to the surface and deteriorating air quality7.
Relative humidity also plays a crucial role in atmospheric processes by affecting the physical and chemical transformations of airborne particles. High humidity levels can promote the formation of secondary particulate matter (PM), through condensation of gaseous pollutants and heterogeneous reactions between gas-phase and liquid-phase species8. Conversely, in dry conditions, atmospheric particles tend to remain suspended for longer periods, enhancing the persistence of particulate pollution. Wind speed and direction govern the dispersion and transport of pollutants across regions. High wind speeds enhance dilution by dispersing pollutants over larger areas, reducing their local concentrations. In contrast, stagnant air and low wind speeds allow pollutants to accumulate, particularly in urban environments and topographically enclosed basins. Wind direction determines the trajectory of pollutant plumes, thereby influencing which regions experience higher exposure levels. Variations in wind patterns may occur due to diurnal heating, seasonal changes, and atmospheric pressure gradients9. Precipitation also serves as an important atmospheric cleansing mechanism, as it facilitates the removal of pollutants through wet deposition processes such as rainfall or snowfall. Moreover, precipitation events lower ambient temperature and humidity, which can reduce the formation rates of ozone and particulate matter10.
Despite the well-established importance of these meteorological parameters, quantifying their relationship with air pollutant concentrations remains complex. The interactions among temperature, humidity, wind, and pollutant sources are nonlinear and interdependent, making it difficult to derive consistent relationships using traditional linear statistical methods. Correlation analysis, regression, and other classical statistical methods provide valuable insights but are often insufficient for identifying intricate, multidimensional relationships11. Consequently, it is necessary to employ advanced computational approaches capable of capturing these nonlinear dynamics. Artificial intelligence (AI) and machine learning (ML) methods have emerged as powerful tools for analyzing complex environmental datasets characterized by high dimensionality and nonlinearity12,13. ML algorithms are designed to improve their performance through data-driven training processes. These algorithms can uncover hidden dependencies among environmental variables, making them particularly useful for air quality modeling14. ML techniques are generally categorized into supervised, unsupervised, and reinforcement learning approaches. Supervised learning methods are relevant for environmental prediction, as they use historical input-output data to model complex relationships between meteorological and air quality parameters15.
Numerous studies have demonstrated the capability of AI models in forecasting air pollutant concentrations and evaluating the influence of meteorological conditions. NN can model nonlinear and multivariate dependencies between input and output variables, while ensemble methods such as RF and GB enhance predictive performance through the aggregation of multiple weak learners. These algorithms have been successfully applied in several regions, including China, the United States, and Australia, to investigate pollutant dynamics under diverse climatic conditions5,16–19. Despite the growing application of artificial intelligence and machine learning techniques in air quality research, several gaps remain in their application to arid and semi-arid environments. Most existing AI-based studies focus on temperate or coastal regions, where atmospheric moisture, boundary layer dynamics, and emission patterns differ substantially from desert climates. In arid regions such as the Eastern Province of Saudi Arabia, extreme temperatures, frequent dust events, low and highly variable humidity, and strong seasonal wind regimes introduce unique nonlinear interactions between meteorology and pollutant behavior that are not well captured in current models. Moreover, many previous studies emphasize overall prediction accuracy without systematically comparing multiple machine learning algorithms under identical data and feature-selection frameworks or evaluating pollutant-specific meteorological controls. To address this research gap, the present study investigates the relationship between meteorological conditions and key air quality parameters in the Eastern Region of Saudi Arabia using ML-based modeling. Specifically, four ML algorithms including NN, DT, RF, and GB were applied to evaluate the relationship between meteorological parameters (temperature, humidity, wind speed, wind direction, and pressure) and air quality indicators. 
Five years of observational data (2017–2021) were used to train and validate the models. The specific objectives of this study are to:
i. Quantify the influence of individual meteorological parameters on the concentration levels of key air pollutants;
ii. Compare the predictive performance of different AI algorithms in capturing these relationships; and
iii. Identify the most significant meteorological predictors for each pollutant.
The present study advances the literature by (i) providing a comparative assessment of four widely used machine learning algorithms under arid climatic conditions, (ii) quantifying pollutant-specific meteorological influences for NO₂, CO, and PM₁₀ using a consistent feature-ranking approach, and (iii) demonstrating the strengths and limitations of meteorology-driven AI models for different pollutant types. These contributions offer new insights into the applicability and constraints of AI-based air quality modeling in desert environments and support the development of more targeted, region-specific forecasting and management strategies.
Methodology
Study area
The study was conducted in the Dammam Metropolitan Area, located in the Eastern Province of Saudi Arabia (Fig. 1). This region is one of the country’s most industrialized and urbanized zones, with numerous oil, gas, and petrochemical industries, power plants, and transportation networks that contribute to both gaseous and particulate emissions. Dammam lies along the Arabian Gulf coast and has an arid desert climate characterized by extremely high summer temperatures, mild winters, and very low annual rainfall. Meteorological conditions in the region are typically governed by the interaction between the Arabian Gulf sea breeze and continental desert air masses. Summer temperatures frequently exceed 45 °C, accompanied by high solar radiation and frequent dust events. Relative humidity ranges from 30 to 80%, depending on proximity to the Arabian Gulf, while precipitation is sporadic, averaging less than 100 mm annually. Seasonal wind regimes include calm periods conducive to pollutant accumulation, as well as strong northwesterly “Shamal” winds that facilitate dispersion. This meteorological variability, combined with industrial and traffic emissions, makes Dammam an ideal testbed for studying the relationship between weather dynamics and air pollution under arid conditions.
Fig. 1.
Map of the Study Area, Eastern Province, Saudi Arabia, Showing the locations of meteorological and air quality monitoring stations used in this study.
Data collection and sources
This study integrates five years (2017–2021) of meteorological and air quality datasets obtained from multiple sources to ensure temporal and spatial consistency.
Meteorological and air quality data
Meteorological data were sourced from the King Fahd International Airport (KFIA) weather station, operated under the General Authority of Meteorology and Environmental Protection (GAMEP). Data were recorded twice daily, representing diurnal and nocturnal atmospheric conditions. The following parameters were obtained: temperature (°F), dew point (°F), relative humidity (%), wind speed (m/s), wind direction (°), atmospheric pressure (in-Hg), and precipitation (mm). Because precipitation was not consistently recorded during the study period, it was excluded from model training and analysis to maintain data reliability. Additional derived variables were introduced to enhance temporal representation, including day-of-year to account for seasonal effects and categorical encoding of wind direction to capture directional variability in pollutant dispersion.
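The two derived variables mentioned above can be illustrated with a minimal Python sketch. It is not the study's own workflow, which was carried out in Orange. The function names and the choice of eight 45-degree compass sectors are assumptions made for illustration, since the number of wind-direction categories is not reported.

```python
from datetime import date

# Hypothetical 8-sector compass labels; the number of sectors actually used
# for the wind-direction encoding is not reported in the text.
SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def day_of_year(d: date) -> int:
    """Return the ordinal day within the year (1 for 1 January)."""
    return d.timetuple().tm_yday


def wind_sector(degrees: float) -> str:
    """Map a wind direction in degrees to one of eight 45-degree sectors."""
    # Shift by half a sector so that "N" covers 337.5 to 22.5 degrees.
    index = int(((degrees % 360) + 22.5) // 45) % 8
    return SECTORS[index]


if __name__ == "__main__":
    print(day_of_year(date(2019, 2, 1)), wind_sector(315.0))
```

With this calendar-based definition, the last day of a leap year such as 2020 maps to 366 rather than 365.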
Air quality data for the same period were collected from public and private air monitoring stations within the Dammam metropolitan area. The key pollutants investigated were: Carbon monoxide (CO), Nitrogen dioxide (NO2) and Particulate Matter (PM10). The selection of these pollutants was based on their prevalence in urban-industrial environments and their distinct emission characteristics. CO and NO2 represent primary pollutants associated with vehicular exhaust and combustion activities, whereas PM₁₀ serves as a major indicator of dust and secondary aerosol formation. Concentrations were expressed in mg/L and synchronized with corresponding meteorological records based on date and time stamps to ensure alignment of atmospheric and pollutant datasets.
Data preprocessing and software tools
Before model development, the datasets underwent systematic preprocessing to improve data quality and ensure compatibility with machine learning algorithms. Records with missing values accounted for approximately 5–6% of the total dataset and were removed to preserve temporal consistency and avoid bias associated with extensive imputation. Outliers were identified using a statistical threshold based on the three-standard-deviation rule, whereby values exceeding three standard deviations from the mean were considered anomalous. These outliers were examined individually and replaced using median interpolation, which was selected due to its robustness against skewed distributions and its ability to preserve the central tendency of environmental variables that are often non-normally distributed.
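The cleaning steps described above can be sketched in plain Python as follows. This is an illustrative approximation rather than the Orange workflow used in the study: it assumes that "median interpolation" means substituting the series median for each flagged value, and it does not reproduce the individual examination of outliers. The function names are hypothetical.

```python
import math
import statistics


def drop_missing(records):
    """Keep only records (dicts) in which no field is None or NaN."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return [r for r in records if not any(is_missing(v) for v in r.values())]


def replace_outliers_with_median(values, n_sigma=3.0):
    """Replace values farther than n_sigma standard deviations from the mean.

    Assumption: flagged values are replaced by the median of the series.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    median = statistics.median(values)
    return [median if abs(v - mean) > n_sigma * std else v for v in values]


if __name__ == "__main__":
    series = [10.0] * 19 + [100.0]  # one obvious spike
    print(replace_outliers_with_median(series))
```

Note that in very short series a single spike inflates the standard deviation enough to escape the three-sigma rule, so the threshold is only meaningful for reasonably long records.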
Numerical encoding was carried out, and a day-of-year variable (ranging from 1 to 365) was incorporated to capture cyclic seasonal trends that influence pollutant concentration patterns. Following preprocessing, the final dataset contained seven meteorological predictors and three dependent air quality parameters. Feature ranking was conducted using the training dataset under cross-validation to ensure stability of importance scores. For each pollutant (NO₂, CO, and PM₁₀), predictors were ranked according to their relative importance, and the top one to three variables were retained for subsequent model development. This strategy allowed systematic evaluation of how predictive performance evolved with increasing model complexity, while minimizing overfitting and noise introduced by weakly informative features. The final number of predictors therefore varied by pollutant and model configuration, depending on the optimal balance between accuracy and generalization. All computations were performed using Orange 3.36.1, an open-source visual programming platform for machine learning. Orange integrates a wide range of ML algorithms, validation tools, and visualization modules suitable for environmental modeling. The data were divided into two subsets: approximately 80% of the data were used for model training, while the remaining 20% were reserved for validation and performance testing.
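An 80/20 partition of this kind can be sketched as below. The sketch assumes a random (shuffled) split with a fixed seed so that every model sees the same partition; whether the original split was random or chronological is not stated, and the function name and seed are hypothetical.

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices once and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    # A dedicated, seeded generator keeps the split reproducible and
    # independent of any other use of the random module.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


if __name__ == "__main__":
    train, test = split_indices(100)
    print(len(train), len(test))
```

Reusing the same index lists for all four algorithms is one simple way to guarantee that differences in scores come from the models rather than from the partition.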
Machine learning algorithms and model validation
Four ML algorithms were employed to model and predict the relationship between meteorological variables and air quality parameters. These algorithms were chosen due to their complementary strengths in capturing linear and nonlinear dependencies.
Neural network (NN)
The NN algorithm, also known as an artificial neural network, is composed of interconnected processing nodes (neurons) organized in input, hidden, and output layers20. The NN used in this study comprised one hidden layer with adjustable neuron counts optimized via cross-validation. Each neuron applies an activation function to a weighted sum of inputs, enabling the network to learn complex nonlinear mappings between predictors and responses (Fig. 2). The NN models were trained using a backpropagation learning algorithm with gradient descent optimization. Regularization techniques, such as dropout, were implemented to minimize overfitting.
Fig. 2.

Schematic representation of neural network architecture showing the input, hidden, and output layers. Each neuron processes weighted inputs through activation functions to generate predictive outputs.
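The computation performed by a single-hidden-layer regression network can be written out directly. The following sketch shows only the forward pass with a ReLU activation (the activation listed in Table 1); training by backpropagation and dropout are omitted, and all names and weights are illustrative rather than taken from the fitted models.

```python
def relu(x):
    """Rectified linear unit: pass positive values, clamp negatives to zero."""
    return x if x > 0.0 else 0.0


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a one-hidden-layer network with a linear output neuron."""
    # Each hidden neuron applies ReLU to a weighted sum of the inputs.
    hidden = [
        relu(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # The output is a weighted sum of hidden activations (regression output).
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


if __name__ == "__main__":
    print(forward([1.0, 2.0], [[1.0, 1.0], [-1.0, 0.0]], [0.0, 0.0], [2.0, 5.0], 1.0))
```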
Decision tree (DT)
The Decision Tree (DT) algorithm is a non-parametric, supervised learning model that partitions the dataset into hierarchical nodes based on predictor variable values21. For regression tasks, each leaf node represents the mean predicted value of samples within that node (Fig. 3). Splitting criteria were based on minimizing mean squared error (MSE) across nodes, providing an interpretable model structure for understanding predictor-response relationships.
Fig. 3.
Sample structure of a regression decision tree illustrating the hierarchical splitting of meteorological variables used to predict air pollutant concentrations (www.hackerearth.com).
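The MSE-based splitting rule can be made concrete with a short sketch that searches one predictor for the threshold giving the smallest total squared error in the two child nodes (equivalent to minimizing the sample-weighted MSE of the children). This is a generic illustration, not Orange's implementation, and the function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the best split of the form x <= threshold."""
    best_threshold, best_error = None, float("inf")
    candidates = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        error = sse(left) + sse(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


if __name__ == "__main__":
    print(best_split([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]))
```

A full regression tree applies this search recursively to each child node until a stopping rule, such as the maximum depth or minimum samples per leaf in Table 1, is reached.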
Random forest (RF)
The Random Forest (RF) algorithm is an ensemble method that aggregates predictions from multiple decision trees trained on bootstrapped subsets of the dataset22. Each tree is constructed using a random selection of predictor features to enhance model diversity and reduce overfitting. The final prediction is derived from the average output of all trees. The importance of each feature was quantified based on the mean decrease in impurity using the Gini index, allowing identification of the most influential meteorological parameters.
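The bagging idea can be sketched with very shallow trees. In the illustration below each tree is a single-split "stump" fitted to a bootstrap sample on one randomly chosen feature, and the forest prediction is the average over trees. The depth-1 trees, the one-feature-per-split rule, and all names are simplifying assumptions; the study's forests used 100 fully grown trees in Orange.

```python
import random


def fit_stump(rows, targets, feature):
    """Best single split on one feature: (feature, threshold, left_mean, right_mean)."""
    overall = sum(targets) / len(targets)
    # Fallback when the sample has a single distinct value: predict the mean.
    best = (float("inf"), feature, float("inf"), overall, overall)
    values = sorted({row[feature] for row in rows})
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [t for row, t in zip(rows, targets) if row[feature] <= threshold]
        right = [t for row, t in zip(rows, targets) if row[feature] > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if error < best[0]:
            best = (error, feature, threshold, left_mean, right_mean)
    return best[1:]


def fit_forest(rows, targets, n_trees=100, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # draw rows with replacement
        feature = rng.randrange(n_features)            # random feature for the split
        forest.append(fit_stump([rows[i] for i in sample], [targets[i] for i in sample], feature))
    return forest


def predict(forest, row):
    """Average the predictions of all trees in the forest."""
    outputs = [left if row[f] <= thr else right for f, thr, left, right in forest]
    return sum(outputs) / len(outputs)
```

Averaging over many resampled trees is what reduces the variance of any single tree's prediction.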
Gradient boosting (GB)
The Gradient Boosting (GB) algorithm constructs an additive ensemble of weak learners (typically decision trees) to minimize model residuals through gradient descent optimization23. Each new tree is trained to predict the residual errors from the previous model iteration, progressively improving accuracy. A learning rate parameter was applied to control the contribution of each tree and prevent overfitting. GB models are particularly effective in capturing subtle nonlinear interactions within small to moderately sized datasets.
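The residual-fitting loop described above can be sketched for a single predictor and squared-error loss. Each iteration fits a depth-1 tree to the current residuals and adds its output scaled by the learning rate. This is a didactic simplification with hypothetical names, not the configuration used in the study (which allowed tree depths of 3 to 5).

```python
def fit_stump(x, residuals):
    """Depth-1 regression tree on one predictor: (threshold, left_mean, right_mean).

    Requires at least two distinct values in x.
    """
    best = None
    candidates = sorted(set(x))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_gradient_boosting(x, y, n_trees=100, learning_rate=0.1):
    """Return (initial_prediction, learning_rate, stumps) for squared-error boosting."""
    initial = sum(y) / len(y)  # start from the mean of the targets
    predictions = [initial] * len(y)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, predictions)
        ]
    return initial, learning_rate, stumps


def predict(model, xi):
    """Sum the initial prediction and the scaled contribution of every stump."""
    initial, learning_rate, stumps = model
    return initial + learning_rate * sum(
        left if xi <= threshold else right for threshold, left, right in stumps
    )
```

A smaller learning rate shrinks each tree's contribution, so more trees are needed to reach the same fit; this is the trade-off behind the ranges in Table 1.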
Model evaluation and validation
To ensure robust and comparable evaluation, all models were assessed using four statistical performance metrics: (i) Mean Squared Error (MSE), the average of squared prediction deviations, which is sensitive to large errors; (ii) Root Mean Squared Error (RMSE), an interpretable measure of prediction error magnitude in pollutant units; (iii) Mean Absolute Error (MAE), the average magnitude of prediction errors irrespective of sign; and (iv) the Coefficient of Determination (R²), the proportion of variance in observed data explained by the model. The best-performing algorithm for each pollutant was determined by the combination of lowest MSE, RMSE, and MAE, and the highest R² value. Additionally, comparative ranking of meteorological variables was conducted to identify the most influential predictors for NO₂, CO, and PM₁₀ concentrations.
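The four metrics can be computed from paired observed and predicted values as in the following sketch. It uses the standard textbook definitions and assumes that Orange's reported scores follow them; the function name is hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return MSE, RMSE, MAE, and R2 for paired observed/predicted values."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)   # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }


if __name__ == "__main__":
    print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]))
```

Because RMSE is the square root of MSE, the two always rank models identically; MAE can rank them differently when a few large errors dominate.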
Hyperparameter selection and tuning
Hyperparameter selection was conducted to ensure fair comparison among models while minimizing overfitting. Model tuning was performed using built-in optimization and cross-validation tools available in Orange (version 3.36.1). For each algorithm, key hyperparameters were adjusted iteratively (Table 1), and optimal values were selected based on minimizing the validation MSE. To maintain consistency, identical training-validation splits were applied across all models. Ensemble methods were configured to balance model complexity and generalization, while regularization strategies were applied where applicable to reduce overfitting.
Table 1.
Hyperparameter settings used for machine learning models.
| Algorithm | Hyperparameter | Value / Range | Description |
|---|---|---|---|
| NN | Hidden layers | 1 | Single hidden layer |
| | Neurons | 50–100 | Optimized via cross-validation |
| | Activation function | ReLU | Nonlinear mapping |
| | Optimizer | Gradient descent | Backpropagation training |
| | Regularization | Dropout (0.2) | Overfitting control |
| DT | Maximum depth | 10–15 | Controls tree complexity |
| | Minimum samples per leaf | 5 | Reduces overfitting |
| | Split criterion | MSE | Regression-based splitting |
| RF | Number of trees | 100 | Ensemble size |
| | Maximum depth | None | Full growth allowed |
| | Features per split | Random feature selection | |
| GB | Number of trees | 100–200 | Boosting iterations |
| | Learning rate | 0.05–0.1 | Controls contribution of each tree |
| | Maximum depth | 3–5 | Weak learner constraint |
Results and discussion
Relationship between NO₂ and meteorological parameters
The relationship between NO₂ concentrations and meteorological variables revealed distinct nonlinear dependencies, as captured by all four algorithms. Figure 4 presents the predicted versus observed NO₂ concentrations using all seven meteorological parameters. Among the models, the RF and GB algorithms demonstrated superior predictive capability, while the DT and NN exhibited moderate performance. Quantitatively, the GB model achieved the lowest mean squared error (MSE = 2.07) and the highest coefficient of determination (R2 = 0.826) of any configuration in Table 2, with day-of-year, dew point, and humidity ranked as the three most informative input features. The RF model followed closely (MSE = 2.39; R2 = 0.799). Conversely, the DT and NN models recorded relatively higher errors, reflecting their limited ability to generalize across the full range of meteorological variability. The superior performance of ensemble algorithms (RF and GB) can be attributed to their robustness in handling nonlinear, multivariate interactions typical of atmospheric datasets. Both models combine multiple weak learners to reduce bias and variance, thus capturing subtle dependencies between meteorological predictors and pollutant concentration.
Fig. 4.
Performance comparison of four machine learning algorithms NN, DT, RF, and GB in predicting nitrogen dioxide concentrations from meteorological parameters.
Table 2.
Comparative performance metrics of four machine learning models (NN, DT, RF, and GB), applied to predict nitrogen dioxide concentrations using different combinations of meteorological inputs. The table summarizes key statistical indicators (MSE, RMSE, MAE and R2).
| Model | Day of year | | | | Day of year and dew point | | | | Day of year, dew point, and humidity | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | MSE | RMSE | MAE | R2 | MSE | RMSE | MAE | R2 | MSE | RMSE | MAE | R2 |
| NN | 5.178 | 2.276 | 1.765 | 0.564 | 5.053 | 2.248 | 1.702 | 0.575 | 5.054 | 2.248 | 1.688 | 0.575 |
| DT | 7.998 | 2.828 | 1.707 | 0.327 | 4.493 | 2.12 | 1.406 | 0.622 | 7.158 | 2.675 | 1.611 | 0.398 |
| RF | 3.038 | 1.743 | 1.293 | 0.744 | 2.395 | 1.547 | 1.061 | 0.799 | 5.339 | 2.311 | 1.555 | 0.551 |
| GB | 2.071 | 1.439 | 1.101 | 0.826 | 2.412 | 1.553 | 1.133 | 0.79 | 2.436 | 1.561 | 1.119 | 0.795 |
From a physical standpoint, the strong dependence of NO2 on humidity and dew point highlights the influence of atmospheric moisture on pollutant chemistry. High humidity promotes heterogeneous reactions that enhance the conversion of NOx into secondary nitrogen compounds. Furthermore, seasonal variations captured by the day-of-year variable indicate temperature-driven photochemical processes, where higher solar radiation during summer enhances photolysis of NO2 to form ozone. Similar findings were reported in earlier studies8,18, which observed comparable seasonal and humidity-dependent trends. Interestingly, including additional meteorological features beyond the top three did not significantly improve model accuracy; in fact, MSE increased marginally. This suggests model saturation, where redundant or weakly correlated features introduce noise rather than additional explanatory power24. Therefore, optimal model complexity in environmental AI applications depends not only on algorithmic depth but also on judicious feature selection.
Relationship between CO and meteorological parameters
The relationship between CO concentration and meteorological variables exhibited a more diffuse predictive pattern than that of NO2 (Fig. 5). Table 3 summarizes the statistical indicators derived from each model. The Gradient Boosting (GB) algorithm yielded the most consistent results across input configurations, reaching MSE = 136.47 and R2 = 0.456 with three inputs, and it recorded a lower MSE than the NN and RF models in every configuration. The DT model was the least stable: it performed worst with a single input (R2 = −0.254) but recorded the lowest error of all models with three inputs (MSE = 116.71; R2 = 0.535). While overall explanatory strength was moderate (R2 < 0.5 for most models), distinct meteorological influences were evident. The day-of-year, dew point, and weather condition variables emerged as the most relevant predictors. The temporal dependence indicates that CO levels vary seasonally, reflecting differences in combustion activity, temperature-driven atmospheric mixing, and photochemical oxidation rates. During cooler months, reduced atmospheric turbulence and lower boundary layer height enhance CO accumulation near the surface. Conversely, higher temperatures in summer accelerate oxidation to CO₂, reducing ambient CO concentrations.
Fig. 5.
Comparative performance of four machine learning algorithms NN, DT, RF, and GB in predicting carbon monoxide (CO) concentrations from meteorological inputs.
Table 3.
Statistical performance of four ML algorithms in predicting carbon monoxide (CO) concentrations based on meteorological parameters.
| Model | Day of year | | | | Day of year and weather conditions | | | | Day of year, dew point, and weather conditions | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | MSE | RMSE | MAE | R2 | MSE | RMSE | MAE | R2 | MSE | RMSE | MAE | R2 |
| NN | 241.066 | 15.526 | 12.208 | 0.04 | 243.847 | 15.616 | 12.363 | 0.029 | 237.506 | 15.411 | 12.217 | 0.054 |
| DT | 314.725 | 17.74 | 10.361 | −0.254 | 249.776 | 15.804 | 8.874 | 0.005 | 116.709 | 10.803 | 6.178 | 0.535 |
| RF | 192.639 | 13.879 | 7.974 | 0.233 | 177.295 | 13.315 | 7.002 | 0.294 | 143.18 | 11.966 | 6.743 | 0.43 |
| GB | 186.364 | 13.652 | 7.472 | 0.258 | 167.417 | 12.939 | 7.143 | 0.333 | 136.471 | 11.682 | 6.553 | 0.456 |
The inclusion of dew point and weather condition variables generally enhanced model performance, particularly for decision tree and ensemble-based models, implying that atmospheric moisture and dispersion state significantly affect CO distribution. Higher dew points often correspond to stagnant air and limited vertical mixing, both of which promote CO persistence. Similar trends have been reported in previous studies17, where low wind speeds and temperature inversions contributed to increased CO accumulation. The moderate predictive strength compared with NO2 suggests that CO levels are less sensitive to short-term meteorological fluctuations and more influenced by emission dynamics, particularly vehicular and industrial combustion sources. AI-based models can thus complement emission inventories by distinguishing meteorological effects from anthropogenic contributions. However, the relatively lower R2 values indicate that improving CO prediction accuracy would require incorporating traffic density, emission rates, and atmospheric boundary layer height, parameters that were not included in the present dataset.
Relationship between PM10 and meteorological parameters
The relationship between PM10 and meteorological factors was generally weak across all algorithms (Fig. 6). As summarized in Table 4, none of the models achieved satisfactory performance: R2 values ranged from −0.975 to 0.394, and MSE values were high (> 250). This poor model performance indicates that PM10 concentrations in the study area are governed by highly variable and non-stationary factors, many of which are not fully captured by basic meteorological parameters.
Fig. 6.
Comparison of modeled and observed PM₁₀ concentrations obtained from NN, DT, RF, and GB algorithms.
Table 4.
Performance summary of the four models used in predicting particulate matter (PM₁₀) concentrations from meteorological inputs.
| Model | Wind speed | | | | Day of year and wind speed | | | | Day of year, wind speed, and humidity | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | MSE | RMSE | MAE | R2 | MSE | RMSE | MAE | R2 | MSE | RMSE | MAE | R2 |
| NN | 487.388 | 22.077 | 12.436 | −0.008 | 369.014 | 19.21 | 8.753 | 0.237 | 347.012 | 18.628 | 8.190 | 0.282 |
| DT | 669.843 | 25.881 | 14.894 | −0.385 | 476.886 | 21.838 | 12.320 | 0.014 | 469.756 | 21.674 | 12.252 | 0.029 |
| RF | 645.276 | 25.042 | 14.134 | −0.334 | 293.139 | 17.121 | 8.228 | 0.394 | 344.043 | 18.548 | 8.218 | 0.289 |
| GB | 521.082 | 22.827 | 12.848 | −0.078 | 867.641 | 29.456 | 13.765 | −0.974 | 954.888 | 30.901 | 12.762 | −0.975 |
The three best-performing predictors (wind speed, day-of-year, and humidity) provided marginal improvements in predictive capability when combined. The NN and RF algorithms performed slightly better than DT and GB, though still with limited accuracy. The negative R2 values observed for several models mean that those models predicted PM10 less accurately than a constant equal to the mean observed concentration, which suggests that both the single-learner and ensemble-based approaches struggled to generalize PM10 patterns in the dataset25. Several factors may explain these outcomes. PM10 levels in arid regions such as Dammam are strongly influenced by non-linear dust resuspension events, which depend on threshold wind speeds, soil moisture, and surface roughness. While moderate winds facilitate dispersion, strong gusts (> 6 m/s) can mobilize large quantities of mineral dust, temporarily elevating PM10 concentrations independent of humidity or temperature. Also, sporadic local activities such as construction, road traffic, and industrial processes can introduce abrupt emission spikes that disrupt model continuity. These stochastic influences are difficult to capture with purely meteorology-based models. The weak model performance therefore highlights a limitation of AI correlation analyses when applied to heterogeneous particulate datasets. Similar findings were reported in an earlier study18, which noted that PM levels in coastal and desert regions exhibit poor predictability based solely on meteorological inputs. Incorporating aerosol optical depth (AOD), soil particle index, and surface dust emission data could substantially enhance future models.
The observed partial correlations suggest certain meteorological controls: higher relative humidity enhances particle growth through hygroscopic processes, increasing measured PM10 mass. Moderate winds promote dispersion, whereas strong winds increase dust entrainment. The day-of-year variable reflects recurring patterns of dust storm frequency and intensity, which typically peak during late spring and summer. These results emphasize the multifactorial nature of PM10 dynamics and the need for hybrid models that combine meteorological predictors with emission and land-surface data for improved reliability. The low and, in some cases, negative R2 values observed for PM10 prediction indicate that models driven solely by standard meteorological variables are unable to adequately represent the dominant controls on particulate matter variability in arid environments. Unlike gaseous pollutants, PM10 concentrations in desert regions are strongly influenced by episodic and non-linear dust resuspension events, which depend on surface conditions, soil moisture, land cover, and threshold wind speeds rather than on gradual meteorological variations alone. Negative R2 values therefore reflect the inability of purely meteorology-based models to generalize PM10 behavior, rather than numerical instability or algorithmic inadequacy.
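The meaning of a negative R2 follows directly from the definition of the metric. As a worked relation, assuming R2 was computed in the standard way on the held-out data (the exact formula used by the software is not stated in the text):

```latex
R^2 \;=\; 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    \;=\; 1 - \frac{\mathrm{MSE}}{\sigma_y^2},
\qquad
R^2 < 0 \;\Longleftrightarrow\; \mathrm{MSE} > \sigma_y^2 .
```

Here $y_i$ are the observed concentrations, $\hat{y}_i$ the model predictions, $\bar{y}$ the mean of the observations, and $\sigma_y^2$ their population variance. Under this definition, a model that always predicts $\bar{y}$ scores exactly zero, so a negative value indicates a squared error larger than that of this constant baseline.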
Comparative assessment of algorithmic performance
The comparative evaluation of the four algorithms revealed clear differences in their predictive accuracy, robustness, and capacity to capture nonlinear dependencies between meteorological parameters and air pollutant concentrations. These differences reflect both the intrinsic properties of the algorithms and the nature of the atmospheric data analyzed. Overall, ensemble-based algorithms (RF and GB) generally outperformed individual learners (DT and NN), most clearly for the gaseous pollutants. The GB model achieved the highest accuracy for NO2 and the most consistent accuracy for CO, demonstrating its capability to model subtle nonlinear interactions and compensate for data heterogeneity. The model’s iterative learning process, in which successive trees correct the residual errors of previous ones, enables it to systematically reduce bias and variance, thereby achieving better generalization in small to medium-sized datasets. This is particularly advantageous for environmental data, where nonlinear meteorological-pollutant interactions are common and measurement noise is often unavoidable.
The RF model also performed reliably, ranking second overall. Its bagging approach, which constructs numerous decision trees trained on random subsets of the data and predictor features, provides high stability against overfitting and robust handling of multicollinearity among meteorological variables. RF’s capability to compute feature importance further enhances interpretability, allowing identification of the dominant meteorological drivers for each pollutant. For instance, humidity and dew point consistently ranked among the top predictors for NO2, while temperature and wind speed were most influential for CO and PM10. By contrast, the DT algorithm showed comparatively lower accuracy. Although its hierarchical structure facilitates intuitive interpretation, its performance deteriorated in the presence of continuous, noisy, and interdependent environmental variables. The single-tree framework is prone to high variance and overfitting, leading to inconsistent predictions when applied to heterogeneous meteorological datasets. The NN model also underperformed relative to the ensemble methods. This can be attributed to the limited sample size available, which constrains the NN’s ability to learn complex mappings without overfitting. Neural networks generally require larger and more diverse datasets to effectively approximate nonlinear functions, particularly when multiple hidden layers and activation functions are involved. Nevertheless, the NN model showed potential when key features such as humidity, dew point, and wind speed were emphasized, suggesting that further optimization with larger datasets and additional regularization techniques could enhance its predictive strength.
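The two sources of randomness in bagging described above, random rows and random predictors, can be sketched as follows. This is a simplified illustration under stated assumptions: the predictor names follow the meteorological variables listed in the abstract, while the row count, tree count, and subset size are arbitrary. The predictor subset is drawn once per tree for brevity, whereas common RF implementations redraw it at every split.

```python
import random


def draw_tree_inputs(n_rows, feature_names, n_features_per_tree, rng):
    """Draw the training rows and predictor subset for one tree of a bagged ensemble."""
    # Bootstrap sample: n_rows indices drawn with replacement.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Random predictor subset, drawn without replacement.
    features = rng.sample(feature_names, n_features_per_tree)
    # Rows never drawn are "out-of-bag" and can serve as a built-in validation set.
    out_of_bag = set(range(n_rows)) - set(rows)
    return rows, features, out_of_bag


rng = random.Random(42)
feature_names = ["temperature", "humidity", "wind_speed",
                 "wind_direction", "dew_point", "pressure"]
n_rows, n_trees = 200, 100  # hypothetical sizes, not the study's

draws = [draw_tree_inputs(n_rows, feature_names, 3, rng) for _ in range(n_trees)]
oob_fraction = sum(len(oob) for _, _, oob in draws) / (n_rows * n_trees)

print("predictors for first tree:", draws[0][1])
print("mean out-of-bag fraction:", round(oob_fraction, 3))
```

Because each tree sees a different sample and a different predictor subset, their errors are only partly correlated, which is why averaging them stabilizes predictions. On average roughly a third of the rows are left out of each bootstrap sample.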
A clear trend across all models was the diminishing, and eventually negative, return of adding more meteorological parameters. As the number of predictors increased beyond three to four, the models exhibited higher MSE and lower R2 values. This effect reflects the classical curse of dimensionality in machine learning, where redundant or weakly correlated features introduce noise and reduce overall generalization. Therefore, feature selection and dimensionality reduction remain critical preprocessing steps in environmental modeling. Retaining only meteorological parameters with strong physical and statistical relationships to pollutant dynamics, such as humidity, wind speed, and temperature, enhances both accuracy and computational efficiency. From a theoretical perspective, the superior performance of GB and RF supports growing evidence that ensemble learning methods are better suited for environmental systems characterized by nonlinearity, stochasticity, and feedback loops26. These models provide a balance between flexibility and interpretability, making them well suited to translating complex atmospheric processes into quantitative predictive frameworks.
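The effect of irrelevant predictors on held-out error can be illustrated with a small synthetic experiment. The sketch below deliberately uses k-nearest-neighbour regression as a stand-in, because it is easy to write from scratch and is especially sensitive to extra dimensions. It is not one of the four algorithms evaluated in this study, and tree ensembles are generally more tolerant of noise features. The data-generating function, sample sizes, and `k` are all assumptions made for illustration.

```python
import random


def knn_predict(train_x, train_y, query, k=5):
    """Average the targets of the k training points closest to the query."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return sum(train_y[i] for i in order[:k]) / k


def holdout_mse(n_noise, n_train=200, n_test=100, seed=0):
    """Held-out MSE when n_noise irrelevant predictors are appended to two informative ones."""
    rng = random.Random(seed)

    def sample():
        informative = [rng.random(), rng.random()]
        noise = [rng.random() for _ in range(n_noise)]
        # The target depends only on the two informative predictors.
        target = 3.0 * informative[0] + 2.0 * informative[1] ** 2 + rng.gauss(0.0, 0.1)
        return informative + noise, target

    train = [sample() for _ in range(n_train)]
    test = [sample() for _ in range(n_test)]
    train_x = [x for x, _ in train]
    train_y = [y for _, y in train]
    errors = [(knn_predict(train_x, train_y, x) - y) ** 2 for x, y in test]
    return sum(errors) / len(errors)


mse_by_noise = {n: holdout_mse(n) for n in (0, 2, 8)}
for n, mse in mse_by_noise.items():
    print(f"{n} irrelevant predictors -> held-out MSE {mse:.3f}")
```

In this toy setting the held-out error is far larger with eight irrelevant predictors than with none, because distances become dominated by dimensions that carry no information about the target. A similar held-out comparison, repeated per pollutant with the actual models, is one simple way to decide how many meteorological parameters to retain.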
Environmental and modeling implications
The results of this study have significant implications for both environmental management and the advancement of data-driven modeling in atmospheric sciences. The differential response of pollutants to meteorological factors shows the need for pollutant-specific predictive frameworks, rather than a single universal air quality model. From an environmental standpoint, the findings reveal that NO2 concentrations are primarily governed by atmospheric moisture content and seasonal variation. The strong predictive relationship with humidity and dew point suggests that local meteorological conditions substantially influence nitrogen oxide chemistry, particularly through reactions involving aqueous aerosols and heterogeneous catalysis. This implies that even in arid regions such as the Eastern Province of Saudi Arabia, transient increases in atmospheric humidity associated with coastal winds or early morning dew formation can intensify secondary pollutant formation. Consequently, integrating high-resolution meteorological monitoring with real-time air quality forecasting could enable early warnings of ozone and NO2 pollution episodes, which are of major health concern.
The moderate relationship of CO with temperature and atmospheric mixing conditions indicates that pollutant persistence is strongly tied to boundary layer dynamics. The relatively low R2 values suggest that meteorological parameters alone cannot fully explain CO variability, emphasizing the dominant role of anthropogenic emissions. Incorporating real-time traffic data and emission inventories would substantially enhance model performance. Nevertheless, the current results demonstrate that ensemble-based AI models can successfully disentangle the meteorological contribution from background emission signals, enabling more accurate source attribution analyses.
The weak correlations observed for PM10 emphasize the complexity of dust-dominated environments. In arid regions, PM10 variability arises from both meteorological forcing (e.g., wind gusts, humidity, and surface dryness) and episodic dust entrainment from natural and anthropogenic activities. The stochastic nature of these processes explains the poor predictive performance of models trained exclusively on standard meteorological inputs. To improve PM10 predictability, future studies should adopt more comprehensive modeling frameworks that integrate meteorological predictors with additional explanatory variables. These may include land-surface parameters (e.g., soil moisture, vegetation cover, and surface roughness), emission-related factors (e.g., construction activity, traffic density, and industrial sources), and satellite-derived products such as aerosol optical depth (AOD) and dust indices. Furthermore, hybrid modeling approaches that combine physical dust emission and transport models with machine learning techniques may offer a more robust representation of PM10 dynamics. Such hybrid frameworks can leverage physical process understanding while retaining the flexibility of data-driven learning, thereby improving predictive reliability in dust-dominated arid regions.
Beyond pollutant-specific observations, the study provides broader methodological insights into ML applications for atmospheric research. The superior performance of the GB algorithm indicates that hybrid models combining ensemble learning with feature selection techniques can effectively balance predictive accuracy and interpretability15. The implications of this study extend to policy and operational forecasting. The demonstrated ability of ML to predict air quality parameters based on routine meteorological measurements provides a cost-effective alternative for regions with limited sensor networks. Implementing such models in regional air quality management systems could support short-term pollution forecasting to inform public health advisories; emission control strategies, especially for transport and industrial sources; and climate adaptation planning, by linking pollutant dynamics with extreme meteorological events (e.g., dust storms, temperature inversions).
Conclusions and recommendations
This study applied four algorithms to investigate the complex relationships between meteorological conditions and air quality parameters (NO2, CO, and PM10) in the Dammam Metropolitan Area, Eastern Saudi Arabia. Using five years (2017 to 2021) of meteorological and air quality data, the models were trained and evaluated to determine their predictive performance and to identify the dominant meteorological controls on pollutant variability. The following key conclusions can be drawn:
- The GB and RF algorithms demonstrated superior predictive accuracy and stability compared with NN and DT, confirming the effectiveness of ensemble-based approaches in capturing nonlinear meteorological-pollutant interactions and mitigating overfitting in relatively small datasets.
- For NO2, the GB model achieved the highest predictive accuracy (R2 ≈ 0.83), with humidity, dew point, and seasonal variation (day-of-year) identified as the most influential variables. These findings highlight the critical role of moisture and photochemical activity in modulating nitrogen oxide concentrations, even under arid climatic conditions.
- While atmospheric temperature and mixing conditions influenced CO levels, the results suggest that emission intensity and combustion activity remain the primary determinants. Integrating emission inventories and traffic data is therefore essential for improving CO prediction.
- The limited predictability of PM10 concentrations shows the dominance of non-linear and episodic dust processes in arid regions. Future models should incorporate additional datasets such as aerosol optical depth, soil moisture, and surface wind thresholds to better capture dust entrainment dynamics.
- The inclusion of only the most relevant meteorological parameters enhanced accuracy, while the addition of redundant features led to increased model error, illustrating the importance of dimensional control and feature ranking in environmental AI applications.
- The demonstrated capability of GB and RF algorithms to predict pollutant behavior using standard meteorological inputs supports their integration into air quality management frameworks. Such models can inform early warning systems, emission mitigation strategies, and environmental policy decisions in data-limited regions.
By validating the applicability of ensemble machine learning techniques under arid conditions, this research expands the global evidence base for AI-driven air quality modeling and highlights the potential of hybrid approaches that merge data-driven learning with physical atmospheric processes. Future research should focus on hybridizing AI with physical atmospheric models, incorporating emission inventories and satellite-based inputs, to enhance predictive reliability and to advance integrated air quality and climate modeling efforts.
Author contributions
B T: Conceptualization, Methodology, Visualization, Validation, Formal analysis, Investigation, Resources, Writing: Original Draft, Writing: Review & Editing.
Data availability
The data used in this study are available at the following link: [https://doi.org/10.5281/zenodo.17927978](https://doi.org/10.5281/zenodo.17927978).
Declarations
Competing interests
The author declares no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Baccarelli, A. et al. Air pollution exposure and lung function in highly exposed subjects in Beijing, China: a repeated-measure study. (2014). http://nrs.harvard.edu/urn-3:HUL.InstRepos:13347421
- 2. Inam, S. A. A review of artificial intelligence for predicting climate driven infectious disease outbreaks to enhance global health resilience. Discov Public. Health. 22, 738 (2025).
- 3. Girotti, C. et al. Air pollution dynamics: the role of meteorological factors in PM10 concentration patterns across urban areas. City Environ. Interact. 25, 100184 (2025).
- 4. Silva, T. et al. North African dust intrusions and increased risk of respiratory diseases in Southern Portugal. Int. J. Biometeorol. 65, 1767–1780 (2021).
- 5. Zhang, H., Wang, Y., Hu, J., Ying, Q. & Hu, X. M. Relationships between meteorological parameters and criteria air pollutants in three megacities in China. Environ. Res. 140, 242–254 (2015).
- 6. Jenkin, M. E. & Hayman, G. D. Photochemical ozone creation potentials for oxygenated volatile organic compounds: sensitivity to variations in kinetic and mechanistic parameters. Atmos. Environ. 33, 1275–1293 (1999).
- 7. Nejad, M. T., Ghalehteimouri, K. J., Talkhabi, H. & Dolatshahi, Z. The relationship between atmospheric temperature inversion and urban air pollution characteristics: a case study of Tehran, Iran. Discov Environ. 1, 17 (2023).
- 8. Zhang, L. et al. Impact of air humidity fluctuation on the rise of PM mass concentration based on the high-resolution monitoring data. Aerosol Air Qual. Res. 17, 543–552 (2017).
- 9. Essa, K. S. M., Mubarak, F. & Elsaid, S. E. M. Effect of the plume rise and wind speed on extreme value of air pollutant concentration. Meteorol. Atmospheric Phys. 93, 247–253 (2006).
- 10. Wang, R., Cui, K., Sheu, H. L., Wang, L. C. & Liu, X. Effects of precipitation on the air quality index, PM2.5 levels and on the dry deposition of PCDD/Fs in the ambient air. Aerosol Air Qual. Res. 23, 220417 (2023).
- 11. Puth, M. T., Neuhäuser, M. & Ruxton, G. D. Effective use of Spearman’s and Kendall’s correlation coefficients for association between two measured traits. Anim. Behav. 102, 77–84 (2015).
- 12. Duch, W. & Mandziuk, J. (eds) Challenges for Computational Intelligence (2007) [Book reviews]. IEEE Trans. Neural Netw. 20, 542–543 (2009).
- 13. Inam, S. A. et al. PR-FCNN: a data-driven hybrid approach for predicting PM2.5 concentration. Discov Artif. Intell. 4, 75 (2024).
- 14. Inam, S. A., Zaidi, S. M. H., Khan, A. A. & Ullah, S. A neural network approach to carbon emission prediction in industrial and power sectors. Discov Appl. Sci. 7, 640 (2025).
- 15. Kakani, V., Nguyen, V. H., Kumar, B. P., Kim, H. & Pasupuleti, V. R. A critical review on computer vision and artificial intelligence in food industry. J. Agric. Food Res. 2, 100033 (2020).
- 16. Arhami, M., Kamali, N. & Rajabi, M. M. Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations. Environ. Sci. Pollut Res. 20, 4777–4789 (2013).
- 17. Chadalavada, S. et al. Application of artificial intelligence in air pollution monitoring and forecasting: a systematic review. Environ. Model. Softw. 185, 106312 (2025).
- 18. Cheng, Y. et al. A second-order closure turbulence model: new heat flux equations and no critical Richardson number. https://doi.org/10.1175/JAS-D-19-0240.1 (2020).
- 19. Hamrani, A., Akbarzadeh, A. & Madramootoo, C. A. Machine learning for predicting greenhouse gas emissions from agricultural soils. Sci. Total Environ. 741, 140338 (2020).
- 20. Dawson, C. W. & Wilby, R. An artificial neural network approach to rainfall-runoff modelling. Hydrol. Sci. J. 43, 47–66 (1998).
- 21. Teli, S. & Student, M. T. A survey on decision tree based approaches in data mining. in (2015).
- 22. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
- 23. Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
- 24. Madukpe, V. N., Ugoala, B. C. & Zulkepli, N. F. S. Topological approach and kernel principal component analysis for air pollution source apportionment. Int. J. Environ. Res. 19, 260 (2025).
- 25. Inam, S. A., Rajput, H. & Umer, S. An explainable hybrid stacked deep learning framework for forecasting PM10 concentrations in urban air. Explora Environ. Resour. 2, 025380069 (2025).
- 26. Mehrani, M. J. et al. Application of a hybrid mechanistic/machine learning model for prediction of nitrous oxide (N2O) production in a nitrifying sequencing batch reactor. Process. Saf. Environ. Prot. 162, 1015–1024 (2022).