Scientific Reports. 2022 Nov 19;12:19949. doi: 10.1038/s41598-022-24470-5

Identifying a suitable model for predicting hourly pollutant concentrations by using low-cost microstation data and machine learning

Rongjin Yang 1,#, Lizeyan Yin 3,#, Xuejie Hao 2, Lu Liu 2, Chen Wang 2, Xiuhong Li 2, Qiang Liu 2
PMCID: PMC9675857  PMID: 36402807

Abstract

Accurately predicting the concentration of PM2.5 (fine particles with a diameter of 2.5 μm or less) is essential for health risk assessment and for formulating air pollution control strategies. Large volumes of air pollution data are now available, and efficiently mining their hidden features to obtain future pollutant concentrations is very important for the prevention and control of air pollution. We therefore built pollutant prediction models based on the Light Gradient Boosting Machine (LightGBM), a shallow machine learning method, and the Long Short-Term Memory (LSTM) neural network. First, the PM2.5 concentration data of 34 air quality stations in Beijing and the data of 18 weather stations were matched in time and space to obtain an input data set. The input data set was then cleaned and preprocessed, and the training set was obtained through input feature extraction, input factor normalization, and outlier processing. Hourly PM2.5 concentration prediction was achieved in experiments conducted with the hourly PM2.5 data of Beijing from January 1, 2018 to October 1, 2020, and the optimal hourly series prediction results were obtained after model comparison. The comparison of the two models shows that the RMSE of the LSTM model is nearly 50% lower than that of LightGBM for each pollutant, and its predictions fit the actual observations more closely. Exploring the input step size of the LSTM model showed that a 3-h input window yielded higher accuracy than a 12-h window. The results can support the management and decision-making of environmental protection departments and the formulation of preventive measures for emergency pollution incidents.

Subject terms: Environmental sciences, Environmental social sciences

Introduction

PM2.5 refers to particulate matter with an aerodynamic equivalent diameter of 2.5 μm or less in the ambient air1, and it is the main monitoring object of the National Air Quality Monitoring Station. It is produced by human production and daily life in quantities that exceed the atmosphere's self-purification capacity, and it may affect the environment2. PM2.5 covers a broad range of pollutants, including those produced by human activities, those produced by natural processes (e.g., desert dust), and those resulting from chemical and physical processes in the atmosphere (e.g., molecules aggregating to form particles). The effects of PM2.5 on human health have been extensively studied3. Whether through a short-term outbreak or long-term accumulation, this pollutant has an important impact on mankind. In particular, the smog caused by PM2.5 not only makes the weather cloudy with low visibility, posing hidden dangers to people's travel safety4, but also increases the mortality rate of diseases of the respiratory, cardiovascular, and nervous systems5,6. In addition, localized air pollution may affect regional and even global climate change7, which may in turn cause other environmental and health problems8,9. Therefore, it has become a global consensus that the monitoring and prediction of PM2.5 pollutants are extremely important.

As the capital of China, Beijing is densely populated and seriously affected by air pollution. The frequent outbreaks of severe weather phenomena such as haze and sandstorms, and the accompanying increase in respiratory diseases, are particularly urgent problems for such a densely populated, economically developed first-tier city, so its air pollution problem has become a focus of public attention. It is therefore necessary to use machine learning methods to predict PM2.5 concentration. Yan Xing et al. improved the precision of PM2.5 concentration inversion from the MODIS sensor by using a deep learning network, and the spatial–temporal distribution characteristics of PM2.5 in Beijing were obtained through analysis of the retrieved high spatial–temporal resolution PM2.5 results45. The spatial distribution trend of pollutants, however, depends on accurate ground station data.

Air pollution prediction research has experienced a development process from qualitative analysis to quantitative modeling from the 1960s to the present. In 1960, Lawrence E qualitatively described the characteristics of weather conditions such as wind direction and atmospheric stability under the condition of poor air quality data, and speculated that the high incidence period of air pollution could be estimated based on the prediction of weather conditions. Although there was no quantitative equation, this exploration laid a theoretical foundation for the subsequent emergence of quantitative analysis models, especially numerical prediction models10.

Time series prediction analysis is a mathematical method to reason about the performance results of the upcoming periods based on all the laws and characteristics of past materials and data11, which has been widely used in various fields, including the economic market12, energy consumption13, biomedicine14, environmental monitoring15. According to the principle of model construction, the time series prediction models of air pollutants are divided into two types: mechanism models and non-mechanism models.

The mechanism model simulates the transformation and diffusion processes of pollutants in the air based on atmospheric dynamics. The movement of pollutants in the horizontal and vertical directions, the emissions of different pollution sources, and the evolution of the physical and chemical properties of pollutants in the air are fully considered. Commonly used mechanism models include the Nested Air Quality Prediction Modeling System (NAQPMS) and the City Air Pollution Numerical prediction System (CAPPS) independently developed by China, as well as the widely used CMAQ16,17, CAMx18, WRF-Chem19, ADMS20, and CHIMERE21. The prediction systems in China can not only predict single pollutants such as PM2.5, SO2 and O3 on a regional scale22, but also simulate the occurrence of pollution events. Wu Ying et al. analyzed the prediction of ozone in Taizhou with the NAQPMS and CMAQ models and found that the two models have their own advantages and disadvantages in different seasons, while the overall prediction effects are both within the ideal range23. Ma Siqi et al. used four models, WRF-Chem, CHIMERE, CMAQ and CAMx, to simulate sandstorm weather in Northeast China and obtained the performance of each model under different parameter configurations. Although there are slight differences between the predictions of each model and the observed PM10 (inhalable particles, usually those with a diameter below 10 μm) concentration, each model reproduced the occurrence of the sandstorm relatively faithfully24. Taking more comprehensive factors into account, mechanism models express the entire process of pollutant generation, transportation, transformation and dissipation through parameterized equations, which is more in line with the actual emission situation.
However, it is necessary to consider factors such as complex and changeable meteorological fields, pollutant emission inventories, and geographical features when constructing a numerical forecast model. Thus, model construction is difficult for people who do not have the knowledge of traditional meteorology. Furthermore, due to simplification effects, lack of parameters or unrepresentative observations, it may not be possible to simulate atmospheric diffusion under stable conditions, which usually results in low prediction accuracy25.

The prediction of the non-mechanism model does not require complex parameters or accurate physical and chemical equations. It is dedicated to better prediction results without considering the mechanism process. Through statistical learning of massive historical pollutant data, it summarizes the law of concentration changes and predicts the pollutant concentration for a period in the future. Commonly used statistical models include generalized linear regression (LR), autoregressive integrated moving average (ARIMA), the projection pursuit model (PP), principal component analysis (PCA), and support vector regression (SVR)26, all of which realize prediction by establishing a linear regression relationship between the input time series pollutant data and the output results. These models have achieved good results in some research. Zhang Yuli et al. constructed a power-function multiple linear regression model of PM2.5 in Shanghai; after cross-validation, the correlation coefficient was 0.94 and the root mean square error was 1. Since the fitted predictions agreed well with the true values, the model can serve as a prediction model under ideal conditions and provides relevant control recommendations to the local government27. Peng Sijun et al. conducted prediction with the ARIMA model using Wuhan's daily average PM2.5 concentration data; compared with the grey model, the segmented time series prediction performed better for short-term PM2.5 prediction28. Bing-Chun Liu et al. carried out a collaborative prediction of the Air Quality Index (AQI) in the Beijing-Tianjin-Hebei region with the SVR model, and found that the MAPE (mean absolute percentage error, a common statistical measure of forecast accuracy in time series forecasting) in all cases was between 0.05 and 0.09, which means the prediction results are highly reliable29. Although these models have performed well in the prediction of air pollutants, they still have some shortcomings compared with nonlinear techniques30, because pollutant time series are not simply linear relationships: other factors such as wind speed, wind direction, and human activities are also involved. Comparing the pros and cons of linear and non-linear methods in predicting the concentration of PM10, Abdullah, S. et al. concluded that the error range of the non-linear model in predicting particulate matter concentration was reduced by at least 30%, whether in rural, suburban or urban areas, and that the artificial neural network can generate more accurate PM10 data31. Thus, scholars have mostly focused on non-linear models in the study of predictive models in recent years.

Machine learning (ML), as an intelligent learning method that integrates multidisciplinary knowledge and uses computers to simulate human activities32, gives full play to its advantages in fitting non-linear problems, especially its ability to automatically classify and identify and efficiently process and analyze data in the current era of big data. Decision trees, random forests, Bayesian learning, artificial neural networks, et al. are all core algorithms of machine learning, which have been applied to air pollution prediction research by many scholars at home and abroad. The first several algorithms were used in the prediction of air pollutants earlier because of their relatively simple structure and easy implementation. Gocheva-Ilieva et al. proposed a general method to establish a nonlinear model of environmental time series quality by using the powerful data mining technology of Classification and Regression Tree (CART), and the results are in good agreement with the measured data. CART is better than ARIMA33 in predicting the concentration of PM10 in Europe. Ren Cairong and Xie Gang predicted the PM2.5 concentration in Taiyuan based on random forests and meteorological data. Model verification showed that random forests model has better accuracy and recall rate34. Sujit K. Sahu et al. came up with a Bayesian hierarchical space–time model to predict Ozone concentration in the eastern United States, and found that the data obtained by the new model was more accurate than the model results based on only Eta-CMAQ prediction data. The time resolution was improved, and the prediction of the concentration value in the space position was more accurate35. Osowski, Stanislaw and Garanty, Konrad predicted atmospheric pollution days in northern Poland in the methods of support vector machines and wavelet decomposition, and found that the prediction results were in good agreement with the actual measured values, no matter the pollutant type was NO2, CO, SO2 or dust36.

Compared with the above-mentioned machine learning algorithms, artificial neural networks have the characteristics of strong fault tolerance and dynamic stability37; that is, the requirements for input data are relatively low, as the data does not have to be continuous, and the networks remain stable under external disturbances. Artificial neural networks contain a large number of nodes, organized into an input layer, an output layer, and at least one hidden layer. As a result, this model can perform highly complex mappings on nonlinear data, thereby inferring the subtle relationship between the input data set and the output parameters. At present, artificial neural networks have many model classifications, including the feedforward neural network (FNN), the back propagation (BP) algorithm, and the recurrent neural network (RNN)38. With fast calculation speed and high prediction accuracy, they have been widely used in the field of air quality prediction and have achieved good results in the past few years. He Jianjun et al. used meteorological data, pollution emission data, circulation type data derived from the WRF model and observation data to derive an ANN model to predict the daily concentrations of SO2, NO2 and PM10 in Lanzhou, China. The results showed that the models can reproduce the pollution level and its daily changes well, and the correlation coefficients of the daily averages of the three pollutants ranged from 0.71 to 0.8339. Zhang Hong et al. used a BP neural network model with different air quality parameters to predict the temporal and spatial distribution of the annual average concentration of PM10 in Taiyuan. The prediction results of the model were consistent with the change trend of the observed values, and the correlation coefficient was 0.7240. Mohammad adopted a combination of ANNs and Monte Carlo simulations (MCSs). Taking Tehran as a case, wind speed, temperature, relative humidity and wind direction were selected as the input variables of neural network models to simulate the concentrations of five pollutants. The determination coefficient (R2) between simulated and observed carbon monoxide, nitrogen oxides, nitrogen dioxide, nitric oxide and PM10 levels was greater than 0.82, showing a high correlation, which indicates that the method combining ANNs and MCSs has good application prospects for analyzing the uncertainty of air pollution prediction41. Grivas et al. built a neural network model for hourly prediction of PM10 concentration in Athens, and the results were quite satisfactory: the R2 of the four-point independent test set was between 0.50 and 0.67, and the consistency index was between 0.80 and 0.89. Compared with the multiple linear regression model developed at the same time (R2 between 0.29 and 0.35), the performance of the studied neural network model was superior42.

With the deepening of studies, a type of model that can explore the context of time series was introduced into the prediction of atmospheric pollutant concentration. Kangil Kim et al. applied the recursive network LSTM with memory structure to environmental time series problems, such as water pollution, air pollution and Ozone alarm. It turned out that the recursive network with memory had better predictive performance in non-stationary environments and long-term time lag conditions43. Yi-Ting Tsai et al. proposed to predict the concentration of PM2.5 based on LSTM, and conducted an evaluation experiment of hourly PM2.5 concentration prediction at 66 stations in Taiwan, the results of which proved that this method can effectively predict the value of PM2.544.

In summary, shallow machine learning methods such as decision trees can be used to predict the concentration of air pollutants, and the prediction performance of the CART algorithm has been evaluated. LightGBM, which is also a decision-tree-based model, achieves results similar to neural networks when processing massive data features, with fast processing speed and low memory usage. As a kind of neural network algorithm, the LSTM model has also made certain progress in the prediction of single pollutants such as PM2.5.

However, the comparison between the multiple pollutants prediction results of machine learning and neural network in the Beijing area is not clear. This research has conducted in-depth exploration and experiments in order to find the optimal time prediction model.

Materials and methods

Data collection

PM2.5 Data

The PM2.5 monitoring data selected in this study are hourly data of 34 stations (there are 35 original stations, but the Botanical Garden station was discarded due to serious data gaps in 2019 and 2020) from the website of the Beijing Municipal Ecological Environment Monitoring Center (http://zx.bjmemc.com.cn/?timestamp=1613378868776), covering January 1, 2018 to October 1, 2020. The detailed information of the monitoring stations is shown in Table 1; the unit is micrograms per cubic meter (μg/m3).

Table 1.

Information of air pollution monitoring stations.

ID Station Longitude Latitude Sort ID Station Longitude Latitude Sort
1 Fangshan 116.136°E 39.742°N Suburb 19 Dongsi 116.417°E 39.929°N Main Urban
2 Daxing 116.404°E 39.718°N 20 Tiantan 116.407°E 39.886°N
3 Yizhuang 116.506°E 39.795°N 21 Guanyuan 116.339°E 39.929°N
4 Tongzhou 116.663°E 39.886°N 22 Wanshouxigong 116.352°E 39.878°N
5 Shunyi 116.655°E 40.127°N 23 Olympic Sports Center 116.397°E 39.982°N
6 Changping 116.23°E 40.217°N 24 Agriculture exhibition center 116.461°E 39.937°N
7 Mentougou 116.106°E 39.937°N 25 Wanliu 116.287°E 39.987°N
8 Pinggu 117.1°E 40.143°N 26 Northern New District 116.174°E 40.09°N
9 Huairou 116.628°E 40.328°N 27 Fengtai garden 116.279°E 39.863°N
10 Miyun 116.832°E 40.37°N 28 Yungang 116.146°E 39.824°N
11 Yanqing 115.972°E 40.453°N 29 Ancient city 116.184°E 39.914°N
12 Dingling 116.22°E 40.292°N Control Area (CA) 30 Qianmen 116.395°E 39.899°N Traffic Pollution (TP)
13 Badaling 115.988°E 40.365°N 31 Yongdingmen Inner 116.394°E 39.876°N
14 Miyun Reservoir 116.911°E 40.499°N 32 Xizhimen north 116.349°E 39.954°N
15 Donggao Village 117.12°E 40.1°N 33 South third ring road 116.368°E 39.856°N
16 Yongledian 116.783°E 39.712°N 34 East fourth ring 116.483°E 39.939°N
17 Yufa 116.3°E 39.52°N
18 Liulihe 116°E 39.58°N

Meteorological data

The concentration of PM2.5 pollutants is closely related to meteorological parameters: when weather conditions such as wind speed and temperature do not favor dispersion, pollutant concentrations are higher. Therefore, this study obtained hourly data from 18 ground monitoring stations in Beijing from the National Meteorological Science Data Center (http://data.cma.cn) and matched them with the latitude and longitude of the air pollution monitoring stations. The latitude and longitude information of the 18 stations is shown in Table 2 below. There are 6 data elements, namely 2-min average wind direction (WIN_D, unit: degree), 2-min average wind speed (WIN_S, unit: m/s), temperature (tem, unit: °C), relative humidity (RHU, unit: percent), precipitation (PRE_1h, unit: millimeter), and horizontal visibility (visibility, unit: meter). A value of 999999 or null in the meteorological monitoring data represents a missing observation due to factors such as monitoring equipment problems, network transmission, or server storage; 9999998 represents no observation. 999990 in the rainfall data represents trace rainfall, and 999017 in the wind direction data represents calm wind. These marker values are significantly higher than normal monitoring data and need to be standardized.
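As an illustration, the marker handling described above can be sketched in pandas. The marker codes are those quoted in the text; the column names and toy values, and the choice to treat trace rainfall as 0 mm and calm wind as missing direction, are assumptions for this example.

```python
import pandas as pd
import numpy as np

def clean_met_markers(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize the special marker values in the meteorological data."""
    df = df.copy()
    # 999999 (missing) and 9999998 (no observation) become NaN everywhere.
    df = df.replace({999999: np.nan, 9999998: np.nan})
    # 999990 in precipitation marks trace rainfall: treat as 0 mm (a modeling choice).
    df["PRE_1h"] = df["PRE_1h"].replace(999990, 0.0)
    # 999017 in wind direction marks calm wind: no meaningful direction, so NaN.
    df["WIN_D"] = df["WIN_D"].replace(999017, np.nan)
    return df

raw = pd.DataFrame({
    "WIN_D": [180.0, 999017.0, 999999.0],
    "WIN_S": [2.1, 0.0, 9999998.0],
    "PRE_1h": [0.0, 999990.0, 1.2],
})
clean = clean_met_markers(raw)
```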

Table 2.

Information of meteorological monitoring stations in Beijing.

Station Name Longitude Latitude Station Name Longitude Latitude
54398 Shunyi 116.37°E 40.08°N 54499 Changping 116.13°E 40.13°N
54399 Haidian 116.17°E 39.59°N 54501 Zhaitang 115.41°E 39.58°N
54406 Yanqing 115.58°E 40.27°N 54505 Mentougou 116.06°E 39.56°N
54416 Miyun 116.52°E 40.23°N 54511 Beijing 116.28°E 39.48°N
54419 Huairou 116.38°E 40.22°N 54513 Shijingshan 116.12°E 39.57°N
54421 Miyun Shangdianzi 117.07°E 40.39°N 54514 Fengtai 116.15°E 39.52°N
54424 Pinggu 117.07°E 40.1°N 54594 Daxing 116.21°E 39.43°N
54431 Tongzhou 116.38°E 39.55°N 54596 Fangshan 116.12°E 39.46°N
54433 Chaoyang 116.3°E 39.57°N 54597 Xiayunling 115.44°E 39.44°N

LightGBM and LSTM

As a neural network algorithm that can memorize sequence information, LSTM is the most widely used model in time series forecasting. However, it has mostly been applied to prediction at a single site, not to predicting multiple pollutants at all sites in a city. As an improved framework for shallow machine learning decision trees, LightGBM matches neural network algorithms in processing speed and memory footprint. It is widely used in competitions such as search ranking and CTR prediction, but has not yet been applied to air-pollution-related prediction.

LightGBM

The LightGBM algorithm uses a histogram-based feature ranking method: continuous features are divided into discrete bins, which reduces the memory of the block structure and the computational cost compared with the pre-sorted approach. LightGBM is another implementation framework of GBDT46 and a more powerful algorithm better suited to processing big-data features. Compared with the XGBoost47 algorithm, the decision tree growth strategy used by LightGBM is the leaf-wise method with depth restriction: instead of splitting every leaf of the same layer, only the leaf with the largest gain is split, while leaves with small gain are left untouched; in this way a decision tree is finally formed. With the same number of splits, the leaf-wise strategy generates deeper trees whose loss function values are closer to the residuals. However, it is also prone to overfitting, so a maximum depth is set to prevent this.

When calculating the gain after a split, the LightGBM algorithm operates on binned features. Compared with the XGBoost algorithm, which operates on individual feature values, it runs faster and benefits from cache optimization.

LightGBM solves a problem shared by GBDT and XGBoost: the information gain can only be calculated by traversing all samples to find the optimal split point, which makes the scalability and efficiency of the latter two algorithms unsatisfactory for massive data or high-dimensional feature calculation. LightGBM combines the gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB) algorithms to reduce the amount of data and the number of features while preserving regression accuracy. The LightGBM generation process is shown in Fig. 1:

Figure 1.

Figure 1

The LightGBM generation process.
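GOSS can be illustrated with a toy NumPy sketch (not the library's internal implementation): all samples with the largest absolute gradients are kept, a random fraction of the remainder is sampled, and the sampled small-gradient instances are up-weighted by (1 − a)/b so that the estimated information gain stays approximately unbiased. The function name and the rates are illustrative.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Toy one-side sampling: keep the top_rate fraction of samples with the
    largest |gradient|, randomly sample an other_rate fraction of the rest,
    and up-weight the sampled small-gradient instances."""
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))    # indices, largest |gradient| first
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top_idx = order[:n_top]
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)
    weights = np.ones(n_top + n_other)
    # amplify the small-gradient subset by (1 - top_rate) / other_rate
    weights[n_top:] = (1.0 - top_rate) / other_rate
    idx = np.concatenate([top_idx, other_idx])
    return idx, weights

g = np.random.default_rng(1).normal(size=100)
idx, w = goss_sample(g)   # 20 large-gradient + 10 re-weighted small-gradient samples
```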

LSTM

As a core algorithm of machine learning, neural networks are mainly divided into three categories: feedforward neural networks, feedback neural networks and graph networks. The first two belong to the hierarchical network structure, and the last to the interconnected network structure. BP networks, FNNs, and the CNNs used for image classification all belong to feedforward neural networks, while the information flow in feedback neural networks can be bidirectional, unidirectional, or self-circulating. This means that a node can receive input from neurons in previous layers as well as cyclic feedback from its own output, as in Recurrent Neural Networks (RNNs) and Hopfield networks.

RNNs, however, suffer from vanishing and exploding gradients, which led Hochreiter and Schmidhuber to propose the LSTM network. LSTM is an improved algorithm based on RNN that can store long-term data information. It adds three gates to the original RNN structure to control the selection of information, together with a cell state that serves as a "long-term memory" running through the entire sequence; its structure is shown in Fig. 2.

Figure 2.

Figure 2

Schematic diagram of the LSTM structure.
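For illustration, one LSTM time step with the three gates and the cell state can be sketched in NumPy. This is a minimal didactic version, not a production implementation; the 16-feature input and the 3-h window follow the setup used later in this study, while the hidden size and random weights are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters of the forget (f),
    input (i), candidate (g) and output (o) transforms, each of size `hidden`."""
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b                 # all four transforms at once
    f = sigmoid(z[0 * hidden:1 * hidden])      # forget gate: what to drop from c
    i = sigmoid(z[1 * hidden:2 * hidden])      # input gate: what to write to c
    g = np.tanh(z[2 * hidden:3 * hidden])      # candidate cell values
    o = sigmoid(z[3 * hidden:4 * hidden])      # output gate: what to expose
    c = f * c_prev + i * g                     # cell state: the "long-term memory"
    h = o * np.tanh(c)                         # hidden state passed to the next step
    return h, c

rng = np.random.default_rng(0)
n_in, hidden = 16, 8                           # 16 features per step, as in this study
W = rng.normal(scale=0.1, size=(4 * hidden, n_in))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
for t in range(3):                             # a 3-h input window
    x = rng.normal(size=n_in)
    h, c = lstm_step(x, h, c, W, U, b)
```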

Summary of the data

Before conducting the time series prediction, we first analyzed the air pollution distribution in the study area to understand the trends and causes of changes in air pollution in Beijing in recent years and provide a basis for determining the input factors for time series prediction. This section mainly focused on analysis of the stations.

First, all the obtained hourly data was read and merged into the same file, and the format was then converted into a table with time and station as rows and the concentration of PM2.5 as columns. On this basis, averages at different time scales were obtained. According to the classification of the monitoring stations, PM2.5 averages were computed for the four station types: main urban areas, suburbs, traffic pollution points, and control area points. In addition, the PM2.5 concentration was analyzed over yearly, seasonal and daily time series (March–May is spring, June–August is summer, September–November is autumn, and December–February is winter), as presented below.
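One way to realize this read-and-reshape step is sketched below with pandas; the station names and concentration values are placeholders, and the long-format column names are assumptions.

```python
import pandas as pd

# Toy long-format records: one row per (time, station) observation.
records = pd.DataFrame({
    "time": pd.to_datetime(["2018-01-01 00:00"] * 2 + ["2018-01-01 01:00"] * 2),
    "station": ["Dongsi", "Tiantan", "Dongsi", "Tiantan"],
    "pm25": [35.0, 40.0, 38.0, 42.0],
})

# Reshape into a table indexed by time, one column per station,
# then average over a coarser time scale (here: daily means).
wide = records.pivot(index="time", columns="station", values="pm25")
daily_mean = wide.resample("D").mean()
```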

It can be seen from Figs. 3, 4, and 5 that the peak PM2.5 concentration was 261.5 μg/m3 in 2018, 277 μg/m3 in 2019 and 218 μg/m3 in 2020; the peak value fell by one pollution level, and no serious pollution occurred. The seasonal variation is characterized by severe pollution in winter and spring, while the summer concentrations were the lowest of the year. The winter averages of 2018 and 2019 were 55.71 μg/m3 and 59.78 μg/m3, respectively, and the summer averages of 2018, 2019 and 2020 were 43.09 μg/m3, 33.72 μg/m3, and 31.31 μg/m3. In the summer of 2020 in particular, the daily value was mostly 75 μg/m3 or below, and PM2.5 emissions met the "good" air quality standard.

Figure 3.

Figure 3

PM2.5 concentration change at each monitoring point in 2018.

Figure 4.

Figure 4

PM2.5 concentration changes at each monitoring point in 2019.

Figure 5.

Figure 5

PM2.5 concentration changes at all monitoring points in 2020.

The concentration of PM2.5 differed between stations. The number of days with heavy pollution (150 μg/m3 and above) is shown in Table 3 below. Compared with 2018, the number of severely polluted days in 2019 decreased by nearly half. The number of pollution days at traffic pollution stations in 2018 and 2019 was much higher than in the suburbs, and the suburban average was also the lowest (Table 4). The number of pollution days at each monitoring station in 2020 was very low; in particular, the number at traffic pollution stations dropped the most compared with the previous two years, which may be related to COVID-19 control measures and working from home. According to statistics, the average PM2.5 value was 52.96 μg/m3 in 2018 and 44.46 μg/m3 in 2019. The decrease, at about 15%, was almost the same as that of the AQI, indicating that the PM2.5 control measures in Beijing and the surrounding areas were effective and had already produced a preliminary effect.

Table 3.

Days of heavy PM2.5 pollution in recent three years.

Year Reference point (day) Traffic pollution spot (day) Main urban (day) Suburban (day) Total (day)
2018 15 18 15 12 20
2019 9 9 6 4 12
2020 (As of August 31) 8 7 8 7 8

Table 4.

The annual average value of PM2.5 at each classified monitoring point.

Year Reference point Traffic pollution spot Main urban Suburban
2018 55.12 54.98 51.83 49.898
2019 47.72 46.00 43.80 40.33

Proposed PM2.5 predictor

Classification of data set

The pre-processed and specially selected hourly data from January 1, 2018 to October 1, 2020 were divided into three subsets for training, validation, and testing. The data from January 1, 2018 to June 30, 2019 form the training set; the data from July 1 to December 31, 2019 the validation set; and the hourly data from January 1 to August 31, 2020 the testing set. The training and validation data are divided into input factors and output factors. The input factors comprise 6 meteorological parameters and 7 time-characteristic parameters (holidays, working days, weekends, the first day of a working period, the last day of a working period, the first day of a rest period and the last day of a rest period); the output factor is the pollutant concentration. The test data set includes only the 13 input factors, and the predicted output is the corresponding pollutant concentration.
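Under the date boundaries above, the split can be sketched with pandas; the single `pm25` column here is a placeholder for the full feature table.

```python
import pandas as pd
import numpy as np

# Hourly index covering the study period.
idx = pd.date_range("2018-01-01", "2020-08-31 23:00", freq="h")
df = pd.DataFrame({"pm25": np.arange(len(idx), dtype=float)}, index=idx)

# Date boundaries taken from the text; .loc with date strings is inclusive.
train = df.loc[:"2019-06-30"]
valid = df.loc["2019-07-01":"2019-12-31"]
test = df.loc["2020-01-01":"2020-08-31"]
```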

Although they all served as input factors, the meteorological parameters differed greatly in magnitude: visibility reached five digits while wind speed was single digits. Since features of very different scales participating in training at the same time may affect the final prediction result, the data was normalized in order to verify the degree of this effect.
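A minimal min-max normalization sketch, using illustrative visibility and wind-speed magnitudes (the stored (min, max) pair is what the de-normalization step later relies on):

```python
import numpy as np

def minmax_scale(col):
    """Scale one input factor to [0, 1]; also return the (min, max)
    needed later for de-normalization."""
    lo, hi = col.min(), col.max()
    return (col - lo) / (hi - lo), (lo, hi)

visibility = np.array([12000.0, 30000.0, 8000.0])   # five-digit magnitudes
wind_speed = np.array([1.2, 3.5, 0.4])              # single-digit magnitudes

vis_scaled, vis_range = minmax_scale(visibility)
ws_scaled, ws_range = minmax_scale(wind_speed)
```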

Selection of error index

The choice of error index depends on the target task of LightGBM. For the regression task of this study there were multiple choices, such as the mean absolute error (MAE), mean square error (MSE), and RMSE. RMSE is the square root of MSE; since it has the same dimension as the training data, it describes the data characteristics better and is generally used to evaluate machine learning results. In this study, MAE and MSE were selected as the loss-function evaluation indicators during the iterative process on the training and validation sets, and RMSE was used for the final evaluation of the prediction results.
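The three indicators can be written directly in NumPy (the values below are toy concentrations for illustration):

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    # RMSE is the square root of MSE, so it is in the same unit (μg/m3)
    # as the pollutant concentration itself.
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([50.0, 60.0, 70.0])
y_pred = np.array([48.0, 63.0, 69.0])
```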

Adjust the parameters

There were many parameters of LightGBM. According to the function of the parameters, the parameters were adjusted in the following four steps.

First, the learning rate was determined. The second step was to tune the two parameters that improve accuracy, namely the maximum depth of the tree and the number of leaf nodes, which together determine the complexity of the decision tree. The third step was to prevent over-fitting: the growth strategy of LightGBM makes the tree converge faster, but it also increases the probability of overfitting. In the last step, to further improve accuracy, the original learning rate was reduced to 0.01, 0.03, 0.005, etc., and the RMSE scores were calculated in turn. Finally, the model parameters for training with the data of all stations were determined as shown in Table 5 below; the parameters of single-station models were tuned in the same way.

Table 5.

Key parameter Settings of LightGBM prediction model.

Parameter Value Parameter Value
num_boost_round 2663 max_depth 12
num_leaves 800 min_data_in_leaf 1
boosting_type gbdt bagging_fraction 0.9
learning_rate 0.005 feature_fraction 0.8
metric Loss function (‘l1’, ‘l2’) bagging_freq 1
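For reference, the Table 5 settings can be collected into the parameter dictionary that `lightgbm.train` expects; the usage line in the comment is only a sketch, and `train_set`/`valid_set` are hypothetical dataset objects.

```python
# Key LightGBM parameters from Table 5, in the form lightgbm.train expects.
# 'l1' and 'l2' are the LightGBM aliases for MAE and MSE.
params = {
    "boosting_type": "gbdt",
    "learning_rate": 0.005,
    "num_leaves": 800,
    "max_depth": 12,
    "min_data_in_leaf": 1,
    "bagging_fraction": 0.9,
    "bagging_freq": 1,
    "feature_fraction": 0.8,
    "metric": ["l1", "l2"],
}
NUM_BOOST_ROUND = 2663

# Usage sketch (requires the lightgbm package; train_set/valid_set hypothetical):
#   model = lightgbm.train(params, train_set, num_boost_round=NUM_BOOST_ROUND,
#                          valid_sets=[valid_set])
```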

Prediction of test data set

After the parameter settings, the above parameters were used for formal model training and validation, through which, the final decision tree model will be determined. Ultimately, the test data set was substituted for prediction to show the results of pollutant concentration in the future.

Denormalization

If the pollutant concentration was normalized during training, the predicted values would also fall between 0 and 1, so the data had to be restored to its original range. Suppose the predicted value is X1; the minimum (Min) of the original data column, corresponding to 0, and the maximum (Max), corresponding to 1, must first be found, and the original scale is then restored via the function:

X = X1 × (Max − Min) + Min    (1)

The predicted PM2.5 values were restored using the maximum and minimum PM2.5 values of the original training data. Similarly, the predicted value ranges of PM10 and O3 were restored using the extreme values of their respective training data.
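The restoration in Eq. (1) can be sketched as a one-line helper, with `col_min` and `col_max` standing in for the training-data extremes of each pollutant:

```python
def denormalize(x_norm, col_min, col_max):
    # Invert min-max scaling: map a value in [0, 1] back to the original
    # concentration range via X = X1 * (Max - Min) + Min.
    return x_norm * (col_max - col_min) + col_min
```
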

For the LSTM model, the division of the data set, the selection of the error index, and the normalization and denormalization of the data were consistent with LightGBM. The additional processing specific to LSTM is introduced below.

Processing of data set

Since LSTM requires the input to be a three-dimensional tensor, the input data had to be resampled into three dimensions after the data set was classified and normalized. Before this conversion, the data had to be rearranged as time-ordered supervised data, because LSTM relies on time-series information. The training process involves historical pollutant concentrations; without this conversion, future values would leak into training and the prediction model would not be constructed correctly. The following example illustrates the conversion. The input covered 3 h with 16 features, namely the pollutant concentration at the past three moments (including the current one) plus 13 meteorological and time features of the next moment, and the output was the pollutant concentration one hour ahead. The procedure was as follows. First, the original data was marked as time t; adding one blank row at the top of the original data produced the t − 1 series, two blank rows the t − 2 series, and one blank row at the bottom the t + 1 series. The four time columns were then merged into one table, and rows containing null values were deleted, yielding the supervised time-series data. Finally, the drop function was used to delete the meteorological and time features at t − 1, t − 2 and t; the meteorological and time features of the next moment were added as input, and the pollutant concentration at time t + 1 was taken as the label item, completing the sequence conversion.

The data dimension was then converted according to the number of samples, the input time steps and the features. For example, the PM2.5 data table of the Aotizhongxin (Olympic Sports Center) station originally had size (17520, 42); after the conversion it became (17520, 3, 14), where 17520 is the number of samples, 3 the number of input time steps, and 14 the number of features per time step.
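The shift-and-merge conversion and the reshape to (samples, time steps, features) described above can be sketched in pure Python as follows. Treating the pollutant concentration as feature index 0 is an illustrative assumption:

```python
def to_supervised(series, n_in=3):
    # series: list of per-hour feature vectors (length-14 vectors here).
    # Returns (X, y): each X sample stacks the feature vectors of hours
    # t-2, t-1 and t (a window of n_in steps), and y is the pollutant
    # value at hour t+1 (assumed to be feature index 0 for illustration).
    X, y = [], []
    for t in range(n_in - 1, len(series) - 1):
        X.append([series[t - n_in + 1 + k] for k in range(n_in)])
        y.append(series[t + 1][0])
    return X, y
```

Rows at the boundaries that lack a full history or a t + 1 label are dropped, mirroring the deletion of null rows in the text.
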

Construction of prediction models

The first step was to define the network, in which three layers were set up. The input layer of the LSTM neural network had 64 neurons; its input shape was 3 time steps by 14 features, and the result of each time step was passed to the hidden layer. The LSTM hidden layer also had 64 neurons and output only the result of the last time step to the output layer. The fully connected output layer had 1 neuron with a linear activation function.

Secondly, the network was compiled with the default configuration, MSE as the loss function, and Adam as the optimization algorithm.

The third step was to train the network on the data, which involved two parameters: batch size and epochs. All training samples were divided into several subsets; after all samples in a subset had been processed, the weight parameters were updated once. The number of samples in such a subset is the batch size, which was set to 72 based on experience. Training over all subsets once, updating all gradients, constitutes one epoch. We tested epoch counts of 100, 50, 20 and 10 and compared the MSE on the validation data set. It turned out that with 50 epochs the two loss curves met earlier (Fig. 6); after they meet, over-fitting or a reverse increase of the error may occur (Fig. 6). Most stations tended to converge at around 20 epochs, and the final number of iterations for each station was adjusted according to this error curve.
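The relation between batch size and weight updates described above can be checked with simple arithmetic, using the sample count and batch size given in the text:

```python
import math

n_samples = 17520   # hourly records per station, as in the text
batch_size = 72     # chosen from experience in the text

# One epoch = one pass over every batch, with one weight update per batch.
updates_per_epoch = math.ceil(n_samples / batch_size)

# At 50 epochs (where the training and validation losses met), the
# network performs this many gradient updates in total.
total_updates = updates_per_epoch * 50
```
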

Figure 6.

Figure 6

Error trends of training and test sets for the two sites at epoch 50.

In the last step, the test data was substituted into the trained model for prediction, and the final prediction effect was obtained through error evaluation.

PM2.5 predictor structure

Outlier handling

When pollutant concentrations were predicted hourly, outliers had an important impact on prediction accuracy. The main data-cleaning steps for the pollutant data were therefore as follows:

The names of the 34 stations were obtained from the data and used to process the missing data and outliers of each station in a loop. All days of the year and all hours of each day were extracted from the time series and stored for missing-data interpolation and outlier judgment;

Two new empty arrays were created. One stored the time, with the same start and end as the original time column and a step of one hour, ensuring continuous output times. The other had 24 × 366 rows and two fewer columns than the original, and recorded the data values corresponding to each moment;

For all column data at a given time, the data within one day before and after the current value was first examined. If more than half of the data on the previous or following day was missing, or if data was missing for four consecutive days, this time was skipped; otherwise the index at that moment was recorded. Each column was then checked for null values. Null values were filled according to the method above; non-null values were tested for outliers with the interquartile method, a statistical analysis technique that sorts all values in ascending order and divides them into four equal parts at three dividing points. A value marked as an outlier was reset to empty and treated as missing data.

When a moment was completed, the output file was written in the order of time, station, and PM2.5 concentration.

The data at the next moment was then judged in the same way until the last moment, looping through all time data of the station.

The remaining stations were processed by the same method in turn; once all data was completed, the output file was saved and the procedure ended.
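The interquartile screening used in the cleaning loop can be sketched as follows. The 1.5 × IQR fence multiplier is the conventional choice and an assumption here, since the paper names the interquartile method but not the multiplier:

```python
def iqr_outlier_mask(values, k=1.5):
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR]. k = 1.5 is the
    # conventional fence multiplier (assumed, not stated in the paper).
    s = sorted(values)

    def quantile(q):
        # Linearly interpolated quantile over the sorted values.
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return [v < q1 - k * iqr or v > q3 + k * iqr for v in values]
```

Flagged values would then be reset to empty and refilled as missing data, as described above.
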

Time feature processing

In addition to the meteorological conditions that affect the formation and diffusion of pollutants, traffic sources and human activities also influence pollution concentrations. Pollution in different time periods is related to the frequency of travel on that day. Therefore, this study derived features for each time that indirectly indicate the intensity of human activity and traffic conditions on that day.

Seven categories were derived for each time in the weather and pollutant data: holidays, working days, weekends, the first day of the working week, the last day of the working week, the first day of rest days and the last day of rest days. Weekends were easiest to locate: the weekday function was applied to the current time, and a result of 5 or 6 indicated Saturday or Sunday. Holidays were also easy to find: all statutory holidays were stored in an array "which_holiday", and each day was marked 1 if it appeared in the array and 0 otherwise. Working days required removing statutory holidays from Monday to Friday and then adding the Saturdays and Sundays that were worked, so the weekend working days were stored separately in an array "which_work". A day was marked 1 if the weekday function returned less than 5 and it was not in "which_holiday", or if it was in "which_work"; otherwise it was marked 0. The remaining four categories were processed in the same way. Finally, every day from January 1, 2018 to October 2, 2020 was classified according to these category features, yielding 7 new feature columns.
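The weekend, holiday and working-day flags can be sketched with Python's `datetime`. The contents of `which_holiday` and `which_work` below are illustrative placeholders for the official calendar, not the paper's actual lists:

```python
from datetime import date

# Illustrative stand-ins for the paper's "which_holiday" and "which_work"
# arrays; a real implementation would load the official calendar.
which_holiday = {date(2018, 10, 1)}   # e.g. a statutory holiday
which_work = {date(2018, 9, 29)}      # e.g. a make-up working Saturday

def is_weekend(d):
    # weekday() returns 5 for Saturday and 6 for Sunday.
    return 1 if d.weekday() in (5, 6) else 0

def is_workday(d):
    # Monday-Friday minus statutory holidays, plus make-up weekend shifts.
    if d in which_holiday:
        return 0
    return 1 if (d.weekday() < 5 or d in which_work) else 0
```
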

Station matching

In addition, weather stations and air quality stations had to be matched with each other. The latitude and longitude of both were imported into ArcMap, and neighboring stations were matched by shortest distance. The matching results, shown in Table 6 below, were stored in the same table.

Table 6.

Matching results of the weather station and air quality station.

ID Air quality station Weather station Type
1 Fangshan Fangshan Suburbs
2 Daxing Daxing
3 Yizhuang Beijing
4 Tongzhou Tongzhou
5 Shunyi Shunyi
6 Changping Changping
7 Mentougou Mentougou
8 Pinggu Pinggu
9 Huairou Huairou
10 Miyun Miyun
11 Yanqing Yanqing
12 Dongsi Chaoyang Six major urban areas
13 Tiantan Beijing
14 Guanyuan Haidian
15 Wanshouxigong Fengtai
16 Aotizhongxin Chaoyang
17 Nongzhanguan Chaoyang
18 Wanliu Haidian
19 Beibuxinqu Haidian
20 Fengtai Garden Fengtai
21 Yungang Fengtai
22 Gucheng Shijingshan
23 Dingling Changping Contrast point and area point
24 Badaling Yanqing
25 Miyun reservoir Shangdianzi
26 Donggao Village Pinggu
27 Yonglidian Tongzhou
28 Yufa Daxing
29 Liuli River Fangshan
30 Qianmen Beijing Traffic pollution monitoring point
31 Yongdingmennei Beijing
32 Xizhimenbei Haidian
33 Nansanhuan Fengtai
34 Dongsihuan Chaoyang

The matching process was as follows. First, the names of all air quality stations were matched in turn against the names of the weather stations to obtain the initially matched station data. Then, the 24 stations without a corresponding name were saved as a list and matched according to the rules in Table 6. For example, Aotizhongxin (Olympic Sports Center), Dongsi, Dongsihuan (East Fourth Ring) and Nongzhanguan (Agricultural Exhibition Hall), all situated in Chaoyang District, were stored in one list. A new table named "match" was then created to store the wind speed and direction of the Chaoyang District weather station. When the name of an air quality station matched a name in the list, it was renamed to the corresponding station name and appended to the matched station data. These operations were repeated until all stations were matched. After the spatial matching, the two data sets were automatically matched in time on "station_id" and "UTC_time" by the merge function. Finally, the output data after spatio-temporal matching was obtained.
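The shortest-distance matching performed in ArcMap can equally be sketched in code with a great-circle distance. The station coordinates below are placeholders, not the real station locations:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points, in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_weather_station(aq_lat, aq_lon, weather_stations):
    # weather_stations: {name: (lat, lon)}; returns the closest name.
    return min(weather_stations,
               key=lambda n: haversine_km(aq_lat, aq_lon, *weather_stations[n]))
```
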

After matching the meteorological data, pollutant data and time features of each station, the correlations among them are shown in Fig. 7. The relative humidity in the meteorological data was negatively correlated with visibility. Among the pollutant data, AQI (Air Quality Index) was most strongly positively correlated with PM2.5 concentration, reaching 0.9; the primary factor affecting air quality was still PM2.5, followed by PM10, whose correlation was lower by less than 0.1. The meteorological factor most strongly correlated with PM2.5 was visibility. Among the time features, the negative correlation between weekends and working days was the largest. The correlation analysis shows that the factors selected in this study to affect pollutant concentrations were representative with little overlap, and it provided an understanding of the relationships among the features.

Figure 7.

Figure 7

Correlation between pollutant data and input factors after spatio-temporal matching.

Results and discussion

Comparison of the proposed PM predictor with LightGBM prediction methods

After the predictions of the two models, the final results were compared and displayed using the RMSE evaluation index. Yizhuang was selected to represent the suburbs, Guanyuan the main urban area, Yufa the control area, and Dongsihuan (East Fourth Ring) the traffic stations, to display the time-series predictions.

Prediction results and accuracy of LightGBM model at all stations

First, the data of all stations was integrated into the same model for training, and three different input data types produced different prediction results. The accuracy statistics are shown in Table 7 below.

Table 7.

Comparison of prediction results of three pollutant parameters in LightGBM model for all sites.

RMSE_PM2.5
Full normalization Not normalized Input normalization
East fourth ring 31.396 33.12 32.45
Guanyuan 32.20 28.68 28.10
Yizhuang 31.07 31.84 30.38
Yufa 42.99 38.09 35.88

It can be seen from the table that when the input factors were normalized, the predicted PM2.5 concentrations were better than the unnormalized ones and the RMSE was smaller, indicating that inputting factors with different dimensions at the same time influences the output. When both the input factors and the labels were normalized, and the extreme value range of the training data was used for denormalization, the results were more polarized: some were worse than without normalization, such as Yufa's PM2.5 prediction, while normalizing all data was occasionally better than normalizing only the input factors, as for Dongsihuan (East Fourth Ring). The reason is that the extreme value range of the training data did not exactly coincide with that of the prediction period.

Prediction results and accuracy of LightGBM model at a single station

In order to verify the impact on the PM2.5 prediction results of training one model on all stations versus one model per station, the 34 stations were trained separately and the prediction results analyzed. The accuracy indicators are shown in Table 8 below. Since normalizing the labels did not yield ideal results, this input type was not tested here. The single-station training results remained consistent with the all-station results: when the input factors were normalized, the error was smaller than without normalization, illustrating the importance of the normalization choice.

Table 8.

Comparison of prediction results of three pollutant parameters in LightGBM model at a single site.

RMSE_PM2.5
Unnormalized Input normalization
Dongsihuan 32.57 30.99
Guanyuan 34.58 34.14
Yizhuang 32.17 29.91
Yufa 42.29 41.01

Comparing Tables 7 and 8 shows that the impact of the model parameters on accuracy was not as large as that of input-factor normalization, and some single-station models were even less accurate than the all-station model, such as the PM2.5 predictions at Guanyuan and Yufa stations. When the data of all stations was placed in one model, more data participated in training than when each station's data was placed in its own model, which reflects the strength of machine learning for massive data analysis: as long as the amount of input data is large enough, the model predictions will generally be more accurate. Consequently, when a decision tree model is used to predict air pollutant concentrations, the input factors should be labeled and trained in a unified model, but the running speed should be optimized, since parameter tuning takes a long time with large data volumes.

LSTM prediction results and evaluation

LSTM 3-h input prediction results and accuracy evaluation

The comparison in the LightGBM model of normalizing only the input factors, normalizing all data, and not normalizing at all showed that the prediction results were better only when the input factors were normalized. Therefore, the same input method was adopted for training the LSTM model, and the other two data types were not attempted. Meanwhile, the difference in accuracy between one model for all stations and one model per station in LightGBM was not noteworthy, so that comparison was not repeated here; the data of one station was used to predict the PM2.5 pollutant concentration. As inputs of different durations affect the output, 3-h and 12-h inputs were selected to assess the influence of different time spans. This section presents the prediction results and accuracy evaluation of the 3-h input. The initial parameters of the network differed between trainings, so the prediction results also varied; neural network models generally require multiple trainings, and three trainings were conducted for each station. The prediction accuracy results are given in Table 9, which shows that the prediction performance of LSTM was significantly better than that of the LightGBM model, with errors lower by roughly half. The air quality stations and their corresponding weather stations are also listed in Table 9.

Table 9.

PM2.5 prediction accuracy of LSTM model for air quality stations and its corresponding weather stations.

ID Air quality stations RMSE Weather stations Type
1 Fangshan 9.334 Fangshan Suburbs
2 Daxing 7.401 Daxing
3 Yizhuang 7.913 Beijing
4 Tongzhou 9.392 Tongzhou
5 Shunyi 7.865 Shunyi
6 Changping 10.634 Changping
7 Mentougou 9.572 Mentougou
8 Pinggu 8.091 Pinggu
9 Huairou 8.363 Huairou
10 Miyun 8.695 Miyun
11 Yanqing 11.85 Yanqing
12 Dongsi 9.248 Chaoyang Six major urban areas
13 Tiantan 13.443 Beijing
14 Guanyuan 9.694 Haidian
15 Wanshouxigong 10.061 Fengtai
16 Aotizhongxin 10.410 Chaoyang
17 Nongzhanguan 8.63 Chaoyang
18 Wanliu 10.768 Haidian
19 Beibuxinqu 11.103 Haidian
20 Fengtaihuayuan 11.59 Fengtai
21 Yungang 10.014 Fengtai
22 Gucheng 11.486 Shijingshan
23 Dingling 8.732 Changping Control Points and Regional Points
24 Badaling 11.238 Yanqing
25 Miyunshuiku 6.199 Shangdianzi
26 Donggaocun 9.098 Pinggu
27 Yongledian 13.221 Tongzhou
28 Yufa 13.234 Daxing
29 Bolihe 12.842 Fangshan
30 Qianmen 8.597 Beijing Traffic pollution monitoring point
31 Yongdingmennei 10.509 Beijing
32 Xizhimenbei 13.559 Haidian
33 Nansanhuan 10.014 Fengtai
34 Dongsihuan 13.862 Chaoyang

Among all the stations, Miyun Reservoir had the smallest PM2.5 prediction error. The largest RMSE, 13.862 at Dongsihuan (East Fourth Ring), was still far below the 30.99 predicted by LightGBM. The prediction error in the suburbs was significantly lower than in the main urban area and at the traffic pollution points, partly because there are few weather stations in the main urban area: Dongcheng and Xicheng Districts have no weather station, so substituting adjacent weather conditions introduces some error. It may also be related to the PM2.5 value range of each station.

To better compare the fitting of the LSTM and LightGBM models, the same four stations were selected for display. The fit between predictions and true values is shown in Fig. 8: although some high values were missed, overall the difference between the predicted and true values was smaller and the trends were consistent.

Figure 8.

Figure 8

LSTM 3-h input factor normalized PM2.5 pollutant prediction fitting results.

LSTM 12-h input prediction results and accuracy evaluation

Error analysis of the PM2.5 training at the four selected stations showed that the model converged at around 10 epochs, where the prediction result was best. If the error of the test set fell below that of the training set, over-fitting would occur; generally, the errors of the test and training sets intersect or both stabilize at one value, and when both are flat the model has converged. The four graphs in Fig. 9 all show convergence. Under convergence, the accuracy of the pollutant predictions at the four stations was compared with the 3-h results (Fig. 9). The 12-h LSTM predictions were still much better than those of the LightGBM model, and compared with the 3-h input, the RMSE for PM2.5 at Dongsihuan (East Fourth Ring) was slightly reduced. However, increasing the input length did not improve accuracy at all stations; the input duration should be determined from the training data, and a suitable length can only be found through multiple trainings.

Figure 9.

Figure 9

Error comparison between LSTM 12-h input PM2.5 training set and test set.

This article introduced the principles of the two machine learning methods used for time-series prediction, namely an improved decision tree model and a neural network model, providing the background for understanding the construction of the prediction models. The data-processing workflow was then explained. The two types of processed data were continuous in time but did not overlap in space, as shown by comparing the latitude and longitude of the weather stations and air quality stations; spatial matching was therefore required. Finally, time features were added to the processed data and a correlation analysis was performed.

In the process of model construction, because the dimensions of the meteorological and pollutant data columns differed, the input data was prepared in three states, namely no normalization, normalization of only the input factors, and normalization of all data, to verify the impact of data dimensions on the prediction model. Comparing the RMSE in each case showed that normalizing all data gave the worst results and normalizing only the input factors the best; thus, data in which only the meteorological conditions and time features were normalized was finally selected for model training. In addition, a comparison of training the data of all 34 stations in a unified model versus training each station's data in its own model was conducted with the LightGBM prediction model, which showed that the difference between the two was not obvious; the all-station versus single-station comparison was therefore not repeated for the neural network model. The comparison between the prediction results of the LightGBM and LSTM models at the Dongsi air quality station is shown in Fig. 10.

Figure 10.

Figure 10

Comparison of prediction results between LightGBM and LSTM models at Dongsi air quality station.

Different input step sizes were tested in the neural network model because, unlike the decision tree model, it adds historical pollutant data. It was therefore necessary to explore how long a history is most beneficial to the prediction results. After comparing the 3-h and 12-h inputs, this study concluded that the 3-h prediction was better most of the time, so 3 h was used as the final input length of the LSTM prediction model.

The prediction accuracy and fitting curves of the two models show that LSTM performed significantly better than the LightGBM model, with RMSE reduced by nearly half. Yi-Ting Tsai et al. used LSTM to predict hourly PM2.5 concentrations in Taiwan and reached high accuracy48; this study reached a similar conclusion in Beijing.

Conclusion

By comparing the prediction results of the two models, it was found that the RMSE of each pollutant predicted by the LSTM model was nearly 50% lower than that of LightGBM, and its predictions fitted the actual observations more closely. Exploration of the input step size of the LSTM model showed that the accuracy with 3-h input data was higher than with 12-h input data. The prediction models required preprocessing of the input data, including input feature extraction, input factor normalization, and outlier processing. In addition, training the data of all stations in one model brought little accuracy improvement over training each station's data in its own model.

As the impact of air pollution on daily life and people's health becomes more and more prominent, countries and regions have gradually established multiple monitoring mechanisms and accumulated massive amounts of historical pollution data. Since station monitoring is concentrated at points, research fusing remote sensing data with station data for prediction is possible: stations offer high-frequency, all-day monitoring, while remote sensing covers large areas but lacks evening data. The advantages of the various monitoring methods should be combined and their shortcomings avoided, so as to obtain prediction data with high temporal and spatial accuracy and protect people's lives and health.

Supplementary Information

Author contributions

R.Y. and L.Y. proposed conceptualization and methodology. R.Y. and X.L. collected and organized datasets. X.H., L.L., C.W., and Q.L. ran models, analyzed the results, and visualized the study. X.Z. and W.C. wrote the original draft. J.L., S.L., and W.C. reviewed the manuscript. All authors read and approved the final manuscript.

Funding

This research was supported by the National Key Research and Development Plan project "Watershed Non-point Source Pollution Prevention and Control Technology and Application Demonstration Project" (2021YFC3201500 and 2021YFC3201505); the National Key Research and Development Project (No. 2016YFC0502106); the National Natural Science Foundation of China (No. 41476161); and the Fundamental Research Funds for the Central Universities.

Data availability

All data generated or analysed during this study are included in this published article [and its supplementary information files].

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Rongjin Yang and Lizeyan Yin.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-022-24470-5.

References

1. Du RL. Analysis of the causes of air pollution in China and management measures. Sci. Technol. Innov. Her. 2014;11(20):106.
2. She YY, Li ZQ, Wang FL, et al. Variation characteristics and potential source analysis of atmospheric pollutants in west of the Qinling-Daba mountains from 2015 to 2018. Acta Sci. Circum. 2020;40(6):1987–1997.
3. Southerland VA, Brauer M, Mohegh A, et al. Global urban temporal trends in fine particulate matter (PM2.5) and attributable health burdens: Estimates from global datasets. Lancet Planet. Health. 2022;6(2):e139–e146. doi: 10.1016/S2542-5196(21)00350-8.
4. Wang LT, Wei Z, Yang J, et al. The 2013 severe haze over southern Hebei, China: Model evaluation, source apportionment, and policy implications. Atmos. Chem. Phys. 2014;14(6):3151–3173. doi: 10.5194/acp-14-3151-2014.
5. Pope CA, Burnett RT, Thun MJ, et al. Lung cancer, cardiopulmonary mortality, and long-term exposure to fine particulate air pollution. JAMA. 2002;287(9):1132–1141. doi: 10.1001/jama.287.9.1132.
6. Chow JC, Watson JG, Mauderly JL, et al. Health effects of fine particulate air pollution: Lines that connect. J. Air Waste Manag. Assoc. 2006;56(10):1368–1380. doi: 10.1080/10473289.2006.10464545.
7. Fann NL, Nolte CG, Sarofim MC, et al. Associations between simulated future changes in climate, air quality, and human health. JAMA Netw. Open. 2021;4(1):e2032064. doi: 10.1001/jamanetworkopen.2020.32064.
8. Lelieveld J, Evans JS, Fnais M, et al. The contribution of outdoor air pollution sources to premature mortality on a global scale. Nature. 2015;525(7569):367–371. doi: 10.1038/nature15371.
9. Lindner CK, Brode P. Impact of biometeorological conditions and air pollution on influenza-like illnesses incidence in Warsaw. Int. J. Biometeorol. 2021;65:929. doi: 10.1007/s00484-021-02076-2.
10. Niemeyer LE. Forecasting air pollution potential. Mon. Weather Rev. 1960;88(3):88–96. doi: 10.1175/1520-0493(1960)088<0088:FAPP>2.0.CO;2.
11. Zhang MY, Jie HE. Summary on time series forecasting model. Math. Pract. Theory. 2011;41(18):189–195.
12. Yu JY, Yimei Y, Jianhua X. A hybrid prediction method for stock price using LSTM and ensemble EMD. Complexity. 2020.
13. Hale J, Long S. A time series sustainability assessment of a partial energy portfolio transition. Energies. 2021;14:141. doi: 10.3390/en14010141.
14. Santosh T, Ramesh D, Reddy D. LSTM based prediction of malaria abundances using big data. Comput. Biol. Med. 2020;124:103859. doi: 10.1016/j.compbiomed.2020.103859.
15. Alyousifi Y, Othnan M, Faye I, et al. Markov weighted fuzzy time-series model based on an optimum partition method for forecasting air pollution. Int. J. Fuzzy Syst. 2020;22(5):1468–1486. doi: 10.1007/s40815-020-00841-w.
16. Yang X, Wu Q, Zhao R, et al. New method for evaluating winter air quality: PM2.5 assessment using Community Multi-Scale Air Quality Modeling (CMAQ) in Xi'an. Atmos. Environ. 2019;211:18–28. doi: 10.1016/j.atmosenv.2019.04.019.
17. Wang ZS, Li XQ, Wang ZS, et al. Application status of Models-3/CMAQ in environmental management. Environ. Sci. Technol. 2013;36(6L):386–391.
18. Zhang Y, Shen J, Li Y. An atmospheric vulnerability assessment framework for environment management and protection based on CAMx. J. Environ. Manag. 2018;207:341–354. doi: 10.1016/j.jenvman.2017.11.050.
19. Karegar E, Hossein Hamzeh N, Bodagh Jamali J, et al. Numerical simulation of extreme dust storms in east of Iran by the WRF-Chem model. Nat. Hazards. 2019;99(2):769–796. doi: 10.1007/s11069-019-03773-3.
20. Mallet V, Tilloy A, Poulet D, et al. Meta-modeling of ADMS-Urban by dimension reduction and emulation. Atmos. Environ. 2018;184:37–46. doi: 10.1016/j.atmosenv.2018.04.009.
21. Song PC, Zhang XW, Huang Q, et al. Main forecasting models and applications of urban ambient air quality in China. Sichuan Environ. 2019;38(03):70–76.
22. Han ZW, Du SY, Lei XN, et al. Numerical model system of urban air pollution prediction and its application. China Environ. Sci. 2002;03:11–15.
23. Ying WU, Wang YX. The effects of NAQPMS model and CMAQ model in ozone forecasting applications. Sichuan Environ. 2019;38(01):81–84.
24. Ma S, Zhang X, Gao C, et al. Multimodel simulations of a springtime dust storm over northeastern China: Implications of an evaluation of four commonly used air quality models (CMAQ v5.2.1, CAMx v6.50, CHIMERE v2017r4, and WRF-Chem v3.9.1). Geosci. Model Dev. 2019;12(11):4603–4625. doi: 10.5194/gmd-12-4603-2019.
25. Kukkonen J, Olsson T, Schultz DM, et al. A review of operational, regional-scale, chemical weather forecasting models in Europe. Atmos. Chem. Phys. 2012;12(1):1–87. doi: 10.5194/acp-12-1-2012.
26. Bai L, Wang J, Ma X, et al. Air pollution forecasts: An overview. Int. J. Environ. Res. Public Health. 2018;15(4):780. doi: 10.3390/ijerph15040780.
27. Zhang YL, Yu HE, Zhu JM, et al. Study of the prediction of PM2.5 based on the multivariate linear regression model. J. Anhui Sci. Technol. Univ. 2016;30(03):92–97.
28. Peng SJ, Shen JC, Zhu X, et al. Forecast of PM2.5 based on the ARIMA model. Saf. Environ. Eng. 2014;21(06):125–128.
29. Liu B, Binaykia A, Chang PC, et al. Urban air quality forecasting based on multidimensional collaborative Support Vector Regression (SVR): A case study of Beijing-Tianjin-Shijiazhuang. PLoS ONE. 2017;12(7):17. doi: 10.1371/journal.pone.0179763.
30. Taheri Shahraiyni H, Sodoudi S. Statistical modeling approaches for PM10 prediction in urban areas; A review of 21st-century studies. Atmosphere. 2016;7(2):15. doi: 10.3390/atmos7020015.
31. Abdullah S, Isma M, Ahmed AN, et al. Forecasting particulate matter concentration using linear and non-linear approaches for air quality decision support. Atmosphere. 2019;10(11):24. doi: 10.3390/atmos10110667.
  • 32.Chen K, Zhu Y. A summary of machine learning and related algorithms. Stat. Inf. Forum. 2007;05:105–112. [Google Scholar]
  • 33.Gocheva-Ilieva SG, Voynikova DS, Stoimenova MP, et al. Regression trees modeling of time series for air pollution analysis and forecasting. Neural Comput. Appl. 2019;31(12):9023–9039. doi: 10.1007/s00521-019-04432-1. [DOI] [Google Scholar]
  • 34.Ren C, Xie G. Prediction of PM_(2.5) concentration level based on random forest and meteorological parameters. Comput. Eng. Appl. 2019;55(2):213–20. [Google Scholar]
  • 35.Sahu SK, Yip S, Holland DM. A fast Bayesian method for updating and forecasting hourly Ozone levels. Environ. Ecol. Stat. 2011;18(1):185–207. doi: 10.1007/s10651-009-0127-y. [DOI] [Google Scholar]
  • 36.Osowski S, Garanty K. Forecasting of the daily meteorological pollution using wavelets and support vector machine. Eng. Appl. Artif. Intell. 2007;20(6):745–755. doi: 10.1016/j.engappai.2006.10.008. [DOI] [Google Scholar]
  • 37.Ruizsuarez JC, Mayoraibarra OA, Torresjimenez J, et al. Short-term Ozone forecasting by artificial neural networks. Adv. Eng. Softw. 1995;23(3):143–149. doi: 10.1016/0965-9978(95)00076-3. [DOI] [Google Scholar]
  • 38.Zhang R, Li W, Mo T. Review of deep learning. Appl. Res. Comput. 2018;47(04):385–97+410. [Google Scholar]
  • 39.He JJ, Yu Y, Xie YC, et al. Numerical model-based artificial neural network model and its application for quantifying impact factors of urban air quality. Water Air Soil Pollut. 2016;227(7):16. doi: 10.1007/s11270-016-2930-z. [DOI] [Google Scholar]
  • 40.Zhang H, Liu Y, Shi R, et al. Evaluation of PM10 forecasting based on the artificial neural network model and intake fraction in an urban area: A case study in Taiyuan City, China. J. Air Waste Manag. Assoc. 2013;63(7):755–763. doi: 10.1080/10962247.2012.755940. [DOI] [PubMed] [Google Scholar]
  • 41.Arhami M, Kamali N, Rajabi MM. Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations. Environ. Sci. Pollut. Res. 2013;20(7):4777–4789. doi: 10.1007/s11356-012-1451-6. [DOI] [PubMed] [Google Scholar]
  • 42.Grivas G, Chaloulakou A. Artificial neural network models for prediction of PM10 hourly concentrations, in the Greater Area of Athens, Greece. Atmos Environ. 2006;40(7):1216–1229. doi: 10.1016/j.atmosenv.2005.10.036. [DOI] [Google Scholar]
  • 43.Kim K, Kim D-K, Noh J, et al. Stable forecasting of environmental time series via long short term memory recurrent neural network. IEEE Access. 2018;6:75216–75228. doi: 10.1109/ACCESS.2018.2884827. [DOI] [Google Scholar]
  • 44.Tsai, Y.-T., Zeng, Y.-R., Chang, Y.-S. Air pollution forecasting using RNN with LSTM. 2018 IEEE 16th Intl Conf on dependable, autonomic and secure computing, 16th Intl Conf on pervasive intelligence and computing, 4th intl conf on big data intelligence and computing and cyber science and technology congress (DASC/PiCom/DataCom/CyberSciTech). 1074–1083 (2018). [DOI] [PMC free article] [PubMed]
  • 45.Yan X, Zang Z, Jiang Y, et al. A spatial-temporal interpretable deep learning model for improving interpretability and predictive accuracy of satellite-based PM2.5. Environ. Pollut. 2021;273:116459. doi: 10.1016/j.envpol.2021.116459. [DOI] [PubMed] [Google Scholar]
  • 46.Ji X, Chang W, Zhang Y, et al. Prediction model of hypertension complications based on GBDT and LightGBM. J. Phys. Conf. Ser. 2021;1813(1):012008. doi: 10.1088/1742-6596/1813/1/012008. [DOI] [Google Scholar]
  • 47.Ma X, Sha J, Wang D, et al. Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning. Electron. Commer. Res. Appl. 2018;31:24–39. doi: 10.1016/j.elerap.2018.08.002. [DOI] [Google Scholar]
  • 48.Tsai, Y.-T., Zeng, Y.-R., Chang, Y.-S. Air pollution forecasting using RNN with LSTM [M]. 2018 IEEE 16th Intl 2.4.1Conf on dependable, autonomic and secure computing, 16th Intl Conf on pervasive intelligence and computing, 4th Intl conf on big data intelligence and computing and cyber science and technology congress (DASC/PiCom/DataCom/CyberSciTech). 1074–1079 (2018). [DOI] [PMC free article] [PubMed]

Associated Data


Supplementary Materials

Data Availability Statement

All data generated or analysed during this study are included in this published article [and its supplementary information files].


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
