Author manuscript; available in PMC: 2023 Jan 24.
Published in final edited form as: J Expo Sci Environ Epidemiol. 2020 Aug 3;31(4):699–708. doi: 10.1038/s41370-020-0257-8

Predicting ambient PM2.5 concentrations in Ulaanbaatar, Mongolia with machine learning approaches

Temuulen Enebish 1, Khang Chau 2, Batbayar Jadamba 3, Meredith Franklin 4
PMCID: PMC9871862  NIHMSID: NIHMS1614534  PMID: 32747729

Abstract

Background

Accurately assessing individual exposure to ambient air pollution is a crucial part of epidemiological studies of the adverse health effects of poor air quality. This is particularly challenging in developing countries that have high levels of air pollution but sparse monitoring networks and a lack of consistent data.

Methods

We evaluated the performance of 6 different machine learning algorithms in predicting fine particulate matter (PM2.5) concentrations in Ulaanbaatar, Mongolia from 2010 to 2018. We found that the algorithms produce robust results based on performance metrics.

Results

Random forest (RF) and gradient boosting models performed best, each with a leave-one-location-out cross-validated R2 of 0.82 when using data from the entire study period. When the tuned models were applied to the hold-out test set, R2 increased to 0.96 for the RF and 0.90 for the gradient boosting model. We also predicted PM2.5 concentrations for each administrative area (khoroo) of the city using RF; maps of the predictions show spatiotemporal variations that are in line with the location of the ger district, the city center, and population density.

Conclusion

Our results provide evidence of the advantage and feasibility of machine learning approaches in predicting ambient PM2.5 levels in a setting with limited resources and extreme air pollution levels.

Introduction

Adverse effects of ambient fine particulate matter (PM2.5, aerodynamic diameter less than 2.5 μm) on human health have been studied extensively over the past several decades. While there is generally strong evidence that long- and short-term exposures to PM2.5 negatively affect mortality1–3, morbidity4, and respiratory and cardiovascular diseases5, many questions remain regarding its association with a multitude of other health endpoints. Furthermore, the majority of the existing literature on the health effects of air pollution comes from studies conducted in developed countries, where air pollution levels are comparatively low. The availability of air pollution exposure data from different sources such as monitoring networks, satellite data, and chemical transport models has made studying air pollution health effects in low-concentration environments more accessible. Unfortunately, this has led to an underrepresentation of populations in developing countries, where the adverse health impact of air pollution is the highest6.

An example of one such country is Mongolia, where daily ambient PM2.5 levels during winter in its capital Ulaanbaatar (UB) regularly exceed 150 μg/m3 (Figure 1), compared with the WHO air quality guideline of 25 μg/m3 (ref. 7). There are two main reasons for the high levels of pollution in UB. First, in the early 2000s there was an influx into the city of rural herders who had lost all of their livestock in the Mongolian winter disaster called “dzud” and were looking for better employment and education opportunities. They set up their “ger”, a yurt-like traditional dwelling, on the outskirts of the city, and they use coal-burning stoves for heating and cooking during the harsh winter months. This was the start of the so-called “ger district”, which is mostly inhabited by low-income families and now occupies almost 60% of the city. Second, the winter spike in emissions (Figure 1) is exacerbated by UB’s location inside a valley surrounded by mountains, which facilitates temperature inversions in which cold air near the ground is capped by warmer air above, trapping pollutants within the breathing zone.

Figure 1:

Map of study area and data sources of PM2.5 from three organizations: APRA - Air Pollution Reduction Agency of Ulaanbaatar; NAMEM - National Agency for Meteorology and Environmental Monitoring; and the US embassy in Mongolia, which started measuring fine PM in October 2015. a: Showing the 138 khoroos of UB, the extent of city zones, and locations of monitoring stations in the context of both the continent and the nation. b: Measured daily mean PM2.5 showing seasonal variations over the study period (2010–2018).

Air pollution disproportionately harms the people of Mongolia because almost half of the Mongolian population lives in UB, the regional and provincial population center, where emission sources include coal-burning power plants and where a majority of the population lives in the surrounding ger area and burns coal for heating and cooking. In recent years there have been several efforts to measure and identify sources of particulate matter8–11 in UB and to assess its adverse health effects12–14. However, health studies have been weakened by inaccurate and error-prone exposure assignment methods, owing to the small number of monitoring stations and the limited availability of their data.

Environmental epidemiologists use statistical-based exposure assessment models to interpolate measured concentrations to residential and/or school locations. Models such as inverse distance weighting and kriging are frequently used for spatial interpolation and are based on the premise that measurements taken at nearby monitors are more alike than those with a large distance between them. These methods tend to perform better in places with a dense network of monitoring sites. Extensions of these models to incorporate different spatial and temporal covariates such as meteorological data, satellite measurements, and products from chemical transport models have formed the basis of land-use regression. The limitation, however, is that they often assume linear relationships between the covariates and response, which might not be suitable when modeling complex relationships in a variety of terrains.
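
As a minimal sketch of the interpolation idea described above, the following Python function implements plain inverse distance weighting. It is illustrative only and is not part of this study's modeling; the function name `idw_predict`, the `power` exponent of 2, and the two monitor values are assumptions chosen for the example.

```python
import math


def idw_predict(monitors, target, power=2.0):
    """Inverse distance weighted estimate at `target`.

    monitors: list of ((x, y), value) pairs in a projected coordinate
              system (e.g. metres), so Euclidean distance is meaningful.
    target:   (x, y) location at which to interpolate.
    power:    distance exponent (assumed value; 2 is a common default).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in monitors:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            # The target coincides with a monitor: return its measurement.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical monitors 10 km apart with made-up PM2.5 values (ug/m3).
monitors = [((0.0, 0.0), 100.0), ((10_000.0, 0.0), 40.0)]
midpoint = idw_predict(monitors, (5_000.0, 0.0))    # equal weights
near_first = idw_predict(monitors, (1_000.0, 0.0))  # dominated by monitor 1
```

Because the weights depend only on distance, the estimate between monitors is always a weighted average of the measured values, which is one reason such methods need a dense network to resolve local gradients.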

Air pollution prediction studies have found promising results with more adaptable models that do not require strict linearity assumptions (e.g. generalized additive models) and with different machine learning algorithms (e.g. random forest, gradient boosting, and support vector machine). Specifically, random forest has been used to model PM2.5 in the conterminous US15, in an urban area encompassing seven counties in Ohio16, and in Tehran, Iran17, while gradient boosting has been used to predict daily PM2.5 concentrations across China18. Also, Di et al.19 used a combination of machine learning algorithms through an ensemble method to predict PM2.5 across the contiguous US. Xu et al.20 evaluated the performance of 8 different algorithms in British Columbia, Watson et al.21 showed that random forest and gradient boosting performed well in predicting ozone during wildfire events in northern California, and our previous work22 demonstrated that support vector machine was superior to other algorithms when incorporating different mixtures of Multi-angle Imaging SpectroRadiometer (MISR) aerosol optical depth (AOD) measurements over UB.

In this study, we aim to improve the PM2.5 exposure assessment for UB by synthesizing a unique set of locally-obtained data into different machine learning algorithms. We evaluate the performance of various algorithms to predict ambient daily PM2.5 levels in Ulaanbaatar, Mongolia by incorporating a variety of spatial and temporal variables related to air pollution in the city.

Materials and Methods

Setting

UB is the political and economic hub of Mongolia. The city is divided into 9 districts and each district consists of a varying number of administrative units called “khoroo”. A khoroo is a subdivision of the city similar to the census tract in the United States. Currently, there are 152 khoroos and they vary widely in size depending on their population density (Figure 1). We used khoroo as our spatial prediction level instead of a regular grid since automatic geocoding is currently not possible for Mongolian addresses due to a lack of standardized addresses. Furthermore, it will facilitate assigning exposure estimates to study participants in a subsequent epidemiological study based on hospital health records. For model validity, we excluded 3 outlying rural districts (14 khoroos) with no regulatory monitoring sites from our prediction. In terms of meteorology and geography, UB experiences short summers and cold, dry winters and is located approximately 1,300 meters above sea level, along the Tuul River in a valley at the foot of the Bogd Khan mountain.

Air pollution monitoring data

Hourly PM2.5 concentration data covering the period between 2010 and 2018 were obtained from 9 monitoring stations. Five of these stations operate under the purview of the UB Air Pollution Reduction Agency (APRA), 3 are administered by the National Agency for Meteorology and Environmental Monitoring (NAMEM), and the remaining station is located at the US Embassy in UB. Monitoring sites are mainly located along the main avenue connecting the east and west sides of the city (Figure 1). The agencies use different equipment to measure particle concentration: APRA uses optical particle detection (EDM180, GRIMM Aerosol, Germany), while NAMEM (MP101M, ENVEA, France) and the US Embassy (BAM-1020, Met One, US) both use beta ray attenuation technology. The US Embassy data are provided by the US Department of State and are not fully verified or validated. We constructed daily averages from the hourly data and treated a daily PM2.5 concentration as missing if more than 25% of that day’s hourly measurements were missing.
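
The daily averaging rule can be sketched as follows. This is an illustration in Python rather than the R code used in the study; the function name `daily_mean`, the use of `None` for a missing hour, and the example values are assumptions.

```python
def daily_mean(hourly_values, max_missing_fraction=0.25):
    """Daily mean PM2.5 from one day's hourly values.

    hourly_values: list of floats, with None marking a missing hour
                   (normally 24 entries per day).
    Returns None when more than `max_missing_fraction` of the hours are
    missing, mirroring the 25% completeness rule described in the text.
    """
    n_hours = len(hourly_values)
    if n_hours == 0:
        return None
    observed = [v for v in hourly_values if v is not None]
    missing_fraction = (n_hours - len(observed)) / n_hours
    if missing_fraction > max_missing_fraction:
        return None
    return sum(observed) / len(observed)


# Made-up days: 6 of 24 hours missing (exactly 25%, kept) versus
# 7 of 24 hours missing (more than 25%, treated as missing).
day_kept = daily_mean([50.0] * 18 + [None] * 6)
day_dropped = daily_mean([50.0] * 17 + [None] * 7)
```

Under this reading of the rule, a day with exactly 25% of its hours missing is retained, because the threshold applies only to days with more than 25% missing.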

Meteorological data

Eight monitoring sites operated by the government agencies were co-located with weather stations, from which we obtained hourly measurements of surface temperature, atmospheric pressure, wind speed, wind direction, and relative humidity. Because a majority of machine learning algorithms are sensitive to missing data, we calculated the mean and range of each meteorological variable over all available sites and included these summaries in the models, which reduced the missingness of the predictors (about 40% in the original data). Although this approach loses the spatial resolution of the weather variables, we deemed completeness of the predictors more important than the spatial information they provide and retained them only as temporal covariates.
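
A minimal sketch of this across-site summary is shown below, assuming one reading per site for a single time step and a single variable. The function name, the site labels, and the temperatures are hypothetical; the point is only that a summary value exists whenever at least one site reports.

```python
def summarize_across_sites(site_readings):
    """Collapse one time step of one weather variable into mean and range.

    site_readings: dict mapping site name -> value, with None for a site
                   that did not report at this time step.
    Returns None only if every site is missing.
    """
    observed = [v for v in site_readings.values() if v is not None]
    if not observed:
        return None
    return {
        "mean": sum(observed) / len(observed),
        "range": max(observed) - min(observed),
    }


# Hypothetical surface temperatures (deg C); site_B is missing, but the
# city-wide summary is still defined from the two reporting sites.
temperature = summarize_across_sites(
    {"site_A": -20.0, "site_B": None, "site_C": -24.0}
)
```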

Land use and population data

Road length is the length (m) of roads contained in each khoroo, classified as primary, secondary, or tertiary (Urban Development Agency of UB, UDA). City zone variables were constructed by combining UDA’s official classifications of “Central Ger area”, “Middle Ger area”, and “Peri-urban Ger area” into a single “Ger area” category while leaving the other 2 zones (“Urban area”, “Summer house”) unchanged. For these covariates, the values of the khoroo that contains each monitoring site were also assigned to that site. Elevation (m) data were obtained from NASA’s Shuttle Radar Topography Mission of 2000 (ref. 23). To assign elevation to the regulatory monitoring sites, we selected the closest elevation pixel; to assign elevation to each khoroo, we used the average elevation of all pixels that fall inside it.

Yearly total population, number of households with an internet connection, and number of households living in the Ger area for each khoroo were obtained from the UB Department of Statistics. The number of stoves in each khoroo was obtained from the “Report on 2018 Enumeration of Air Pollution Sources in Ulaanbaatar”24 published by the APRA. Total population and stove numbers were divided by their respective khoroo areas (km2) to obtain density estimates, while the number of households with an internet connection and the number of households living in the Ger area were divided by the total number of households in each khoroo to obtain proportion estimates. We also constructed day-of-year and Julian date variables, as well as indicator variables for weekend, Mongolian public holidays, month, and season, using our dataset and the local knowledge of one author (T.E.).

Statistical analysis

We evaluated the following 6 predictive algorithms using leave-one-location-out (LOLO) cross-validation (CV) and a hold-out test set: random forest (RF)25, gradient boosting (GBM)26, support vector machine with a radial basis kernel (SVM)27, multivariate adaptive regression splines (MARS)28, generalized linear model with elastic net penalties (GLMNET)29, and generalized additive model (GAM)30. We developed three separate models for each machine learning algorithm, using data from 1) the entire study period (2010–2018), 2) the cold season (Oct-Mar), and 3) the warm season (Apr-Sep). Datasets were preprocessed to suit each learning algorithm using the R package recipes31. For instance, no preprocessing was done for the tree-based models (RF, GBM) since decision trees are invariant to monotonic transformations of the predictors, whereas algorithms such as SVM are sensitive to differing predictor ranges and require normalization. Each model was trained, validated, and tuned on 85% of the dataset; the remaining 15% was held out as a test set that was not involved in any of these steps.

Model validation and tuning

We used root mean square error (RMSE) and R2 as the performance metrics for LOLO CV when validating our models and tuning their hyperparameters. k-fold CV is a non-exhaustive CV technique that validates models using k equal-sized subsamples. In each of the k iterations, a single subsample is retained as the validation set while the model is trained on the rest, so that each subsample serves as the validation set exactly once. Although this method is computationally efficient, it can be over-optimistic in its estimates of performance metrics when used on spatially or temporally dependent data32. Air pollution data from fixed-site regulatory monitoring stations are a good example of spatiotemporally dependent data. LOLO CV is an exhaustive CV technique that is suitable for modeling such data. We used LOLO CV to train the models on all but one location at a time, repeating this for each unique location in the data (9 sites in our case), and averaged the prediction errors across the repeats to obtain an error estimate. This ensures that no observation from the validation location is involved in training the model, unlike k-fold CV, where observations are distributed among folds at random. Moreover, because we have relatively few locations, we could afford LOLO CV and its more accurate estimates of model RMSE and R2 despite its high computational cost.
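
The LOLO procedure can be sketched in a few lines of Python. This is not the tidymodels code used in the study: the three site labels and PM2.5 values are made up, and the "model" is a deliberately trivial mean predictor (an assumption for illustration) so that the splitting logic stays in focus.

```python
import math


def lolo_splits(records):
    """Yield (held_out_site, train, valid), holding out one site at a time."""
    sites = sorted({r["site"] for r in records})
    for held_out in sites:
        train = [r for r in records if r["site"] != held_out]
        valid = [r for r in records if r["site"] == held_out]
        yield held_out, train, valid


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def fit_mean_model(train):
    """Placeholder model: always predicts the training mean."""
    mean_pm25 = sum(r["pm25"] for r in train) / len(train)
    return lambda record: mean_pm25


# Made-up observations from three hypothetical sites.
records = [
    {"site": "S1", "pm25": 100.0}, {"site": "S1", "pm25": 120.0},
    {"site": "S2", "pm25": 60.0}, {"site": "S2", "pm25": 80.0},
    {"site": "S3", "pm25": 20.0}, {"site": "S3", "pm25": 40.0},
]

fold_rmse = {}
for held_out, train, valid in lolo_splits(records):
    model = fit_mean_model(train)  # never sees the held-out site
    predictions = [model(r) for r in valid]
    fold_rmse[held_out] = rmse([r["pm25"] for r in valid], predictions)

# Errors are averaged across the held-out sites to give one CV estimate.
cv_rmse = sum(fold_rmse.values()) / len(fold_rmse)
```

The key property is that every observation from the validation site is excluded from training, so the error estimate reflects prediction at an unmonitored location rather than at a monitored one on a different day.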

Each learning algorithm has its own set of tuning parameters, and parameter values optimized for a given dataset give better prediction performance. In general, parameter tuning is a search through a parameter space, laid out as a grid (regular or random), for the values that perform best. We used a space-filling design called maximum entropy to fill our parameter grid with 30 rows. Model parameters were chosen after training 270 models, that is, 9 resamples (one per held-out site) for each of the 30 parameter combinations selected by the maximum entropy design. The parameters with the lowest RMSE were then used to fit each model on the full training data (Table S1). The final fitted models were then used to predict PM2.5 on the hold-out test set to evaluate their performance on data that were not involved in the model building process.
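
The bookkeeping behind the 270 model fits can be sketched as follows. Several things here are assumptions made for illustration: uniform random sampling stands in for the maximum entropy design, the parameter names `mtry` and `min_n` and their ranges are hypothetical, and `placeholder_rmse` is a synthetic scoring function used in place of actually fitting a model.

```python
import random
from itertools import product
from statistics import mean

rng = random.Random(0)
SITES = [f"site_{i}" for i in range(1, 10)]  # 9 held-out locations
N_CANDIDATES = 30                            # rows in the parameter grid

# Stand-in for a space-filling design: random draws of two hypothetical
# hyperparameters (not the study's actual grid).
candidates = [
    {"mtry": rng.randint(2, 15), "min_n": rng.randint(2, 40)}
    for _ in range(N_CANDIDATES)
]


def placeholder_rmse(params, held_out_site):
    """Synthetic score; a real run would fit a model and validate it."""
    site_offset = SITES.index(held_out_site)
    return 30.0 + abs(params["mtry"] - 6) + 0.1 * abs(params["min_n"] - 10) + site_offset


# One fit per (candidate, held-out site) pair: 30 x 9 = 270 fits.
fits = [
    (index, site, placeholder_rmse(candidates[index], site))
    for index, site in product(range(N_CANDIDATES), SITES)
]

# Average RMSE over the 9 resamples and keep the lowest-scoring candidate.
mean_rmse = {
    index: mean(score for i, _, score in fits if i == index)
    for index in range(N_CANDIDATES)
}
best_index = min(mean_rmse, key=mean_rmse.get)
best_params = candidates[best_index]
```

In a real workflow, `best_params` would then be used to refit the model once on the full training data before evaluating it on the hold-out test set.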

Statistical computing and code availability

All data analysis and modeling were conducted in R 3.6.1 (ref. 33). The “tidyverse” set of packages34 was used extensively for data cleaning and manipulation. Geographic calculations in the WGS 84 / UTM Zone 48N projection (EPSG:32648) were carried out using the sf35 package. Cross-validation36, model tuning37, and model fitting38 were implemented with the “tidymodels” ecosystem of packages. Source code is available upon request.

Results

Summary statistics of observed PM2.5 concentrations (μg/m3) and all the covariates used in predicting PM2.5 are shown in Table 1. Spatiotemporal covariates vary by khoroo in space and by year in time, spatial covariates vary only by khoroo, and temporal covariates vary only by day.

Table 1.

Variables used to predict PM2.5

Variable | Mean | SD | Data source
PM2.5, μg/m3 | | |
Entire Period | 70.44 | 67.31 | 3 monitors from the National Agency for Meteorology and Environmental Monitoring (NAMEM), 5 monitors from the Ulaanbaatar Air Pollution Reduction Agency (APRA), and 1 monitor from the U.S. Embassy in Mongolia
Cold Season (Oct-Mar) | 112.50 | 70.80 |
Warm Season (Apr-Sep) | 26.74 | 16.82 |
Spatial | | |
Monitor Longitude | 106.88 | 0.07 | NAMEM
Monitor Latitude | 47.92 | 0.02 |
Stove Density, per km2 | 525.36 | 559.01 | APRA
Road Length, m | 1371.16 | 1123.20 | Urban Development Agency of Ulaanbaatar City
City Zones | NA | NA |
Elevation, m | 1320.11 | 17.10 | Shuttle Radar Topography Mission, NASA
Temporal | | |
Atmospheric Temperature, °C | 0.13 | 14.92 | NAMEM
Wind Speed, m/s | 0.90 | 0.55 |
Wind Direction, ° | 194.92 | 112.33 |
Relative Humidity, % | 56.98 | 13.47 |
Surface Pressure, mmHg | 866.87 | 5.85 |
Spatiotemporal | | |
Proportion of Households with an Internet Connection | 0.39 | 0.31 | Statistics Department of Ulaanbaatar City
Proportion of Households Living in the Ger Area | 0.54 | 0.48 |
Population Density, per km2 | 16286.41 | 12422.48 |

We report performance metrics separately for the three periods over which models were fit, for both LOLO CV and the hold-out test set. Table 2 shows the model accuracy (RMSE) and model consistency/correlation (R2) measures for each prediction algorithm. The RF (LOLO CV for the entire period: RMSE = 29.52 μg/m3, R2 = 0.82) and GBM models (LOLO CV for the entire period: RMSE = 30.02 μg/m3, R2 = 0.82) consistently performed best across all three models (entire period, cold season, and warm season) and in both LOLO CV and the hold-out test set. Overall, the models that used observations from the entire study period performed better than the models using observations from only the cold or warm season (Figure S1 and Figure S2). We also observed a persistent pattern in which cold-season models had higher R2 whereas warm-season models had lower RMSE. In general, performance was poorer in LOLO CV than on the 15% hold-out test set. This gap was most striking for the GAM, which produced relatively good performance metrics on the hold-out test set despite yielding a much higher RMSE than the other models in LOLO CV.
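
For reference, the two metrics can be written out as below. This is a sketch under an assumption: because the text describes R2 as a consistency/correlation measure, it is shown here as the squared Pearson correlation between observed and predicted values. The exact definition used is not stated, and the coefficient of determination, 1 − SSE/SST, is the common alternative. The symbols y_i (observed daily PM2.5), ŷ_i (predicted daily PM2.5), and n (number of observations) are notation introduced here.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = \left[
  \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}
       {\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}\,
        \sqrt{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}}
\right]^2
```

RMSE is expressed in the units of the outcome (μg/m3), so lower values indicate better accuracy, while R2 is unitless and higher values indicate closer agreement between observed and predicted concentrations.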

Table 2.

Comparison of model performance metrics for prediction of PM2.5

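Sample sizes: Entire Period, n = 12 590; Cold Season, n = 6 416; Warm Season, n = 6 175. RMSE is in μg/m3.

Validation | Model | Entire Period RMSE | Entire Period R2 | Cold Season RMSE | Cold Season R2 | Warm Season RMSE | Warm Season R2
Leave-One-Location-Out Cross Validation | Random Forest | 29.52 | 0.82 | 39.58 | 0.72 | 12.26 | 0.49
Leave-One-Location-Out Cross Validation | Gradient Boosting | 30.02 | 0.82 | 40.35 | 0.71 | 12.29 | 0.47
Leave-One-Location-Out Cross Validation | SVM | 38.92 | 0.72 | 52.29 | 0.56 | 14.97 | 0.29
Leave-One-Location-Out Cross Validation | MARS | 37.42 | 0.72 | 48.78 | 0.58 | 15.37 | 0.25
Leave-One-Location-Out Cross Validation | GLMNET | 38.91 | 0.70 | 51.96 | 0.54 | 15.73 | 0.19
Leave-One-Location-Out Cross Validation | GAM | 69.81 | 0.68 | 108.41 | 0.53 | 85.34 | 0.13
Hold-Out Test | Random Forest | 12.92 | 0.96 | 21.23 | 0.92 | 7.44 | 0.84
Hold-Out Test | Gradient Boosting | 21.29 | 0.90 | 28.30 | 0.84 | 9.47 | 0.68
Hold-Out Test | SVM | 33.31 | 0.76 | 43.94 | 0.65 | 13.13 | 0.42
Hold-Out Test | MARS | 34.15 | 0.75 | 42.60 | 0.64 | 13.35 | 0.37
Hold-Out Test | GLMNET | 38.19 | 0.70 | 48.93 | 0.55 | 15.40 | 0.23
Hold-Out Test | GAM | 33.06 | 0.76 | 39.52 | 0.69 | 12.95 | 0.41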

A scatter plot between observed and predicted values from the RF model (Figure 2) demonstrates that the model is somewhat underpredicting at higher PM2.5 concentrations in LOLO CV. This tendency, however, is not as noticeable in the hold-out test set for the entire period. Figure 3 displays the average seasonal predictions from the RF model overlaid on the khoroo map in the context of the population density of UB. The predictions display higher PM2.5 concentrations in the north side of the city where most of the ger area is located (Figure 1) for both cold and warm seasons.

Figure 2:

Random Forest model predictions from leave-one-location-out cross-validation and hold-out test set plotted against observed PM2.5 for the entire study period.

Figure 3:

Predictions from the Random Forest model applied to the entire study period are shown here as cold (Oct-Mar) and warm (Apr-Sep) season averages in the context of population density of UB. Only khoroos with more than 1 000 people per km2 are shown here.

In addition, we calculated variable importance scores for our best-performing model, RF, and show the 10 variables with the highest scores in Figure 4. Temperature, wind, and date variables, as well as the densities of stoves and primary roads, were the most predictive of PM2.5 concentration in UB according to the RF model.

Figure 4:

Variable importance scores for the 10 most important variables in the Random Forest model (descending order).

Discussion

We evaluated the performance of six different machine learning algorithms and used the best-performing model to predict daily PM2.5 concentrations from 2010 to 2018 in each khoroo of Ulaanbaatar, Mongolia, a city with dangerous air pollution levels yet limited monitoring capacity. Our study has demonstrated the feasibility of predicting ground-level PM2.5 using machine learning models for irregularly sized locales, such as administrative areas, that have an inadequate air pollution monitoring network. We found that decision tree-based ensemble models such as RF and GBM had the most predictive power in both LOLO CV and the hold-out test set. We also observed that the RF predictions for UB khoroos produced maps whose spatial and temporal variations match the location of the ger area and the seasonal cycle. Further, the most important variables in predicting PM2.5 concentrations consisted of a mix of meteorological and land-use variables.

Decision trees can be either classification or regression trees, depending on the outcome type. Briefly, they are constructed by applying splitting rules to successively smaller partitions (nodes) of the data. These split rules are usually based on the variance (heterogeneity) or class diversity (node impurity) of the nodes. Although decision trees are very interpretable and capture higher-order interactions, they are prone to overfitting and highly sensitive to small perturbations of the data39. These limitations are substantially mitigated by ensemble methods that use bagging (RF) or boosting (GBM). In RF, each split considers only a random subset of the variables instead of all of them, which allows split candidates to be chosen that would not have been considered otherwise. GBM, on the other hand, uses gradient descent to minimize any differentiable loss function iteratively while also training each model on a subset of the original data40.
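
To make the boosting idea concrete, the following is a minimal Python sketch of gradient boosting for squared-error loss using one-split regression trees (stumps) on a single predictor. It is not the GBM implementation used in this study: it omits the row subsampling mentioned above, and the temperature and PM2.5 values, the learning rate, and the number of rounds are made-up illustrative choices.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature (least squares)."""
    best = None
    distinct = sorted(set(x))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Additive model of stumps, each fit to the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [
            pi + learning_rate * stump(xi) for xi, pi in zip(x, predictions)
        ]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Made-up data: colder days paired with higher PM2.5 (ug/m3).
x = [-30.0, -20.0, -10.0, 0.0, 10.0, 20.0]
y = [200.0, 150.0, 110.0, 60.0, 30.0, 20.0]

model = gradient_boost(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [model(xi) for xi in x])
```

Each round takes a small step (scaled by the learning rate) toward the residuals left by the previous rounds, which is how a sequence of weak trees accumulates into a flexible predictor. An RF would instead fit many deep trees independently on resampled data and average them.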

Our RF model results (LOLO CV R2 = 0.82, test R2 = 0.96) are comparable to or better than those of similar studies that predict PM2.5 using machine learning algorithms over different geographical areas, such as the contiguous US15,19, China18, northern California41, and British Columbia20, as well as metropolitan areas like Cincinnati, OH16 and Tehran, Iran17,42. For the entire US, Hu et al.15 estimated an overall CV R2 of 0.80 using RF with a convolutional layer as a covariate, and Di et al.19 produced an average CV R2 of 0.86 using an ensemble model that incorporated information from neural network, RF, and GBM models, while Zhan et al.18 modeled PM2.5 across China using a geographically weighted GBM, which resulted in a CV R2 of 0.76. Multiple studies have compared machine learning techniques for the prediction of PM2.5 concentrations. Reid et al.41 looked at the performance of 11 different algorithms to predict fine PM exposure during the 2008 Northern California wildfires and found that GBM performed the best, with a CV R2 of 0.80. In British Columbia, Xu et al.20 compared 8 models and found that Cubist, RF, and GBM performed better than the others, with Cubist having the highest CV R2 of 0.48. For smaller-area prediction, Brokamp et al.16 used RF in the Cincinnati, OH metro area and produced a very high overall CV R2 of 0.91, while Nabavi et al.17 and Zamani Joharestani et al.42 found best CV R2 values of 0.68 using RF and 0.81 using GBM, respectively, after evaluating a few different algorithms in Tehran, Iran. All of the above studies used a similar set of covariates, including observed ground-level PM2.5 levels, meteorological parameters, land-use variables, and different types of aerosol optical depth (AOD) measurements. Some of the studies15,16,19 used a convolutional layer as a predictor in their models to account for spatial and temporal autocorrelation of PM2.5 levels.

The main difference between our models and those above is that we did not include AOD as a predictor, mainly because preliminary analyses (not shown) indicated almost no difference in model performance when incorporating the 1 km × 1 km MAIAC implementation of MODIS AOD43. Our previous work22 explored the performance of machine learning algorithms for predicting particulate matter in Mongolia using high-dimensional Multi-angle Imaging SpectroRadiometer (MISR) aerosol measurements. We found moderate predictive performance (CV R2 of 0.46 for PM2.5) using SVM and demonstrated the ability of the MISR AOD mixture set to differentiate particulate types, including sulfates from sulfur-rich coal, in UB. However, satellite retrievals were problematic in wintertime because of the bright surface created by snow cover, so that study mostly focused on summertime AOD-PM2.5 associations. The current study has more complete data in the winter, and we predict PM2.5 for irregular administrative areas (khoroos) instead of regular grids. Future work will incorporate additional satellite products (MODIS, MISR, MERRA2) with more sophisticated techniques for gap-filling missing observations.

Furthermore, there is increasing interest in incorporating particulate matter measurements from low-cost sensors into air pollution prediction models, as they can fill spatiotemporal gaps in the currently available data. The traditional gravimetric reference method and equivalent methods, such as tapered element oscillating microbalance or beta-ray attenuation monitoring, not only have high initial equipment costs but also require regular maintenance throughout their lifetimes. As a result, these monitors are used for regulatory purposes and located mostly along main roads or in industrial areas, which limits their ability to capture the spatial and temporal variability of particulate matter44. These limitations can be addressed by low-cost sensors (typically < $1 000), which also offer portability, ease of use and maintenance, and the ability to be deployed much more densely than reference monitors45. However, these features come with measurement uncertainty inherent in their measurement technique (light scattering). In addition to the uncertainty in particle counts measured by light scattering at different temperature and humidity levels, further uncertainty arises because each manufacturer uses its own proprietary algorithm (mostly unavailable to users) to convert particle count to mass by assuming a constant particle shape and density46. Previous studies have demonstrated that, despite the good correlation between low-cost sensors and reference monitors, low-cost sensors need preliminary testing and calibration at the locations where they are to be deployed because of differences in emission sources, particulate composition, and meteorological factors47,48.

Currently, our RF model predictions are mostly driven by temporal predictors such as weather and time variables, as shown in Figure 4. A study in California49 showed that adding low-cost sensor network data to a reference-monitor-only model increased the importance of spatial variables. We believe that low-cost sensor data from the ger district, after sufficient pre-testing and calibration, could show a similar trend and enhance the fine-scale variability of PM2.5 predictions produced by machine learning models in UB. In recent years, a few studies incorporating low-cost sensor data into air pollution prediction models have started to emerge. For example, Masiol et al.50 used hourly low-cost sensor data to supplement a land-use regression model and increase the spatial and temporal coverage of their predictions in Monroe County, New York, US, while Bi et al.49 incorporated measurements from an existing low-cost sensor network in Imperial County, California, US into their RF model and showed that low-cost sensor data could improve the accuracy of PM2.5 predictions (R2 increase of ~0.2 over a regulatory-monitors-only model) and add spatial detail. In their other work51, Bi et al. presented a strategy of down-weighting calibrated low-cost sensor data from the PurpleAir network in their RF model and produced a model with better accuracy and correlation measures than a model based on EPA monitors only.

Several studies have sought to determine the sources and composition of particulate matter and to model air pollution exposure in UB. Davy et al.8 determined that coal combustion, biomass burning, and motor vehicles are the largest contributors to fine PM concentrations and that the contribution of coal combustion, in particular, increased significantly during winter in UB. Nishikawa et al.11 and Batmunkh et al.9 reached similar conclusions, finding that soot and organic carbon were highly correlated during the heating season, likely as a result of coal combustion. Our khoroo-level PM2.5 predictions from the RF model capture this pattern very well (Figure 1, Figure 3, and Figure S3), displaying the spatial difference between the ger area and the city area and the temporal difference between cold and warm seasons in PM2.5 exposure within the city. Moreover, according to Ganbat et al.52, daytime and nighttime differences in local urban winds affect the temperature inversion layer and trap pollutants in the boundary layer during wintertime. This is in line with our variable importance scores for the RF model (Figure 4), which show that temperature and wind speed, along with UB-specific covariates such as the density of coal stoves, are important in predicting PM2.5 in the city.

Further, efforts to model ambient PM2.5 exposure for epidemiological studies have been almost nonexistent in UB. We hope to alleviate this issue and use our RF model to predict PM2.5 at the khoroo level of UB in conducting epidemiological studies looking at adverse health effects of particulate matter.

A few strengths and limitations of our study should be mentioned. First, to our knowledge, this is the first attempt to statistically model ambient PM2.5 exposure at a small spatiotemporal scale in UB. Second, the sources and composition of particulate matter differ between developed and developing countries, and the inclusion of locale-specific covariates such as stove density and the proportion of households with an internet connection (a proxy for socioeconomic status) helped to model this variability better. Third, leave-one-location-out cross-validation is a much more rigorous and appropriate technique for spatiotemporally dependent air pollution data than k-fold cross-validation; we were able to use LOLO CV because of the relatively low number of monitoring stations in our data. In terms of limitations, first, we were unable to incorporate AOD measurements into our model because of the large proportion of values missing due to snow cover and clouds in UB. Second, predicting on irregular areas resulted in larger khoroos being assigned a single concentration level across their whole area. Third, the locations of the monitors were not representative of the whole city, with only one monitor measuring solely ger area PM2.5 levels (Figure 1).

In conclusion, we demonstrated the strengths of machine learning algorithms for predicting PM2.5 in a setting with sparse monitoring and unique pollution sources, producing model performance metrics better than or comparable to those of similar works. In May 2019, the UB city administration banned the use of raw coal within the contiguous city boundary (excluding satellite districts). Our model will allow us to examine the impact of this ban on air quality in the UB region, and with continued collection of monitoring data and health records we will be able to assess any health benefits.

Supplementary Material

1

Acknowledgements

T.E. would like to express his gratitude to Dr. David Warburton of the Saban Research Institute, Children’s Hospital Los Angeles and Dr. Rima Habre of the Department of Preventive Medicine, University of Southern California for their support and advice. The authors would also like to thank Unurbat Dorj from NAMEM and Sanchir Dash from APRA for their help and support in acquiring and understanding UB air pollution data.

Funding

Doctoral training of T.E. was supported by the National Institutes of Health Fogarty International Center/National Institute of Environmental Health Sciences demonstration and education grant (1D43ES022862-01A1) between 2014 and 2017.

Footnotes

Conflict of Interest

The authors declare no competing financial interests in relation to the work described.

Contributor Information

Temuulen Enebish, Department of Preventive Medicine, University of Southern California

Khang Chau, Department of Preventive Medicine, University of Southern California

Batbayar Jadamba, Department of Environmental Monitoring, National Agency for Meteorology and Environmental Monitoring of Mongolia

Meredith Franklin, Department of Preventive Medicine, University of Southern California

References

  • 1. Franklin M, Zeka A, Schwartz J. Association between PM2.5 and all-cause and specific-cause mortality in 27 US communities. Journal of Exposure Science and Environmental Epidemiology 2007; 17: 279–287.
  • 2. Di Q, Wang Y, Zanobetti A, Wang Y, Koutrakis P, Choirat C et al. Air Pollution and Mortality in the Medicare Population. New England Journal of Medicine 2017; 376: 2513–2522.
  • 3. Pope CA, Coleman N, Pond ZA, Burnett RT. Fine particulate air pollution and human mortality: 25+ years of cohort studies. Environmental Research 2019: 108924.
  • 4. Lippmann M, Ito K, Nádas A, Burnett RT. Association of particulate matter components with daily mortality and morbidity in urban populations. Research Report (Health Effects Institute) 2000: 5–72, discussion 73–82.
  • 5. Brook RD, Rajagopalan S, Pope CA, Brook JR, Bhatnagar A, Diez-Roux AV et al. Particulate Matter Air Pollution and Cardiovascular Disease: An Update to the Scientific Statement From the American Heart Association. Circulation 2010; 121: 2331–2378.
  • 6. Cohen AJ, Brauer M, Burnett R, Anderson HR, Frostad J, Estep K et al. Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: An analysis of data from the Global Burden of Diseases Study 2015. The Lancet 2017; 389: 1907–1918.
  • 7. World Health Organization. WHO Air quality guidelines for particulate matter, ozone, nitrogen dioxide and sulfur dioxide: Global update 2005: Summary of risk assessment. Geneva: World Health Organization, 2006.
  • 8. Davy PK, Gunchin G, Markwitz A, Trompetter WJ, Barry BJ, Shagjjamba D et al. Air particulate matter pollution in Ulaanbaatar, Mongolia: Determination of composition, source contributions and source locations. Atmospheric Pollution Research 2011; 2: 126–137.
  • 9. Batmunkh T, Kim YJ, Jung JS, Park K, Tumendemberel B. Chemical characteristics of fine particulate matters measured during severe winter haze events in Ulaanbaatar, Mongolia. Journal of the Air & Waste Management Association 2013; 63: 659–670.
  • 10. Guttikunda SK, Lodoysamba S, Bulgansaikhan B, Dashdondog B. Particulate pollution in Ulaanbaatar, Mongolia. Air Quality, Atmosphere and Health 2013; 6: 589–601.
  • 11. Nishikawa M, Matsui I, Batdorj D, Jugder D, Mori I, Shimizu A et al. Chemical composition of urban airborne particulate matter in Ulaanbaatar. Atmospheric Environment 2011; 45: 5710–5715.
  • 12. Allen RW, Gombojav E, Barkhasragchaa B, Byambaa T, Lkhasuren O, Amram O et al. An assessment of air pollution and its attributable mortality in Ulaanbaatar, Mongolia. Air Quality, Atmosphere and Health 2013; 6: 137–150.
  • 13. Enkh-Undraa D, Kanda S, Shima M, Shimono T, Miyake M, Yoda Y et al. Coal burning-derived SO2 and traffic-derived NO2 are associated with persistent cough and current wheezing symptoms among schoolchildren in Ulaanbaatar, Mongolia. Environmental Health and Preventive Medicine 2019; 24: 66.
  • 14. Enkhmaa D, Warburton N, Javzandulam B, Uyanga J, Khishigsuren Y, Lodoysamba S et al. Seasonal ambient air pollution correlates strongly with spontaneous abortion in Mongolia. BMC pregnancy and childbirth 2014; 14: 146.
  • 15. Hu X, Belle JH, Meng X, Wildani A, Waller LA, Strickland MJ et al. Estimating PM 2.5 Concentrations in the Conterminous United States Using the Random Forest Approach. Environmental Science & Technology 2017; 51: 6936–6944.
  • 16. Brokamp C, Jandarov R, Hossain M, Ryan P. Predicting Daily Urban Fine Particulate Matter Concentrations Using a Random Forest Model. Environmental Science & Technology 2018; 52: 4173–4179.
  • 17. Nabavi SO, Haimberger L, Abbasi E. Assessing PM2.5 concentrations in Tehran, Iran, from space using MAIAC, deep blue, and dark target AOD and machine learning algorithms. Atmospheric Pollution Research 2019; 10: 889–903.
  • 18. Zhan Y, Luo Y, Deng X, Chen H, Grieneisen ML, Shen X et al. Spatiotemporal prediction of continuous daily PM 2.5 concentrations across China using a spatially explicit machine learning algorithm. Atmospheric Environment 2017; 155: 129–139.
  • 19. Di Q, Amini H, Shi L, Kloog I, Silvern R, Kelly J et al. An ensemble-based model of PM2.5 concentration across the contiguous United States with high spatiotemporal resolution. Environment International 2019; 130: 104909.
  • 20. Xu Y, Ho HC, Wong MS, Deng C, Shi Y, Chan T-C et al. Evaluation of machine learning techniques with multiple remote sensing datasets in estimating monthly concentrations of ground-level PM2.5. Environmental Pollution 2018; 242: 1417–1426.
  • 21. Watson GL, Telesca D, Reid CE, Pfister GG, Jerrett M. Machine learning models accurately predict ozone exposure during wildfire events. Environmental Pollution 2019; 254: 112792.
  • 22. Franklin M, Chau K, Kalashnikova O, Garay M, Enebish T, Sorek-Hamer M. Using Multi-Angle Imaging SpectroRadiometer Aerosol Mixture Properties for Air Quality Assessment in Mongolia. Remote Sensing 2018; 10: 1317.
  • 23. Jarvis A, Reuter HI, Nelson A, Guevara E. Hole-filled seamless srtm data version 4. International Center for Tropical Agriculture (CIAT), available at: http://srtmcsicgiar.org (last access: 27 June 2019) 2008.
  • 24. Narmandakh L, Galymbek K, Tsatsral B. Report on 2018 Enumeration of Air Pollution Sources in Ulaanbaatar. UB Air Pollution Reduction Agency: Ulaanbaatar, Mongolia, 2018.
  • 25. Wright MN, Ziegler A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 2017; 77: 1–17.
  • 26. Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H et al. Xgboost: Extreme gradient boosting. 2019. https://CRAN.R-project.org/package=xgboost.
  • 27. Karatzoglou A, Smola A, Hornik K, Zeileis A. Kernlab – an S4 package for kernel methods in R. Journal of Statistical Software 2004; 11: 1–20.
  • 28. Milborrow S. Derived from mda:mars by Trevor Hastie and Rob Tibshirani. Uses Alan Miller’s Fortran utilities with Thomas Lumley’s leaps wrapper. Earth: Multivariate adaptive regression splines. 2019. https://CRAN.R-project.org/package=earth.
  • 29. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 2010; 33: 1–22.
  • 30. Wood SN, Pya N, Säfken B. Smoothing parameter and model selection for general smooth models (with discussion). Journal of the American Statistical Association 2016; 111: 1548–1575.
  • 31. Kuhn M, Wickham H. Recipes: Preprocessing tools to create design matrices. 2019. https://github.com/tidymodels/recipes.
  • 32. Roberts DR, Bahn V, Ciuti S, Boyce MS, Elith J, Guillera-Arroita G et al. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 2017; 40: 913–929.
  • 33. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing: Vienna, Austria, 2019. https://www.R-project.org/.
  • 34. Wickham H, Averick M, Bryan J, Chang W, McGowan LD, Francois R et al. Welcome to the tidyverse. Journal of Open Source Software 2019; 4: 1686.
  • 35. Pebesma E. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 2018; 10: 439–446.
  • 36. Kuhn M, Chow F, Wickham H. Rsample: General resampling infrastructure. 2019.
  • 37. Kuhn M. Tune: Tidy tuning tools. 2019. https://github.com/tidymodels/tune.
  • 38. Kuhn M, Vaughan D. Parsnip: A common api to modeling and analysis functions. 2019.
  • 39. Breiman L (ed.). Classification and regression trees. Repr. Chapman & Hall [u.a.]: Boca Raton, 1998.
  • 40. Bi Q, Goodman KE, Kaminsky J, Lessler J. What is Machine Learning? A Primer for the Epidemiologist. American Journal of Epidemiology 2019. doi: 10.1093/aje/kwz189.
  • 41. Reid CE, Jerrett M, Petersen ML, Pfister GG, Morefield PE, Tager IB et al. Spatiotemporal Prediction of Fine Particulate Matter During the 2008 Northern California Wildfires Using Machine Learning. Environmental Science & Technology 2015; 49: 3887–3896.
  • 42. Zamani Joharestani M, Cao C, Ni X, Bashir B, Talebiesfandarani S. PM2.5 Prediction Based on Random Forest, XGBoost, and Deep Learning Using Multisource Remote Sensing Data. Atmosphere 2019; 10: 373.
  • 43. Lyapustin A, Wang Y, Korkin S, Huang D. MODIS Collection 6 MAIAC algorithm. Atmospheric Measurement Techniques 2018; 11: 5741–5765.
  • 44. Snyder EG, Watkins TH, Solomon PA, Thoma ED, Williams RW, Hagler GS et al. The changing paradigm of air pollution monitoring. 2013.
  • 45. Morawska L, Thai PK, Liu X, Asumadu-Sakyi A, Ayoko G, Bartonova A et al. Applications of low-cost sensing technologies for air quality monitoring and exposure assessment: How far have they gone? Environment International 2018; 116: 286–299.
  • 46. Castell N, Dauge FR, Schneider P, Vogt M, Lerner U, Fishbain B et al. Can commercial low-cost sensor platforms contribute to air quality monitoring and exposure estimates? Environment International 2017; 99: 293–302.
  • 47. Bulot FMJ, Johnston SJ, Basford PJ, Easton NHC, Apetroaie-Cristea M, Foster GL et al. Long-term field comparison of multiple low-cost particulate matter sensors in an outdoor urban environment. Scientific Reports 2019; 9. doi: 10.1038/s41598-019-43716-3.
  • 48. Kelly KE, Whitaker J, Petty A, Widmer C, Dybwad A, Sleeth D et al. Ambient and laboratory evaluation of a low-cost particulate matter sensor. Environmental Pollution 2017; 221: 491–500.
  • 49. Bi J, Stowell J, Seto EYW, English PB, Al-Hamdan MZ, Kinney PL et al. Contribution of low-cost sensor measurements to the prediction of PM2.5 levels: A case study in Imperial County, California, USA. Environmental Research 2020; 180: 108810.
  • 50. Masiol M, Zíková N, Chalupa DC, Rich DQ, Ferro AR, Hopke PK. Hourly land-use regression models based on low-cost PM monitor data. Environmental Research 2018; 167: 7–14.
  • 51. Bi J, Wildani A, Chang HH, Liu Y. Incorporating Low-Cost Sensor Measurements into High-Resolution PM 2.5 Modeling at a Large Spatial Scale. Environmental Science & Technology 2020; 54: 2152–2162.
  • 52. Ganbat G, Baik JJ. Wintertime winds in and around the Ulaanbaatar metropolitan area in the presence of a temperature inversion. Asia-Pacific Journal of Atmospheric Sciences 2016; 52: 309–325.
