Author manuscript; available in PMC: 2020 Oct 1.
Published in final edited form as: Environ Int. 2019 Jul 27;131:105022. doi: 10.1016/j.envint.2019.105022

Mapping urban air quality using mobile sampling with low-cost sensors and machine learning in Seoul, South Korea

Chris C Lim 1, Ho Kim 2, M J Ruzmyn Vilcassim 1, George D Thurston 1, Terry Gordon 1, Lung-chi Chen 1, Kiyoung Lee 2, Michael Heimbinder 3, Sun-Young Kim 4
PMCID: PMC6728172  NIHMSID: NIHMS1536250  PMID: 31362154

Abstract

Recent studies have demonstrated that mobile sampling can improve the spatial granularity of land use regression (LUR) models. Mobile sampling campaigns deploying low-cost (<$300) air quality sensors could potentially offer an inexpensive and practical approach to measure and model air pollution concentration levels. In this study, we developed LUR models for street-level fine particulate matter (PM2.5) concentration levels in Seoul, South Korea. A total of 169 hours of data were collected during an approximately three-week campaign across five routes by ten volunteers sharing seven AirBeams (low-cost, $250 per unit, smartphone-based particle counters), while geospatial data were extracted from OpenStreetMap, an open-source and crowd-generated geographical dataset. We applied and compared three statistical approaches in constructing the LUR models – linear regression (LR), random forest (RF), and stacked ensemble (SE) combining multiple machine learning algorithms – which resulted in cross-validation R2 values of 0.63, 0.73, and 0.80, respectively, and in the identification of several pollution ‘hotspots.’ The high R2 values suggest that study designs employing mobile sampling in conjunction with multiple low-cost air quality monitors could be applied to characterize urban street-level air quality with high spatial resolution, and that machine learning models could further improve model performance. Given this study design’s cost-effectiveness and ease of implementation, similar approaches may be especially suitable for citizen science and community-based endeavors, or in regions bereft of air quality data and preexisting air monitoring networks, such as developing countries.

1. INTRODUCTION

Ambient air pollution is a major global public health concern, with the World Health Organization estimating that 4.2 million premature deaths annually are attributable to fine particulate matter (PM2.5) exposure (WHO, 2018). Government and regulatory agencies throughout the world have traditionally relied on networks of fixed-site monitors in order to measure air quality and establish standards. Owing to their prohibitive equipment and operational costs, these monitors tend to be sparsely located even in large metropolitan cities, or may be entirely missing in many locales. However, as concentrations of air pollutants can vary markedly over small distances and short time periods, the urban environment cannot be fully characterized using information from sparse, static networks of air pollution monitors (Kumar et al., 2015). To empirically model and characterize the spatial or spatiotemporal variability of PM2.5 concentrations, land use regression (LUR) models based on data from monitoring networks have been employed. Recently, LUR models based on data collected from mobile sampling designs – where predetermined locations or routes are repeatedly sampled on modes of transport – have gained traction, offering improved spatial resolution at a lower cost (e.g., Hankey and Marshall, 2015; Shi et al., 2016; Deville Cavellin et al., 2016).

Recent technological advancements and proliferation of air quality sensors offer additional avenues to refine the spatiotemporal characterization of air pollution levels. Numerous instruments from commercial entities, non-profits, and startups have entered the market to date (Borghi et al., 2017; McKerchner et al., 2017), although the performance of these sensors can differ substantially between models as well as between individual units, as noted by evaluations in field and laboratory settings (Jiao et al., 2016; Jerrett et al., 2017; Castell et al., 2017; Kelly et al., 2017; Feinberg et al., 2018; Levy Zamora et al., 2019). Offering the capability to inexpensively generate a large volume of data, distributed networks of low-cost air quality sensors are beginning to be established to augment existing monitoring networks or provide novel real-time data streams (Gao et al., 2015; Schneider et al., 2017; Zikova et al., 2017). Noteworthy examples of collaborative endeavors between government agencies, research organizations, and communities include: ‘OpenSense’ in Zurich, Switzerland (Hasenfratz et al., 2015), ‘Array of Things’ in Chicago, U.S. (Catlett et al., 2017), and the Imperial County Community Air Monitoring Network (English et al., 2017) in California, U.S.

LUR models based on data collected from mobile sampling with low-cost (<$300) consumer-grade sensors remain very limited, although such designs could offer a highly cost-effective approach to model and map air pollution concentration levels. The main aim of this study was to deploy multiple units of the ‘AirBeam’, a smartphone-based particle counter, to measure and model street-level urban air quality in Seoul, South Korea, a location with limited fixed regulatory monitoring sites relative to its high population and diverse urban environments. The individual AirBeam units were first collocated with a pDR-1500 within a laboratory setting to adjust for intra-instrument variability and convert particle counts to mass equivalents, after which a mobile sampling campaign was conducted by repeatedly walking along five routes during an approximately three-week period. The collected air pollution data, together with geographical data from OpenStreetMap (OSM), an openly available and crowd-sourced source, were then used to construct LUR models with both linear regression and machine learning methods. This work explores the potential of mobile sampling with low-cost air quality sensors, machine learning models, and ‘open data’ sources to characterize street-level air quality in urban locations with fine spatial resolution.

2. MATERIALS AND METHODS

2.1. Equipment Description and Intra-Instrument Variability Adjustment

The internal optical particle sensor of the AirBeam (dimensions: 105 × 95 × 43.5 mm; weight: 198 g) is the PPD60PV-T2 (detectable particle range: 0 to 400 μg/m3; detectable particle size: 0.5–2.5 μm) from Shinyei Technology Co. LTD. (Kyoto, Japan). The AirBeam is connected to an Android OS smartphone running the AirCasting application (aircasting.org). Supplemental Figure 1 depicts the AirBeam, its specifications, and the Android AirCasting app. This mobile system is capable of continuous measurement (at programmable intervals as short as 1 second) and mapping (by GPS and Google Maps). The platform code is open-source, and collected data can be shared and mapped via an online platform, ‘Aircasting’ (www.aircasting.org/map).

To adjust for potential intra-instrument variability and to convert particle counts to PM2.5 mass equivalents, the AirBeam units were collocated with a DataRAM pDR-1500 (Thermo Scientific, Franklin, MA) within a concentrated air particle (CAP) system in Sterling Forest, New York. The system draws in and concentrates ambient air through a cyclone inlet that first removes most of the particles larger than 2.5 μm in aerodynamic diameter. The cyclone outflow is passed over a warm water bath and is then rapidly cooled in the condenser, resulting in supersaturation and particle growth (Maciejczyk et al., 2005). The pDR-1500 was initially calibrated with ambient particles and the internal gravimetric filter and pump system at a flow rate of 1.5 L/min in the CAP chamber. The individual AirBeam units were then calibrated against the pDR-1500: each unit was placed within the CAP chamber together with the pDR-1500 and tested for approximately 3 to 4 hours per day over 2 to 3 days, and a separate linear regression model was fit for each unit.
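
The per-unit calibration step can be illustrated with a short sketch. It assumes a simple ordinary least squares fit of reference mass on AirBeam counts for each unit; the function name, unit labels, and collocation values below are hypothetical and are not the study's data or code (the authors worked in R).

```python
from statistics import mean

def fit_calibration(counts, mass):
    """Ordinary least squares fit of reference mass (y) on sensor counts (x)."""
    x_bar, y_bar = mean(counts), mean(mass)
    sxx = sum((x - x_bar) ** 2 for x in counts)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(counts, mass))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical collocation data: (AirBeam counts, pDR-1500 mass in ug/m3)
collocation = {
    "unit_1": ([5000, 10000, 20000, 30000], [0.0, 10.0, 30.0, 50.0]),
    "unit_2": ([4000, 9000, 14000, 24000], [2.0, 12.0, 22.0, 42.0]),
}

# One separate model per unit, so unit-to-unit differences are absorbed
# into each unit's own intercept and slope.
calibrations = {
    unit: fit_calibration(counts, mass)
    for unit, (counts, mass) in collocation.items()
}
```

Fitting each unit separately, rather than pooling all units into one model, is what allows differing sensor responses to be corrected before the field measurements are combined.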

2.2. Sampling Location and Protocol

Seoul, the capital of South Korea and the 5th most populous metropolitan area in the world, experiences some of the highest air pollution concentration levels among cities in developed countries. The city is characterized by extremely high urban density, an abundance of high-rise buildings and apartments, and mountainous terrain. This study was carried out in the southern part of Seoul, south of the Han River, in three districts: Dongjak-gu (area=16.35 km2; population density=24,000/km2), Seocho-gu (area=47.14 km2; density=8,300/km2), and Gwanak-gu (area=29.57 km2; density=18,000/km2). The sampling campaign was conducted during an approximately three-week period (July 23rd to August 11th) in the summer of 2015, on weekdays only (12 days total, on non-rainy days) during three different time periods: morning (8–10am), evening (6–8pm), and night (9–11pm). Ten volunteers sharing 7 AirBeam units were instructed to repeatedly sample the five routes without predetermined beginning/ending locations and times.

The five routes (Figure 1), four of which were based near or around government-run regulatory monitors, were designed to span various neighborhoods and to obtain spatial coverage of a wide range of types of geographical variables, such as major roads and highways, green spaces, and both low and high density residential areas. Route A is located in Sillim; the neighborhood is largely residential with low-rise buildings and houses. Route B is in Sadang, which is also mainly residential with a large park and three major roads that surround the neighborhood. Route C is in Seocho, where the central bus transport terminal for Seoul is located, as well as the main city highway, a riverside park, and high-rise apartment buildings. Route D is located at Isu, where major highways and high-density residential areas are present. Route E is located near Seoul National University, a large university campus located at the base of a mountain; the area is hilly and tree-covered, and has a relatively low volume of traffic, mainly consisting of buses used for student transport. The lengths of the routes ranged from 3.9km to 4.9km, and the total sum length of all the routes was 21.5km.

Figure 1.

(a) Locations of the five sampling routes in Seoul and government-run, fixed-site monitors (blue markers). Mean PM2.5 concentration levels (μg/m3) during the sampling period at each of the 100m segments are also depicted. The red arrows point to an underground roadway, which was not included in analyses. We also present close-up views of route C as an example to depict sampled data points, with (b) OpenStreetMap and (c) satellite backgrounds.

2.3. Data Source for Land Use Predictors

Geospatial data for the city of Seoul, South Korea were downloaded from OpenStreetMap (OSM), a freely available, crowd-sourced and user-generated online mapping system. The dataset included more than 60 variables, grouped by the following categories: roads (cycleway, footway, living, path, pedestrian, residential, primary, secondary, road, secondary link, service, steps, subway, tertiary, trunk, trunk link, unclassified); land use (cemetery, farm, footway, forest, garden, golf, grass, hospital, island, park, parking, pitch, place of worship, playground, residential, school, sports center, substation, university, wood); buildings (apartments, cathedral, church, commercial, hospital, hotel, house, public, residential, retail, school, university, identified/unidentified); public amenities (fire station, fuel station, hospital, library, police, school, town hall); transportation points (bus stop, motorway junction, station, subway entrance); and water areas and waterways (stream, river, riverbank, water). Several variables appear in more than one category because they describe the same land use morphology (e.g. “university”, which is counted under land use, buildings, and public amenities); all of these were initially included in the analysis. After removing the subway variable (as it describes underground paths), there were 67 predictor variables available for analysis (Supplemental Table 1).

2.4. Data Reduction

As data were collected at 1-second intervals, the data points were first aggregated into 1-minute averages to match the pDR-1500 sampling frequency and to reduce data noise. Measurement points with obvious GPS errors (e.g. located in the middle of rivers) and sampling errors (e.g. the volunteer did not follow the sampling route properly) were removed by restricting data points to those <50 m away from the routes and by manual removal after visual inspection. We then employed a “snapping” procedure to assign the collected data points to the nearest route segment on the basis of measured GPS coordinates, allowing measurements along the same segment to be analyzed as a group, as per previous mobile LUR studies (Hankey and Marshall, 2015). Segments were defined by length from a starting point along a route. Buffers with different radii were then drawn around the centroids of the route segments, and geospatial data from OSM within the buffers were extracted. Each road segment was thereby associated with land use, built, and natural environment variables, calculated from the different OSM variables within buffers of different sizes. We calculated road segments at 5 different lengths (25 m, 50 m, 100 m, 150 m, 250 m) and 5 buffer radii (50 m, 100 m, 150 m, 350 m, 500 m) in order to build the LUR models as well as to assess how these parameters influence LUR model performance.
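
The distance filter and snapping step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: coordinates are assumed to be already projected to planar metres, the route is a simple polyline, and the function names and sample values are hypothetical.

```python
import math
from collections import defaultdict

def snap_to_route(point, route):
    """Return (distance along route, offset from route) for the closest route point."""
    best_offset, best_along, travelled = float("inf"), 0.0, 0.0
    for (x1, y1), (x2, y2) in zip(route, route[1:]):
        dx, dy = x2 - x1, y2 - y1
        length = math.hypot(dx, dy)
        # Projection of the point onto this edge, clamped to the edge ends
        t = ((point[0] - x1) * dx + (point[1] - y1) * dy) / length ** 2
        t = max(0.0, min(1.0, t))
        offset = math.hypot(point[0] - (x1 + t * dx), point[1] - (y1 + t * dy))
        if offset < best_offset:
            best_offset, best_along = offset, travelled + t * length
        travelled += length
    return best_along, best_offset

def segment_means(samples, route, segment_length=100.0, max_offset=50.0):
    """Average PM2.5 per route segment, dropping points too far from the route."""
    by_segment = defaultdict(list)
    for x, y, pm in samples:
        along, offset = snap_to_route((x, y), route)
        if offset < max_offset:
            by_segment[int(along // segment_length)].append(pm)
    return {seg: sum(v) / len(v) for seg, v in sorted(by_segment.items())}

# Hypothetical L-shaped route (metres) and 1-minute samples (x, y, PM2.5)
route = [(0.0, 0.0), (250.0, 0.0), (250.0, 150.0)]
samples = [(20.0, 5.0, 40.0), (80.0, -10.0, 50.0), (130.0, 3.0, 60.0),
           (252.0, 60.0, 30.0), (120.0, 90.0, 99.0)]
means = segment_means(samples, route)
```

In this toy example the last sample lies 90 m from the route and is discarded, while the remaining four are grouped into 100 m segments by their distance from the route's starting point.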

2.5. Adjustment for Background Temporal Trends

Previous mobile sampling investigations adjusted for potential temporal bias through several approaches; for example, Tessum et al. (2017) adjusted for between-day temporal trends by subtracting the daily fifth percentile from all measured concentration values on a given day. Deville Cavellin et al. (2016) used linear and quadratic terms for temperature as independent variables in the model as adjustment for potential temporal variability. We modified an approach applied by multiple studies (Larson et al., 2009; Dons et al., 2012; Clougherty et al., 2013; Van den Bossche et al., 2015; Apte et al., 2017) that used background concentration levels from a nearby regulatory monitor to adjust for temporal trends and normalize measured values. Leveraging the available information on background PM2.5 concentrations from multiple fixed-site regulatory monitors near the sampling routes, we adjusted each 1-minute averaged AirBeam measurement by applying a multiplicative hourly factor derived from the nearby regulatory monitor, defined as the ratio of the monitor's mean concentration level over the entire sampling period to its concentration level during the hour in which the measurement was taken. For route E, which was not designed around a regulatory monitor, we used averaged values from the two nearby monitors (approx. 2–4 km away) located by routes A and B. This resulted in 6 factors (one per sampled hour) for each sampling day for each of the 5 routes. Using multiple nearby monitors, instead of a single monitor as done in past studies, allowed for variable temporal adjustments across several locations. This approach minimizes the effect of day-to-day variations in background air quality on the measurements, thereby decreasing the amount of required sampling data (Van den Bossche et al., 2015).
Hourly measurements from regulatory monitors in Seoul revealed considerable temporal variability during the study period, with hourly PM2.5 levels as low as 5 μg/m3 and reaching 67 μg/m3 during pollution episodes (Figure 2).
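
The multiplicative hourly adjustment can be sketched in a few lines. The sketch assumes the campaign mean is taken over the monitor's values for the sampled hours, which the text does not specify; the function names, dates, and concentrations are hypothetical.

```python
from statistics import mean

def hourly_factors(monitor_hourly):
    """Map (day, hour) -> campaign-mean / monitor value for that hour."""
    campaign_mean = mean(monitor_hourly.values())
    return {key: campaign_mean / value for key, value in monitor_hourly.items()}

def adjust(measurements, factors):
    """Scale each 1-minute measurement by the factor for its day and hour."""
    return [(day, hour, pm * factors[(day, hour)]) for day, hour, pm in measurements]

# Hypothetical monitor readings (ug/m3) for three sampled hours
monitor = {
    ("2015-07-23", 8): 20.0,
    ("2015-07-23", 9): 40.0,
    ("2015-07-24", 8): 30.0,
}
factors = hourly_factors(monitor)

# Hypothetical 1-minute AirBeam averages: (day, hour, PM2.5 in ug/m3)
airbeam = [("2015-07-23", 8, 30.0), ("2015-07-23", 9, 60.0), ("2015-07-24", 8, 50.0)]
adjusted = adjust(airbeam, factors)
```

Measurements taken during hours with below-average background are scaled up and those taken during above-average hours are scaled down, so that segments visited at different times become more comparable.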

Figure 2.

Hourly (at 8am, 9am, 6pm, 7pm, 9pm, 10pm) PM2.5 concentration levels during the sampling period (7/23/15 to 8/10/15) at the four regulatory background monitors

2.6. LUR Model Building

We first tested effects of spatial aggregation by different route segment lengths and buffer sizes in the linear regression model by including all available 67 variables into a linear regression model, and we selected 100m route segments to spatially aggregate the collected data points based on the high adj-R2, resulting in 215 available segments for subsequent analyses. We then applied and compared three statistical approaches for building the LUR model: linear regression (LR), random forest (RF), and stacked ensemble (SE).

In the linear regression model, the GIS variables were retained for multivariable models based on a distance-decay regression selection strategy (ADDRESS) to screen and select informative candidate variables and the corresponding buffer size from all of the available potential variables (Su et al., 2015). We then applied a supervised forward search approach, adding the variables one at a time to the LR model and keeping a variable only if it increased the R2 of the model by at least 1.0% and if all predictor variables had statistically significant coefficients (p<0.05) (Van den Bossche et al., 2018). We also applied the random forest (RF) model, first removing highly correlated variables (absolute correlation>0.8). Random forests, in brief, are ensembles of decision trees in which each tree is constructed using the best split at each node among a randomly chosen subset of predictors. Random search, which randomly chooses a combination of hyperparameters at every iteration, was used to tune and optimize the model (Bergstra and Bengio, 2012). Finally, we employed the stacked ensemble (SE) model, a machine learning ensemble approach that involves training a learning algorithm to combine the predictions of several other learning algorithms: first, all of the other algorithms are trained using the available data; then a ‘meta-classifier’ algorithm (chosen from the list of algorithms) is trained to make a final prediction using the predictions of the other algorithms as additional inputs. We evaluated and selected a diverse group of machine learning algorithms, including random forest (‘rf’), Bayesian generalized linear model (‘bayesglm’), k-nearest neighbors (‘knn’), recursive partitioning and regression trees (‘rpart’), and partitioning using deletion, substitution, and addition moves (‘partDSA’).
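
The stacking idea can be illustrated with a deliberately small sketch. It is not the authors' R workflow: it uses only two toy base learners (a straight-line fit and k-nearest neighbors on a single predictor), swaps in a least-squares weighting as the meta-learner where the study used random forest, and assumes the meta-learner is trained on out-of-fold base predictions. All names and data values are hypothetical.

```python
def fit_linear(xs, ys):
    """Base learner 1: simple linear regression on one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return lambda x: y_bar + slope * (x - x_bar)

def fit_knn(xs, ys, k=2):
    """Base learner 2: mean of the k nearest training responses."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def out_of_fold(fit, xs, ys, n_folds):
    """Predict each observation from a model that never saw it."""
    preds = [0.0] * len(xs)
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), n_folds):
            preds[i] = model(xs[i])
    return preds

def fit_meta(z1, z2, ys):
    """Meta-learner: least-squares weights (no intercept) via 2x2 normal equations."""
    a11 = sum(a * a for a in z1)
    a12 = sum(a * b for a, b in zip(z1, z2))
    a22 = sum(b * b for b in z2)
    b1 = sum(a * y for a, y in zip(z1, ys))
    b2 = sum(b * y for b, y in zip(z2, ys))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det

def fit_stack(xs, ys, n_folds=3):
    bases = [fit_linear, fit_knn]
    z1, z2 = (out_of_fold(f, xs, ys, n_folds) for f in bases)
    w1, w2 = fit_meta(z1, z2, ys)
    m1, m2 = (f(xs, ys) for f in bases)  # refit base learners on all data
    return (lambda x: w1 * m1(x) + w2 * m2(x)), (w1, w2)

# Hypothetical segment-level data: one predictor and mean PM2.5 (ug/m3)
xs = [float(i) for i in range(9)]
ys = [40.0, 42.0, 47.0, 55.0, 52.0, 49.0, 45.0, 44.0, 41.0]
stack, weights = fit_stack(xs, ys)
```

Training the meta-learner on out-of-fold predictions, rather than in-sample fits, keeps it from simply rewarding whichever base learner overfits the training data most.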

We applied 10-fold cross validation (with 500 repeats) to calculate mean CV-R2 (cross-validation R2; 1-(mean square error/variance)) and root mean square errors (RMSE; a measure of the differences between values predicted by a model and the values observed) for the three methods to quantify their accuracy. We used packages ‘ggplot2’ and ‘leaflet’ for visualization and ‘caret’ for statistical analyses in R (version 3.4.4).
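
The two performance metrics and the repeated 10-fold splitting can be written out as a short sketch. It assumes the variance in the CV-R2 definition is the population variance of the observed values and leaves open whether metrics are pooled or averaged across folds, neither of which the text specifies; the function names and example values are hypothetical.

```python
import math
import random
from statistics import mean, pvariance

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(mean((o - p) ** 2 for o, p in zip(observed, predicted)))

def cv_r2(observed, predicted):
    """CV-R2 = 1 - (mean square error / variance of the observations)."""
    mse = mean((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - mse / pvariance(observed)

def repeated_kfold_indices(n, n_folds=10, repeats=500, seed=0):
    """Yield (train, test) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        for fold in range(n_folds):
            test = order[fold::n_folds]
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test

# Hypothetical held-out observations and predictions (ug/m3)
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
example_r2 = cv_r2(observed, predicted)
example_rmse = rmse(observed, predicted)
```

With 215 segments, 10 folds, and 500 repeats, the generator yields 5,000 train/test splits, each holding out roughly one tenth of the segments.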

3. RESULTS

3.1. Adjustment for Intra-Instrument Variability

We fit a univariate linear regression model for each deployed AirBeam unit in order to adjust for intra-unit variability and to convert particle counts to PM2.5 mass concentrations. During the collocated sessions with the DataRAM pDR-1500 in the CAP chamber, the PM2.5 concentration (as measured by the pDR-1500) ranged from 0 to 81 μg/m3. The AirBeams showed strong agreement with the pDR-1500 (adj-R2=0.95–0.98), with noticeable differences in response between the individual units (Figure 3). The regression models’ intercepts, slopes, and RMSE values varied across the units; detailed statistical summaries of the models are presented in Table 1.

Figure 3.

DataRAM pDR-1500 (mass; μg/m3) vs. 1-minute averaged AirBeam (hundreds of particles per cubic foot; hppcf) measurements in the concentrated air particle (CAP) chamber

Table 1.

Linear regression equations to convert particle counts to mass for each AirBeam unit

Unit Name   Intercept (Standard Error)   Slope (Standard Error)       RMSE    Adj-R2
99B         −10.72 (0.29)                0.002616 (1.77 × 10^−5)      23.48   0.95
B7E         −11.69 (0.44)                0.001974 (1.41 × 10^−5)      14.84   0.98
B99         −13.16 (0.42)                0.002102 (1.47 × 10^−5)      15.50   0.98
C54         −6.68 (0.18)                 0.001905 (1.35 × 10^−5)      8.92    0.96
C58         −4.91 (0.16)                 0.002000 (1.05 × 10^−5)      9.16    0.98
C72         −9.94 (0.34)                 0.002537 (3.17 × 10^−5)      10.50   0.95
D46         −11.26 (0.37)                0.002049 (1.64 × 10^−5)      11.39   0.98
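
As a worked illustration of how the equations in Table 1 are applied, the sketch below converts a single particle count reading to a mass equivalent for each unit. The intercepts and slopes are copied from Table 1; the function name and the example reading of 10,000 hppcf are hypothetical, and whether negative results at very low counts were truncated is not stated in the text, so no truncation is applied here.

```python
# (intercept, slope) per AirBeam unit, copied from Table 1
TABLE_1 = {
    "99B": (-10.72, 0.002616),
    "B7E": (-11.69, 0.001974),
    "B99": (-13.16, 0.002102),
    "C54": (-6.68, 0.001905),
    "C58": (-4.91, 0.002000),
    "C72": (-9.94, 0.002537),
    "D46": (-11.26, 0.002049),
}

def counts_to_mass(unit, hppcf):
    """PM2.5 mass equivalent (ug/m3) = intercept + slope * particle count (hppcf)."""
    intercept, slope = TABLE_1[unit]
    return intercept + slope * hppcf

# The same hypothetical reading converted with each unit's own equation
reading = 10000
converted = {unit: counts_to_mass(unit, reading) for unit in TABLE_1}
```

For this reading, the converted values differ by several μg/m3 between units, which illustrates why a single shared conversion equation would introduce unit-specific bias into the pooled data.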

3.2. Mobile Sampling Summary Statistics

The mobile sampling campaign yielded a total of 10871 minutes of data. After removing GPS and sampling errors, 10177 minutes (93.6%) of data remained, equaling more than 169 hours of total data across the 5 sampled routes (Table 2, Supplemental Tables 2 & 3). 1992 minutes (33.2 hours) of sampling data were collected at Route A; 2449 minutes (40.8 hours) at Route B; 2313 minutes (38.6 hours) at Route C; 1970 minutes (32.8 hours) at Route D; and 1453 minutes (24.2 hours) at Route E. Route D, which is located near major roads and highways, had the highest concentration levels (55.5 ± 27.7 μg/m3), while Route B (42.0 ± 24.2 μg/m3) and Route E (48.4 ± 31.3 μg/m3) had the lowest concentration levels. Notable differences between morning, evening, and night were also observed across the five routes, especially for Route D, which had elevated levels during morning (70.7 ± 25.5 μg/m3) compared to evening (46.6 ± 28.3 μg/m3) and night (54.8 ± 24.1 μg/m3). The amount of sampling data varied across the 215 segments, with a median of 44 minutes per segment (minimum=5; 25th percentile=34; 75th percentile=55; maximum=179). Summary statistics for minutes of sampling per 100m segment for each of the five routes are visualized as boxplots in Figure 4.

Table 2.

Summary statistics for measurements across the five routes

Route ID  Route Name                 AirBeam Units Deployed  Total                       Morning                     Evening                     Night
                                                             Min    Mean (SD)    IQR     Min    Mean (SD)    IQR     Min    Mean (SD)    IQR     Min    Mean (SD)    IQR
A         Sillim                     B7E, B99, C54, D46      1992   51.3 (32.6)  56.3    901    43.3 (32.8)  40.4    462    58.6 (27.9)  49.6    629    57.4 (33.0)  47.4
B         Sadang                     99B, B7E, C58, D46      2449   42.0 (24.2)  36.5    744    40.5 (24.8)  34.2    858    38.6 (21.2)  38.1    847    46.7 (25.8)  30.3
C         Seocho                     99B, C58, C72           2313   49.9 (31.4)  50.3    574    47.9 (35.2)  69.2    892    46.7 (29.4)  43.3    847    54.5 (30.1)  38.1
D         Isu                        99B, C72, D46           1970   55.5 (27.7)  33.9    477    70.7 (25.5)  39.6    755    46.6 (28.3)  43.3    738    54.8 (24.1)  26.1
E         Seoul National University  B7E, B99, C54, D46      1453   48.4 (31.3)  48.4    396    56.8 (39.4)  74.3    528    47.4 (26.5)  23.2    529    43.0 (27.4)  54.0

Min = minutes sampled; Mean (SD) = average (standard deviation), μg/m3; IQR = interquartile range.

Figure 4.

Boxplot demonstrating distribution of minutes of sampling per 100m segment for each sampling route.

3.3. Model Results

The LUR models were sensitive to different segment lengths and buffer radiuses, with R2 generally increasing with larger buffer radiuses (Figure 5), while 100m to 150m segments for spatial aggregation performed the best. Fitting individual equations to account for intra-instrument variability for each AirBeam unit generally improved the accuracy of the constructed LUR models, with an increase in CV-R2 values by ~0.10–0.15.

Figure 5.

Adjusted R2 of LR LUR models (including all available 67 predictor variables) for mass, by segment length and buffer radius

In constructing the LR model, we screened and removed several point variables (e.g. fire stations) that were not frequently present across the sampling space but clustered near the pollution hotspots, as these variables ended up having very strong influences on the models. The final LR LUR model showed high goodness-of-fit with a CV-R2 of 0.63 and RMSE of 7.01, and the following variables were included in the model: wood, secondary link, residential road, cathedral, station, pitch, and apartments (Table 3). The machine learning approaches explained a greater proportion of the variance of PM2.5 concentrations than the LR model. The random forest model identified mostly different variables as important (wood, residential road, living street, school, park, apartments, residential, building, tertiary, and service) and also showed better performance metrics than the LR model, with a higher mean CV-R2 (0.73) and lower RMSE (6.20). The stacked ensemble model with random forest as the meta-predictor algorithm performed the best, outperforming both the LR and RF models with a higher CV-R2 (0.80) and lower RMSE (5.22). Individual R2 values for the algorithms in the ensemble were 0.74 for random forest, 0.45 for partDSA, 0.50 for rpart, 0.70 for bayesglm, and 0.69 for knn.

Table 3.

Selected LUR model predictor variables in the LR and RF models and associated statistics.

                                                          Linear Regression                           Random Forest*
Variable Name            Variable Type   Buffer Length   β               Std. Error     p-value      Importance
Intercept                                                50.02           1.21           <0.001
Wood                     Area            500m            −3.80 × 10^−5   4.25 × 10^−6   <0.001       14.45
Residential Road         Line            500m            2.59 × 10^−5    5.88 × 10^−6   <0.001       13.10
Secondary Link           Line            500m            6.88 × 10^−3    1.00 × 10^−3   <0.001
Cathedral                Point           500m            −2.47 × 10^−3   1.03 × 10^−3   0.02
Station                  Point           500m            −3.75           1.02           <0.001
Pitch                    Area            350m            −1.88 × 10^−4   4.28 × 10^−5   <0.001
Apartments               Point           500m            7.70 × 10^−5    4.02 × 10^−5   0.05         10.21
School                   Area            500m                                                        10.73
Living Street            Line            500m                                                        10.85
Park                     Area            500m                                                        10.31
Residential              Area            500m                                                        9.93
Building (Unclassified)  Area            500m                                                        9.72
Tertiary                 Line            350m                                                        9.70
Service                  Line            350m                                                        8.97

* Top ten variables by variable importance are shown in the table.

Adjusting for background temporal trends changed the overall morning average concentration levels from 49.4 to 59.2 μg/m3; evening from 46.4 to 45.7 μg/m3; and night from 51.5 to 47.3 μg/m3. The changes in concentration levels after temporal adjustment during the three sampling periods differed significantly across the routes (Supplemental Table 4). This adjustment also improved the CV-R2 for the three approaches: without it, the CV-R2 values were lower, at 0.54, 0.65, and 0.71 for the LR, RF, and SE models, respectively. The constructed LUR models were used to create prediction maps of street-level PM2.5 concentration levels in Seoul near the sampled locations, which revealed several ‘hotspots’ with elevated PM2.5 levels (Figure 6). The prediction maps showed similar spatial patterns across the three modeling approaches and highlighted similar locations as hotspots, especially locations with major roads/highways and high population density. Conversely, the lowest concentrations were predicted at greenspace locations, such as parks and mountains. The three approaches resulted in relatively similar mean predicted values across the exposure surface, at 47.31, 48.86, and 49.43 μg/m3 for LR, RF, and SE, respectively. However, the LR prediction map predicted lower values than the machine learning approaches at the extremes (range: 26.36–68.96 μg/m3), while the maps for the RF (34.97–71.43 μg/m3) and especially SE (33.50–83.19 μg/m3) models resulted in higher predicted values.

Figure 6.

PM2.5 prediction maps near the sampled areas, constructed by applying (a) linear regression, (b) random forest, and (c) stacked ensemble approaches

4. DISCUSSION

In this study, we conducted a mobile sampling campaign in Seoul, South Korea deploying low-cost smartphone-based air quality sensors and utilized the collected data to construct LUR models employing three statistical approaches. The resulting R2 values were comparable to those of recent, similar studies across multiple locations around the world that utilized more advanced equipment. Our study is unique for developing LUR models using multiple low-cost (<$300), mobile sensors; priced at $250 per unit, AirBeams are one or more orders of magnitude less expensive than commercially available portable instruments (in the thousands; the pDR-1500 used in this study cost ~$5,700) and federal standard instruments (in the tens of thousands). AirBeam and its operating platform, Aircasting, are also notable for being primarily developed for citizen science, whereby users can upload their measurements to share with the public, as well as for being open-source, allowing developers and researchers to program and customize the instruments and the smartphone app according to their needs and requirements. Many similarly priced ($200-$300) sensors have entered the market since the present study was conducted, underlining the public’s increasing interest in the capability to measure personalized real-time exposure data (Caplin et al., 2019). Through deployment of such low-cost sensors, we were able to characterize the spatial variability of street-level PM2.5 in Seoul, the main source of which is likely to be traffic given the near-road sampling approach applied in this study. Past source apportionment studies also identified the primary sources of PM2.5 in Seoul as motor vehicle emissions and road dust (Heo et al., 2008; Ryou et al., 2018).

Recent mobile sampling approaches for LUR model building have employed a variety of study designs and instruments. For example, Hankey and Marshall (2015) collected over 85 hours of data on a bicycle-based sampling platform in Minneapolis, MN and constructed LUR models for particle size, black carbon, and PM2.5 with modest goodness-of-fit (adj-R2 of ~0.5 for particle number and ~0.4 for PM2.5). Apte et al. (2017) analyzed data collected from a Google Street View mapping vehicle equipped with air quality sensors that repeatedly sampled every street in a 30-km2 area of Oakland, CA, to model and reveal urban air pollution patterns at 4–5 orders of magnitude greater spatial precision than possible with current central-site ambient monitoring. The ‘OpenSense’ project in Zurich, Switzerland (Hasenfratz et al. 2015) utilized mobile sensor nodes installed on top of public transport tram vehicles in the city to create high-resolution pollution prediction maps for ultrafine particles and particle counts. Vehicle-based mobile measurements were also applied to create LUR models to estimate the spatial variation of street-level PM2.5 and PM10 in the downtown area of Hong Kong (Shi et al., 2016), and integration of urban/building morphology as independent variables increased the adj-R2 of the LUR model, suggesting that incorporating detailed 3D characteristics of the land use can improve the predictive power of such models.

Our study and sampling design highlight the potential advantages of mobile sampling with low-cost and portable air quality sensors in constructing the LUR models. The aforementioned studies were largely based on sampling campaigns conducted on modes of transport (e.g. cars) visiting a single location at a given time, which may potentially result in a low number of visits per location. The results from this and past studies found that mobile LUR models are highly sensitive to parameters such as the number of route segments, radiuses of buffers, and number of measurements per segment (Minet et al., 2017). Hatzopoulou et al. (2017) evaluated the influence of the number of sampling locations and durations of sampling on LUR model performance, noting that mobile sampling campaigns can be inefficient due to low sampling frequency at a large number of locations, and that spatial variability may be more important than the numbers of locations when designing sampling routes. The authors also found that the LUR models became relatively robust after 150–200 segments and 10–12 visits per segment. In the present study, walking at a slow speed, instead of on mechanical modes of transportation, resulted in each route generally having a high number of data points (median=44) per segment. This approach also allows for assessing personal-level exposure in urban areas where there are a larger number of people on the streets than in cars. The disadvantage of shorter distances being covered when sampling on foot was offset by the low cost and portability of AirBeams, which allowed for several units that could be deployed simultaneously across multiple locations at a given time and thereby maximize spatial coverage, as opposed to the majority of past mobile sampling studies that were carried out on a single platform. 
Simultaneous measurements within a structured sampling design could decrease the amount of collected data (and manpower) required to construct robust models, whereas participatory sensing where sampling is done ‘opportunistically’ could lead to unstructured data that is more difficult to interpret (Van den Bossche et al., 2016). Furthermore, AirBeam’s ease of operation meant that minimal training (a few minutes at most) was required prior to field deployment, resulting in a relatively large volume of data being generated within the short sampling campaign period during this study.

This study leveraged OpenStreetMap (OSM), an openly available and crowd-sourced GIS dataset, which provided a rich and comprehensive source of geospatial data for a wide range of LUR variables. OSM and other ‘open data’ sources offer underexplored but valuable information for data-driven methods to predict air pollution levels (VoPham et al., 2018). Notably, the OSM GIS variables were highly developed for Seoul and provided detailed and differentiated data for the numerous types of roads and buildings, which are land use categories that usually provide the highest predictive power for air pollution LUR models. Another advantage of crowd-sourced data is that it is continually updated. For example, using an earlier download of OSM from September 2015 (versus January 2018 in this analysis), with a less developed characterization of Seoul, resulted in an LUR model with a lower CV-R2 (0.55 for LR). This suggests that in locations lacking geospatial data, crowd-sourced efforts to generate the relevant GIS variables could be carried out in concert with the air pollution sampling campaign to strengthen the predictive capability of LUR models. Despite recent endeavors by agencies and organizations throughout the world to democratize data as part of the ‘open data’ movement, many detailed GIS files remain proprietary and thereby cost-prohibitive, and freely available data like OSM offer an alternative and important source of detailed spatial data for researchers and communities.
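
To illustrate how OSM features can become LUR predictors, the sketch below counts point features of each type within circular buffers around a sampling location. It is a simplification under stated assumptions: this section does not specify how the study's variables were computed, and real LUR variables often use road lengths or building areas rather than point counts. The feature types, coordinates, and buffer radii are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def buffer_counts(center, features, radii_m):
    """Count features of each type within each buffer radius of `center`.

    center: (lat, lon); features: list of (feature_type, lat, lon).
    Returns {(feature_type, radius_m): count}, including zero counts.
    """
    counts = {}
    for ftype, lat, lon in features:
        d = haversine_m(center[0], center[1], lat, lon)
        for r in radii_m:
            key = (ftype, r)
            counts.setdefault(key, 0)
            if d <= r:
                counts[key] += 1
    return counts


# Hypothetical sampling point in central Seoul and two made-up features.
center = (37.5665, 126.9780)
features = [
    ("bus_stop", 37.5665, 126.9790),    # roughly 88 m east of the centre
    ("restaurant", 37.5695, 126.9780),  # roughly 334 m north of the centre
]
counts = buffer_counts(center, features, radii_m=[50, 100, 500])
```

Each `(feature_type, radius_m)` key would correspond to one candidate predictor column, which also shows why such datasets quickly become wide and highly correlated.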

Machine learning methods offered improved goodness-of-fit compared to traditional stepwise linear regression in constructing the LUR models. Prior work on machine learning applications in both national (Hu et al., 2017; Di et al., 2016) and local-level (Adams and Kanaroglou, 2016; Weichenthal et al., 2016; Brokamp et al., 2017) predictions of air pollution concentration levels highlights the advantages associated with the approach, including higher accuracy and identification of important variables. A recent example underlines additional potential benefits: a study in Los Angeles, USA used a multi-step and flexible spatial data mining approach based on machine learning to select the most important OSM geographic features and predict PM2.5 concentrations, removing the need for a priori selection of predictors for exposure modeling (Lin et al., 2018). Similarly, in our analysis, applying the traditional stepwise linear regression LUR approach to the highly correlated OSM dataset, which also contained several highly influential variables, required manual screening and removal of predictor variables prior to input and during the model building process. Notably, the stacked ensemble model combining multiple machine learning algorithms outperformed both LR and RF in this study. In recent years, ensemble machine learning methods have emerged as an important tool for modeling complex relationships and have been applied successfully in various research areas (Yang et al., 2010). Application of ensembles has generally been limited in air pollution exposure assessment and modeling efforts to date, and the results here suggest that ensemble-based approaches could further enhance the predictive performance of LUR models.
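
The idea behind stacking can be sketched in a few lines. The example below is not the ensemble used in this study: the base learners (a one-predictor linear regression and a k-nearest-neighbour average), the convex-weight meta-learner, and the toy data are all assumptions chosen to keep the sketch dependency-free. It shows the core mechanism only: base learners produce out-of-fold predictions, and a meta-learner is fitted on those predictions.

```python
def fit_linear(xs, ys):
    """Ordinary least squares with one predictor; returns a predict function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour average; returns a predict function."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def out_of_fold(fit, xs, ys, n_folds=5):
    """Predict each point with a model trained on the other folds."""
    preds = [0.0] * len(xs)
    for f in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != f]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(f, len(xs), n_folds):
            preds[i] = model(xs[i])
    return preds


def r_squared(ys, preds):
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def stack_weight(ys, p1, p2):
    """Meta-learner: weight w in [0, 1] minimizing squared error of w*p1 + (1-w)*p2."""
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    if den == 0:
        return 0.5  # base learners agree everywhere; any weight is equivalent
    w = sum((y - b) * (a - b) for y, a, b in zip(ys, p1, p2)) / den
    return min(1.0, max(0.0, w))


# Toy data (hypothetical): a linear trend plus a localized "hotspot" bump.
xs = [i / 2 for i in range(30)]
ys = [2.0 + 0.8 * x + (3.0 if 5 <= x <= 8 else 0.0) for x in xs]

p_lr = out_of_fold(fit_linear, xs, ys)
p_knn = out_of_fold(fit_knn, xs, ys)
w = stack_weight(ys, p_lr, p_knn)
p_stack = [w * a + (1 - w) * b for a, b in zip(p_lr, p_knn)]
```

Note that `r_squared(ys, p_stack)` is computed on the same out-of-fold predictions used to fit the weight, so it is somewhat optimistic; an unbiased comparison would nest the weight fitting inside an outer cross-validation loop.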

We note several potential weaknesses in this study. As we evaluated the AirBeam units in a carefully controlled experimental chamber drawing in air from a forested and rural area (Tuxedo, New York), the particle composition and the environmental conditions (e.g., humidity and temperature) encountered during the experiment are likely to be significantly different from those in the heavily urban location where this study was carried out. Although the potential impacts of these factors were not assessed in this study, previous performance evaluations of AirBeams in various laboratory and field settings offer insight. The initial manufacturer calibration was conducted in a similarly urban setting (New York City) and revealed high correlations with both gravimetric sampling and pDR-1500 (takingspace.org). Comparison against federal equivalent method monitors showed high agreement with GRIMM (R2~0.6–0.8) (Mukherjee et al., 2017; SCAQMD 2017; Feinberg et al., 2018), but mixed results were observed with BAM (R2~0.2–0.7) (Jiao et al., 2016; SCAQMD 2017). A study of sensor responses to Arizona road dust, salt, and welding fumes (Sousan et al., 2017) demonstrated that particle types had significant impacts on measurements from AirBeam and other low-cost sensors. Relative humidity (RH) levels also influenced the measurements; a laboratory evaluation found that bias was observed when both RH levels (>65%) and concentration levels (>100 μg/m3) were elevated (SCAQMD 2017), while another study (Feinberg et al., 2018) found that particle count measurements were affected by higher humidity levels in a field setting. Highly humid summers in Korea would likely influence the absolute measurement values, but the potential impact on prediction model performance is likely to be minimal, as humidity levels are likely to vary little across a city.
Nevertheless, these findings emphasize the need to consider the potential influence of environmental factors in sensor deployments, and performance evaluations at the study location are suggested for similar studies applying low-cost sensors. In addition, the particle concentration levels encountered during sampling in Seoul were higher than the range used for constructing calibration equations for the AirBeam units, so applying those equations required extrapolation that may not capture potential nonlinearity in sensor responses. We also did not check for potential sensor drift – a common issue for low-cost air quality sensors – during and after the mobile sampling, although substantial drift is unlikely given the relatively short sampling period. These issues may have contributed to predicted values that were significantly higher than observed values from nearby fixed-site monitors, although it is also possible that such differences arise because fixed-site monitors are often located well above ground and tend to underestimate personal exposures when walking near traffic (Deville Cavellin et al., 2016). Another potential weakness is that OSM data quality and density could be uneven across locations, as some areas could be characterized in more detail than others. For example, in some of the sampled areas in this study, several of the houses in residential areas were not captured in the OSM file, which could have influenced model quality; however, as OSM data coverage and quality continue to improve, this should become less of an issue over time.
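
The calibration concerns above (extrapolating beyond the calibrated range and elevated humidity) can be made explicit when applying a calibration equation. The sketch below assumes a simple linear calibration against a collocated reference monitor; the study's actual calibration equations are not given in this section, and the function names, flag labels, and toy readings are hypothetical. The RH (>65%) and concentration (>100 μg/m3) thresholds are those reported in the laboratory evaluation cited above (SCAQMD 2017).

```python
def fit_calibration(sensor, reference):
    """Least-squares fit of reference = intercept + slope * sensor.

    Also records the range of sensor readings seen during calibration,
    so that later readings outside that range can be flagged.
    """
    n = len(sensor)
    mx, my = sum(sensor) / n, sum(reference) / n
    sxx = sum((x - mx) ** 2 for x in sensor)
    slope = sum((x - mx) * (y - my) for x, y in zip(sensor, reference)) / sxx
    intercept = my - slope * mx
    return slope, intercept, (min(sensor), max(sensor))


def apply_calibration(cal, reading, rh_percent, rh_limit=65.0, conc_limit=100.0):
    """Return (calibrated value, list of quality flags) for one reading."""
    slope, intercept, (lo, hi) = cal
    value = intercept + slope * reading
    flags = []
    if not lo <= reading <= hi:
        # The linear fit is being extrapolated; nonlinearity is unchecked here.
        flags.append("outside_calibration_range")
    if rh_percent > rh_limit and value > conc_limit:
        # Conditions under which bias was reported in the cited evaluation.
        flags.append("high_rh_and_concentration")
    return value, flags


# Toy collocation data (hypothetical, not from the chamber experiment).
cal = fit_calibration([10.0, 20.0, 30.0, 40.0], [12.0, 22.0, 32.0, 42.0])
in_range = apply_calibration(cal, 30.0, rh_percent=70.0)
extrapolated = apply_calibration(cal, 120.0, rh_percent=70.0)
```

Flagged readings are kept rather than dropped in this sketch, so that their influence on a fitted LUR model can be examined in a sensitivity analysis.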

5. CONCLUSIONS

Low-cost sensors represent an opportunity to bridge the data gap, thereby promoting public discourse, influencing air pollution regulations, and protecting public health (Amegah 2018). This study highlights the advantages and potential of applying data collected from mobile sampling with multiple low-cost sensors to model and map street-level air pollution levels in urban locations, especially the capability to generate a large volume of sampling data with ease. The predictive power of the models developed here, despite deploying only a limited number of significantly less expensive, consumer-grade air quality sensors, was comparable to that of past mobile sampling LUR studies, especially after adjusting for intra-instrument variability and temporal trends. To minimize the potential influence of local particle characteristics and environmental conditions, calibration with collocated reference monitors at the sampling location is suggested for future projects using similar low-cost sensors. Such calibration would also allow particle counts to be converted to mass concentration, a unit of measurement that is more readily transferable to policy-relevant metrics. Initial calibrations should also carefully evaluate and adjust for the potential effects of relative humidity levels, which can have significant influences on readings from low-cost sensors. Overall, the findings here suggest that similar mobile sampling designs using low-cost sensors and ‘open data’ sources could be applied to generate a large volume of data and construct LUR models and maps with fine spatial granularity, and that machine learning methods can further improve model performance. Our study design and approach may be especially suitable for citizen science and community-based endeavors, or in locations without preexisting air monitoring networks, such as developing countries.

Supplementary Material

1

Highlights.

  • Mobile sampling with low-cost (<$300) air quality sensors could offer a highly cost-effective approach to characterize urban street-level air quality.

  • A mobile sampling campaign deploying multiple AirBeams across five routes was conducted during an approximately three week period in Seoul, South Korea.

  • Land use regression (LUR) models were constructed using the collected data and the OpenStreetMap (OSM) geospatial data.

  • Three approaches – linear regression, random forest, and stacked ensemble – were employed to construct the LUR models, with the stacked ensemble model having the highest predictive power.

Acknowledgement

This study was funded by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the South Korea Ministry of Education (2018R1A2B6004608), the NSF East Asia and Pacific Summer Institute (EAPSI) Fellowship, the Air & Waste Management Air Pollution Education and Research Grant (APERG), and EPA STAR Graduate Fellowship. This publication was developed under Assistance Agreement No. FP917825 awarded by the U.S. Environmental Protection Agency to Chris C. Lim. It has not been formally reviewed by EPA. The views expressed in this document are solely those of the authors and do not necessarily reflect those of the Agency. EPA does not endorse any products or commercial services mentioned in this publication.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  1. Adams MD, & Kanaroglou PS (2016). Mapping real-time air pollution health risk for environmental management: Combining mobile and stationary air pollution monitoring with neural network models. Journal of Environmental Management, 168, 133–141.
  2. Amegah AK (2018) ‘Proliferation of low-cost sensors. What prospects for air pollution epidemiologic research in Sub-Saharan Africa?’, 241, pp. 1132–1137.
  3. Apte JS, Messier KP, Gani S, Brauer M, Kirchstetter TW, Lunden MM, … Hamburg SP (2017). High-Resolution Air Pollution Mapping with Google Street View Cars: Exploiting Big Data. Environmental Science & Technology.
  4. Bergstra James, and Bengio Yoshua. “Random search for hyper-parameter optimization.” Journal of Machine Learning Research 13.Feb (2012): 281–305.
  5. Borghi F, Spinazzè A, Rovelli S, Campagnolo D, Del Buono L, Cattaneo A, & Cavallo DM (2017). Miniaturized Monitors for Assessment of Exposure to Air Pollutants: A Review.
  6. Brokamp C, Jandarov R, Rao MB, LeMasters G, & Ryan P (2017). Exposure assessment models for elemental components of particulate matter in an urban environment: A comparison of regression and random forest approaches. Atmospheric Environment, 151, 1–11.
  7. Caplin A, Ghandehari M, Lim C, Glimcher P, & Thurston G (2019). Advancing environmental exposure assessment science to benefit society. Nature communications, 10(1), 1236.
  8. Catlett CE, Beckman PH, Sankaran R, & Galvin KK (2017, April). Array of things: a scientific research instrument in the public way: platform design and early lessons learned. In Proceedings of the 2nd International Workshop on Science of Smart City Operations and Platforms Engineering (pp. 26–33). ACM.
  9. Castell N, Dauge FR, Schneider P, Vogt M, Lerner U, Fishbain B, … Bartonova A (2017). Can commercial low-cost sensor platforms contribute to air quality monitoring and exposure estimates? Environment International, 99, 293–302.
  10. Clougherty JE, Kheirbek I, Eisl HM, Ross Z, Pezeshki G, Gorczynski JE, Johnson S, Markowitz S, Kass D, Matte T, 2013. Intra-urban spatial variability in wintertime street-level concentrations of multiple combustion-related air pollutants: the New York City Community Air Survey (NYCCAS). J. Expo. Sci. Environ. Epidemiol 23, 232–240.
  11. Deville Cavellin L, Weichenthal S, Tack R, Ragettli MS, Smargiassi A, & Hatzopoulou M (2016). Investigating the Use Of Portable Air Pollution Sensors to Capture the Spatial Variability Of Traffic-Related Air Pollution. Environmental Science & Technology, 50(1), 313–320.
  12. Dons E, Int Panis L, Van Poppel M, Theunis J & Wets G Personal exposure to Black Carbon in transport microenvironments. Atmos. Environ 55, 392–398 (2012).
  13. English PB, Olmedo L, Bejarano E, Lugo H, Murillo E, Seto E, Wong M, King G, Wilkie A, Meltzer D and Carvlin G, 2017. The Imperial County Community Air Monitoring Network: a model for community-based environmental monitoring for public health action. Environmental health perspectives, 125(7).
  14. Feinberg S et al. Long-term evaluation of air sensor technology under ambient conditions in Denver, Colorado. Atmos. Meas. Tech 11, 4605–4615 (2018).
  15. Gao M, Cao J & Seto E A distributed network of low-cost continuous reading sensors to measure spatiotemporal variations of PM2.5 in Xi’an, China. Environ. Pollut. 199, 56–65 (2015).
  16. Hankey S, & Marshall JD (2015). Land Use Regression Models of On-road Particulate Air Pollution (Particle Number, Black Carbon, PM2.5, Particle Size) Using Mobile Monitoring. Environmental Science & Technology, 150702142726005.
  17. Hasenfratz D et al. Deriving high-resolution urban air pollution maps using mobile sensor nodes. Pervasive Mob. Comput 16, 268–285 (2015).
  18. Hatzopoulou M, Valois MF, Levy I, Mihele C, Lu G, Bagg S, … Brook J (2017). Robustness of Land-Use Regression Models Developed from Mobile Air Pollutant Measurements. Environmental Science and Technology, 51(7), 3938–3947.
  19. Heo JB, Hopke PK, Yi SM, 2008. Source apportionment of PM2.5 in Seoul, Korea. Atmos. Chem. Phys. Discuss 8, 20427–2046.
  20. Hu X, Belle JH, Meng X, Wildani A, Waller LA, & Strickland MJ (2017). Estimating PM2.5 Concentrations in the Conterminous United States Using the Random Forest Approach.
  21. Jerrett M et al. Validating novel air pollution sensors to improve exposure estimates for epidemiological analyses and citizen science. Environ. Res 158, 286–294 (2017).
  22. Jiao W et al. Community Air Sensor Network (CAIRSENSE) project: Evaluation of low-cost sensor performance in a suburban environment in the southeastern United States. Atmos. Meas. Tech. Discuss 1–24 (2016).
  23. Kelly KE, et al. “Ambient and Laboratory Evaluation of a Low-Cost Particulate Matter Sensor.” Environmental Pollution, vol. 221, Feb. 2017, pp. 491–500, doi: 10.1016/j.envpol.2016.12.039.
  24. Kumar P, Morawska L, Martani C, Biskos G, Neophytou M, Di Sabatino S, Bell M, Norford L and Britter R, 2015. The rise of low-cost sensing for managing air pollution in cities. Environment international, 75, pp. 199–205.
  25. Larson T, Gould T, Riley EA, Austin E, Fintzi J, Sheppard L, … Simpson C (2017). Ambient air quality measurements from a continuously moving mobile platform: Estimation of area-wide, fuel-based, mobile source emission factors using absolute principal component scores. Atmospheric Environment, 152, 201–211.
  26. Levy Zamora, Misti, et al. “Field and Laboratory Evaluations of the Low-Cost Plantower Particulate Matter Sensor.” Environmental Science and Technology, vol. 53, no. 2, American Chemical Society, 2019, pp. 838–49.
  27. Lin Y, Chiang Y-Y, Pan F, Stripelis D, Ambite JL, Eckel SP, Habre R. Mining public datasets for modeling intra-city PM2.5 concentrations at a fine spatial resolution. In: Proceedings of the 25th ACM SIGSPATIAL international conference on advances in geographic information systems. Los Angeles area, CA: ACM; 2017. p. 1–10.
  28. Maciejczyk P, Zhong M, Li Q, Xiong J, Nadziejko C, & Chen LC (2005). Effects of subchronic exposures to concentrated ambient particles (CAPs) in mice: II. The design of a CAPs exposure system for biometric telemetry monitoring. Inhalation toxicology, 17(4–5), 189–197.
  29. McKercher GR, Salmond JA & Vanos JK Characteristics and applications of small, portable gaseous air pollution monitors. Environ. Pollut (2017). doi: 10.1016/j.envpol.2016.12.045
  30. Minet L, Gehr R, & Hatzopoulou M (2017). Capturing the sensitivity of land-use regression models to short-term mobile monitoring campaigns using air pollution micro-sensors. Environmental Pollution, 230, 280–290.
  31. Morawska L et al. Applications of low-cost sensing technologies for air quality monitoring and exposure assessment: How far have they gone? Environ. Int 116, 286–299 (2018).
  32. Mukherjee Anondo, Stanton Levi, Graham Ashley, and Roberts Paul. “Assessing the Utility of Low-Cost Particulate Matter Sensors over a 12-Week Period in the Cuyama Valley of California.” Sensors 17, no. 8 (August 5, 2017): 1805.
  33. Ryou H, Heo J, & Kim SY (2018). Source apportionment of PM10 and PM2.5 air pollution, and possible impacts of study characteristics in South Korea. Environmental pollution, 240, 963–972.
  34. Schneider P, Castell N, Vogt M, Dauge FR, Lahoz WA, & Bartonova A (2017). Mapping urban air quality in near real-time using observations from low-cost sensors and model information. Environment International, 106 (December 2016), 234–247.
  35. Shi Y, Lau KKL, & Ng E (2016). Developing Street-Level PM2.5 and PM10 Land Use Regression Models in High-Density Hong Kong with Urban Morphological Factors. Environmental Science and Technology, 50(15), 8178–8187.
  36. Sousan S, Koehler K, Hallett L & Peters TM Evaluation of consumer monitors to measure particulate matter. J. Aerosol Sci. 107, 123–133 (2017).
  37. South Coast AQMD (2017). AirBeam Summary Report [online] Available at: http://www.aqmd.gov/aqspec/sensordetail/airbeam [Accessed 21 Jun. 2018].
  38. Su JG, Hopke PK, Tian Y, Baldwin N, Thurston SW, Evans K, & Rich DQ (2015). Modeling particulate matter concentrations measured through mobile monitoring in a deletion/substitution/addition approach. Atmospheric Environment, 122, 477–483.
  39. Takingspace.org. (2014). AirBeam Technical Specifications, Operation & Performance: Taking Space. [online] Available at: http://www.takingspace.org/airbeam-technical-specifications-operation-performance/ [Accessed 22 Jun. 2018].
  40. Tessum MW et al. Mobile and Fixed-Site Measurements To Identify Spatial Distributions of Traffic-Related Pollution Sources in Los Angeles. Environ. Sci. Technol. (2018). doi: 10.1021/acs.est.7b04889
  41. Van den Bossche J, Peters J, Verwaeren J, Botteldooren D, Theunis J, & De Baets B (2015). Mobile monitoring for mapping spatial variation in urban air quality: Development and validation of a methodology based on an extensive dataset. Atmospheric Environment, 105, 148–161.
  42. Van den Bossche Joris, Theunis Jan, Elen Bart, Peters Jan, Botteldooren Dick, and Bernard De Baets. “Opportunistic mobile air pollution monitoring: a case study with city wardens in Antwerp.” Atmospheric Environment 141 (2016): 408–421.
  43. Van den Bossche J, De Baets B, Verwaeren J, Botteldooren D & Theunis J Development and evaluation of land use regression models for black carbon based on bicycle and pedestrian measurements in the urban environment. Environ. Model. Softw 99, 58–69 (2018).
  44. VoPham T, Hart JE, Laden F, & Chiang Y-Y (2018). Emerging trends in geospatial artificial intelligence (geoAI): Potential applications for environmental epidemiology. Environmental Health: A Global Access Science Source, 17(1), 1–6.
  45. Weichenthal S, Ryswyk K Van, Goldstein A, Bagg S, Shekkarizfard M, & Hatzopoulou M (2016). A land use regression model for ambient ultrafine particles in Montreal, Canada: A comparison of linear regression and a machine learning approach. Environmental Research, 146, 65–72.
  46. World Health Organization (May 2, 2018). 9 out of 10 people worldwide breathe polluted air, but more countries are taking action. Available at https://www.who.int/news-room/detail/02-05-2018-9-out-of-10-people-worldwide-breathe-polluted-airbut-more-countries-are-taking-action. Accessed September 16, 2018.
  47. Yang P, Yang YH, Zhou BB, Zomaya AY. A review of ensemble methods in bioinformatics. Current Bioinformatics, 5 (2010), pp. 296–308.
  48. Zikova N et al. Estimating Hourly Concentrations of PM2.5 across a Metropolitan Area Using Low-Cost Particle Monitors. Sensors 17, 1922 (2017).
