Sequence-oriented sensitive analysis for PM2.5 exposure and risk assessment using interactive process mining

Eduardo Illueca Fernández; Carlos Fernández Llatas; Antonio Jesús Jara Valera; Jesualdo Tomás Fernández Breis; Fernando Seoane Martinez

doi:10.1371/journal.pone.0290372

2023 Aug 24;18(8):e0290372. doi: 10.1371/journal.pone.0290372


Editor: Sathishkumar V E

PMCID: PMC10449204 PMID: 37616197

Abstract

The World Health Organization has estimated that air pollution will be one of the most significant environmental challenges in the coming years, and air quality monitoring and climate change mitigation actions have been promoted under the Paris Agreement because of their impact on mortality risk. Thus, it is essential to develop a methodology that supports experts in making decisions based on exposure data, identifying exposure-related activities, and proposing mitigation scenarios. In this context, the emergence of Interactive Process Mining, a discipline that has progressed in healthcare in recent years, could help to develop a methodology based on human knowledge. For this reason, we propose a new methodology for a sequence-oriented sensitive analysis to identify the best activities and parameters on which to base a mitigation policy. This methodology is innovative in the following points: i) we present the first application of Interactive Process Mining to personal pollution exposure mitigation; ii) our solution reduces the computational cost and time of traditional sensitivity analysis; iii) the methodology is human-oriented in the sense that the process should be carried out together with the environmental expert; and iv) our solution has been tested with synthetic data to explore its viability before moving to physical exposure measurements, taking the city of Valencia as the use case and overcoming the difficulty of performing exposure measurements. This dataset has been generated with a model that considers the city of Valencia’s demographic and epidemiological statistics. We have demonstrated that the assessments done using sequence-oriented sensitive analysis can identify target activities. The proposed scenarios can improve the initial KPIs: in the best scenario, we reduce the population exposure by 18% and the relative risk by 12%. Consequently, our proposal could be used with real data in future steps, becoming an innovative point for air pollution mitigation and environmental improvement.

Introduction

Air pollution is a leading health risk, significantly affecting morbidity and mortality [1]. Of particular importance is the effect of aerosols and particulate matter (PM). Despite the considerable improvement in European air quality over the past decades, in 2019 approximately 74% of the urban population was still exposed to PM2.5 concentrations exceeding the World Health Organization (WHO) air quality guidelines for health, and 70% of official air quality monitoring points in the European Union reported values above these limits. These values are expected to increase due to the movement of the population to cities [2]. For this reason, particulate matter is estimated to be among the five leading mortality risk factors worldwide, causing 4.2 million premature deaths [3]. In addition, other studies show that more than half of the mortality burden can be attributed to exposure to suspended particles with an aerodynamic diameter of less than 2.5 μm (PM2.5) [4], and that these particles can trigger adult vascular disease events [5].

In this context of increasing morbidity, estimating the cost of these risks is key. The health impact of a pollutant such as PM2.5 is related to the population’s exposure to this pollutant. The exposure of a single individual can be computed as the sum of the products of the time spent by a person in different environments and the time-averaged pollutant concentrations occurring in those locations [6]. Thus, population exposure is an aggregation of the exposure of all individuals who belong to that population, and it is necessary to understand the health risk in specific population groups. To quantify this effect, integrated exposure-response functions estimate the relative risk (RR) associated with PM2.5. These functions are models that relate the exposure to the relative risk [7]. Thus, there are two main indicators to evaluate the health impact of air pollution: i) the population exposure, which is homogeneous for all citizens, and ii) the relative risk, which can take into account other epidemiological factors and other health risks present in the population.

To calculate these indicators, it is necessary to define a model that considers two inputs: (1) the geospatially distributed PM2.5 concentration—or the relevant pollutant—and a geolocated list of the activities performed by each member of the population; and (2) the duration of each of these activities. This approach is employed by the EXPLUME model, which uses as inputs the simulations performed by the CHIMERE chemical transport model to simulate the geospatial distribution of PM2.5 concentration, and the list and duration of activities are based on data from the French Institute of Statistics [8] in the city of Paris. The main limitation of this study is that it does not integrate each individual’s activities. In other words, a list of activities is generated only for the aggregate population, and the activity’s effect on the overall scenario is not considered. In our view, this limits the model considerably and leads to a loss of information because it is not possible to track how pollution affects each individual, and it is not possible to compute individual KPIs. Thus, sequential data and sequence-oriented analysis are critical to overcoming these limitations.

Therefore, applying interactive process mining to the traces describing the activities can help to exploit this information more comprehensively [9]. We hypothesise that the application of interactive process mining will allow us to perform a sequence-oriented sensitivity analysis and identify the best scenarios for reducing the population’s exposure to PM2.5 and the associated risk more quickly and accurately. To test this hypothesis, a synthetic dataset will be generated with a methodology similar to that proposed by EXPLUME [8], using data from the Spanish Institute of Statistics (https://www.ine.es/) for the city of Valencia (Spain), and interactive process mining will be applied to select the best scenarios, using personal exposure and relative risk as KPIs. These indicators will then be recalculated for the proposed strategies, quantifying the improvements. To our knowledge, the application of interactive process mining to exposure assessment is new in the state of the art and has the following advantages: i) it allows the exposure of each individual in the population to be analysed and quantified, as well as the activities they perform; ii) it generates knowledge in a way that is understandable for public authorities and citizens; and iii) it allows the new infrastructures related to Smart Cities to be exploited, where information about citizens’ activities can be obtained through mobile applications and IoT devices [10].

The rest of the paper is structured as follows. First, the State of the Art section sets out the theoretical background and the work performed in air quality modelling and exposure assessment, as well as the approaches to estimate relative risk. Then, the Materials and Methods section describes the following points: i) the methodology used for modelling and the strategy for generating the population and sequence data; ii) the interactive process mining workflow; and iii) the KPIs computation. From these simulations and computations, a set of Results is obtained that shows the effectiveness of each scenario. These results and the feasibility of the proposed methodology are then analysed in the Discussion section and compared with the state of the art, identifying the next steps for implementation in a real pilot. Finally, the main conclusions and future challenges are presented in the Conclusions.

State of the art

In recent years, a large amount of research has been performed to assess the health impact of particulate matter and its relationship with population exposure. These impacts can be classified into long-term effects and short-term effects. Particulate matter pollution originates from a wide range of anthropogenic and natural sources, and its characteristics can vary in size distribution, chemical composition, and other properties. The resulting health outcomes also vary substantially, depending on the physiological system or organ affected in each individual [11]. For this reason, the health outcomes are diverse and not well known. Some cohort studies reported an excess risk for all-cause and cardiovascular mortality due to long-term exposure to PM2.5 [12], and exposure to particles results in immediate effects on respiratory admissions [13].

To quantify this effect on public health, it is necessary to determine the exposure of the population to the pollutant, for which two inputs are needed: i) the personal activities performed during an interval of time, and ii) the air quality in the environment where the activity is performed. Regarding the first point, the main approach in the state of the art is to use individual sensors to monitor the activities of the citizens and the output concentration of the pollutants [6]. However, collecting this activity data can be complex and requires the consent of the participants in the experiment. One alternative consists of asking participants to complete a diary questionnaire that includes time, activities, and location. This approach has been applied in the school context [14]. A more general approach is to model this activity data using different strategies [8]. For instance, activity patterns, which specify how long a person stays in each microenvironment, were derived from the Multinational Time Use Study (MTUS), which combines more than a million diary days from over 70 randomly sampled national-scale surveys into a single standardised format. MTUS allows researchers to analyse the time spent by people in various sorts of work and leisure activities over the last 55 years and across 30 countries [15].

These activities must then be cross-checked with particulate matter values. For this, a concentration value is needed for each of the geolocations of the activities. In that sense, the development of air quality models has facilitated the generation of spatially continuous data. The most robust Eulerian model in the state of the art regarding air quality and particulate matter is CHIMERE, which provides a concentration value for each of the cells of a grid [16]. However, exposure calculations require a high resolution that classical models do not have. Two strategies are generally used to improve the resolution of air quality data: i) the use of street canyon models, which simplify some of the equations [17], and ii) downscaling approaches based on linear regression models [18]. These models also allow the estimation of particle trajectories, especially when combined with Lagrangian models such as HYSPLIT [19]. It should be noted that all these models have been validated in real contexts, such as the study carried out in the city of Milan [20].

However, the exposure alone is not enough to assess the health risk. For this reason, in 2015 the WHO and Europe recommended a set of linear concentration–response functions for the main air pollutants and related health outcomes [21]. These functions are currently widely used for health assessments, e.g. on a European scale by the EEA. However, the optimal shape of the concentration–response functions is still widely debated, including whether there should be a threshold or lower limit and whether linear functions are appropriate [11]. Consequently, and beyond these exposure-risk functions, more complex models have been developed, such as the Global Exposure Mortality Model (GEMM) functions. For PM2.5, non-accidental mortality generally follows a supralinear association at lower concentrations and a near-linear association at higher concentrations, showing that health impacts related to PM2.5 exposure have been underestimated at both the global and regional scales [22]. This model has been used recently to estimate cardiovascular mortality, obtaining a value of around 790 000 premature deaths [23].

Nevertheless, the methods presented in the state of the art do not consider a fundamental factor in an exposure study: time. Each activity registered in this kind of dataset is associated with a timestamp and a duration. Therefore, the application of new technologies in the data science field can help deal with new challenges related to timestamp management. In this context, process mining represents a collection of tools, methods, techniques and algorithms that allow a better understanding of the execution of a process by analysing the operational data generated during its execution [24]. This approach fits activity sequence analysis, and this study is, to our knowledge, the first contribution of process mining to individual exposure assessment in the state of the art.

Materials and methods

Fundamentals and basic concepts

The atmosphere, whether in urban or remote areas, contains many suspended aerosol particles. From a physical point of view, particulate matter (PM) is a mixture of solid particles and liquid droplets found in the air. Some particles, such as dust, dirt, soot, or smoke, are large or dark enough to be seen with the naked eye. Others are so small they can only be detected using an electron microscope. This wide size range can be appreciated by considering that the mass of one 10 μm particle is equivalent to the mass of one billion 10 nm particles. Thus, working with each particle as a single entity is difficult. It is necessary to work with particle populations characterised by a cumulative size distribution, defined as the fraction of particles that are smaller than or equal to a given size. From this cumulative distribution, the concepts PM10, PM2.5, and PM1 arise, which are particles smaller than or equal to 10 μm, 2.5 μm, and 1 μm, respectively [25].

This work focuses on the PM2.5 population, also called the PM2.5 fraction, because of its impact on health. A percentage of a particle population can pass through the alveolar barrier in the lungs and reach the bloodstream. This percentage of particles is the respirable fraction responsible for adverse health effects and premature deaths. The size of these respirable particles depends on their chemical composition and the body weight of the individual. The percentage of particles in the PM2.5 fraction that can pass the alveolar barrier is close to 100%, showing the need to focus mitigation policies on this dangerous pollutant [26].

To design effective strategies, it is necessary to measure the impact of a given concentration of PM2.5 in the air. This is done by computing KPIs. Exposure is the KPI closest to concentration, and it has been previously defined as the amount of particles a citizen is exposed to in an interval of time. This amount of pollutant affects the individual and can be modelled like the effect of other drugs or toxins. In this sense, concentration-response functions act as a model that measures this effect on a concrete outcome, generally a pathology. There are many concentration-response functions that link PM2.5 with several health issues, but this work focuses on the effect of PM2.5 on premature mortality [21].

Synthetic data generation

The final objective is to generate a dataset with the activities performed by the citizens of a population throughout a working day and to relate each activity to the individual exposure and the mortality relative risk. To generate this dataset, a model has been implemented based on the workflow defined in Fig 1, which is detailed in this section. The first step in data generation is the domain definition. This process involves defining the perimeter of the use case, the spatial resolution and the locations of the buildings, workplaces, schools, etc. and their properties. This definition is done using the geoJSON standard [27], which allows the addition of georeferenced polygons and points. According to this format, two files are generated: i) the gridded domain divided into square cells, with a resolution of 5 x 5 km, and ii) a file with the coordinates of the locations, buildings and residences where citizens could live. In addition to the georeferenced data, the geoJSON format allows other attributes to be added to the polygons and points. For the cells, the most important attribute is the cell identifier. For the locations, the identifier and other important metadata are generated, such as the location type and the indoor/outdoor (I/O) ratios. This last parameter is required for the personal exposure computation, so it is necessary to define these I/O ratios accurately. For this reason, we use the I/O ratios proposed in [28] for several buildings and activities.
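The domain definition described above can be illustrated with a minimal Python sketch that builds a gridded geoJSON FeatureCollection and one location point. The bounding coordinates, the cell size in degrees, and the property names `cell_id`, `location_type` and `io_ratio` are hypothetical choices made for this example; they are not taken from the files used in the study.

```python
import json


def build_grid(min_lon, min_lat, n_cols, n_rows, cell_deg):
    """Return a geoJSON FeatureCollection of square cells.

    cell_deg is the cell side in degrees (an illustrative simplification;
    a metric grid would need a projected coordinate system).
    """
    features = []
    for row in range(n_rows):
        for col in range(n_cols):
            west = min_lon + col * cell_deg
            south = min_lat + row * cell_deg
            east, north = west + cell_deg, south + cell_deg
            # Closed exterior ring, counterclockwise, as [lon, lat] pairs.
            ring = [[west, south], [east, south], [east, north],
                    [west, north], [west, south]]
            features.append({
                "type": "Feature",
                "properties": {"cell_id": row * n_cols + col},  # hypothetical name
                "geometry": {"type": "Polygon", "coordinates": [ring]},
            })
    return {"type": "FeatureCollection", "features": features}


def build_location(location_id, lon, lat, location_type, io_ratio):
    """Return a geoJSON point for a building, with hypothetical metadata fields."""
    return {
        "type": "Feature",
        "properties": {"id": location_id,
                       "location_type": location_type,
                       "io_ratio": io_ratio},
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
    }


# Illustrative values only: a 3 x 2 grid near Valencia and one residence.
grid = build_grid(-0.45, 39.40, n_cols=3, n_rows=2, cell_deg=0.05)
residence = build_location("res-001", -0.38, 39.47, "residence", 0.7)
grid_json = json.dumps(grid)
```

Keeping the cell identifier and the I/O ratio as feature properties means that both files can be read back with any geoJSON-aware tool without a separate lookup table.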

Fig 1 — First, collected data from different sources is mapped in the domain geoJSON, generating gridded air quality data—after applying a dispersion model—and demographic data. This is combined with the dose-response functions from WHO to generate sequence activity data related with personal exposure and mortality relative risk.

Since the analysis will be performed with 2018 data, the air quality data is taken from the historical records of the air quality stations. This information is in open data format and accessible through the European Air Quality Portal API. The data for the year 2018 have been downloaded for the stations ES1181A, ES1911A, ES1239A, ES1926A, ES1884A, ES1619A and ES1912A. The raw data provides air quality measurements for only seven locations, whereas gridded data, with a value for each cell defined in the domain, is required to compute the individual exposure. To overcome this problem, a bilinear interpolation model (Eqs 1 and 2) is used to fill the cells with PM2.5 concentration values, where c_i is the PM2.5 concentration in the cell i; c_j is the PM2.5 concentration in the station j; w_j is the weight of the station j; n is the number of stations, in this case n = 7; d_ij is the distance between the cell i and the station j; and d_max is the maximum computed distance.

\begin{matrix} c_{i} = \sum_{j = 1}^{n} c_{j} * w_{j} \end{matrix}

(1)

\begin{matrix} w_{j} = \frac{1}{d_{ij} (1 - d_{max})} \end{matrix}

(2)
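A minimal Python sketch of the weighted sum in Eq 1 is given below. Note that the weights used here are plain normalised inverse-distance weights, swapped in for illustration; they are not the weighting of Eq 2. The station coordinates, concentrations and function names are hypothetical.

```python
import math


def inverse_distance_weights(distances):
    """Normalised inverse-distance weights (illustrative stand-in, not Eq 2)."""
    # A station located exactly on the cell centre takes all the weight.
    for j, d in enumerate(distances):
        if d == 0.0:
            return [1.0 if k == j else 0.0 for k in range(len(distances))]
    raw = [1.0 / d for d in distances]
    total = sum(raw)
    return [r / total for r in raw]


def interpolate_cell(cell_xy, stations):
    """Eq 1: c_i = sum_j c_j * w_j for a single cell.

    stations is a list of (x, y, concentration) tuples in a planar
    coordinate system; cell_xy is the (x, y) centre of the cell.
    """
    distances = [math.hypot(cell_xy[0] - x, cell_xy[1] - y)
                 for x, y, _ in stations]
    weights = inverse_distance_weights(distances)
    return sum(c * w for (_, _, c), w in zip(stations, weights))


# Hypothetical stations: (x in km, y in km, PM2.5 in ug/m3).
stations = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0), (0.0, 10.0, 30.0)]
cell_value = interpolate_cell((5.0, 0.0), stations)
```

Because the weights are normalised to sum to one, every interpolated cell value stays within the range of the station measurements.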

The simulated sample population is composed of 15795 individuals, around 2% of the total population of Valencia. This value was chosen because the sample is small enough to be computationally tractable but big enough to keep all the information that characterises the population. A unique identifier is assigned to each citizen, and information about age, sex, habits, and pathologies is generated for each one. In addition, each citizen is assigned to one of the residences (locations where a citizen could live) defined in the geoJSON file and to one workplace. This workplace could be in industry, services, agriculture, a school, a university, or another residence. Last, information about the preferred means of transport (metro, car or by foot) is added, as well as the type and location of free-time activities. The statistics used to generate this information are taken from the INE (Spanish Institute of Statistics).

The next step is the computation of the activity sequences. For each citizen, the sequence of daily activities has been modelled. This step is critical because it is necessary to know the duration of the activities to define the timestamps. In this context, two types of activity are defined: static activities and movements. For the first type, the computation is straightforward if we assume that all people work 8 hours per day (6 hours for school and university) and spend a constant time on other static activities. The durations are modified by adding Gaussian noise to increase the variability. For movements, the computation is more complicated. It is necessary to calculate the distance between the start and end points and the duration of the movement, assuming that the velocity is 5.7 km/h by foot and 40 km/h by car or metro. To generate this sequence of activities, we propose a methodology similar to the one proposed by the EXPLUME model [8].
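The duration computation for both activity types can be sketched in Python as follows. The speeds are the ones stated above; the use of the straight-line (haversine) distance, the noise standard deviation and the coordinates are assumptions made for this example, since the text does not specify them.

```python
import math
import random

# Speeds stated in the text, in km/h.
SPEED_KMH = {"foot": 5.7, "car": 40.0, "metro": 40.0}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (assumed here; a street network is not modelled)."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def movement_hours(start, end, mode):
    """Duration of a movement between two (lat, lon) points, in hours."""
    return haversine_km(start[0], start[1], end[0], end[1]) / SPEED_KMH[mode]


def static_hours(base_hours, sigma_hours, rng):
    """Duration of a static activity with Gaussian noise, never negative."""
    return max(0.0, rng.gauss(base_hours, sigma_hours))


# Hypothetical home and workplace coordinates in Valencia.
home, work = (39.47, -0.38), (39.48, -0.35)
rng = random.Random(42)  # seeded so that the synthetic data is reproducible
commute_by_foot = movement_hours(home, work, "foot")
commute_by_car = movement_hours(home, work, "car")
work_duration = static_hours(8.0, 0.5, rng)  # sigma of 0.5 h is an assumption
```

Seeding the random generator keeps the synthetic traces reproducible between runs, which is useful when the same base scenario has to be compared against several mitigation scenarios.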

The exposure of a single individual is computed using the gridded air quality data and the activity sequences as input [6]. The proposed approach calculates the exposure of the citizen i (E_i) as the sum of the products of the PM2.5 concentration that the citizen is exposed to in each daily activity and the time spent in that activity, according to Eq 3, where c_j is the average concentration at the location of the activity j, t_j is the duration of the activity j and a is the number of activities performed during a day.

\begin{matrix} E_{i} = \sum_{j = 1}^{a} c_{j} * t_{j} \end{matrix}

(3)
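As a worked illustration of Eq 3, the short Python sketch below sums concentration times duration over a hypothetical day of four activities. The concentration and duration values are invented for the example.

```python
def individual_exposure(activities):
    """Eq 3: E_i = sum over activities of c_j * t_j.

    activities is a list of (concentration in ug/m3, duration in hours)
    pairs, so the result is expressed in ug * h / m3.
    """
    return sum(concentration * hours for concentration, hours in activities)


# Hypothetical day covering 24 hours: home, commute, work, home.
day = [(12.0, 9.0), (20.0, 1.0), (15.0, 8.0), (18.0, 6.0)]
exposure = individual_exposure(day)
```

With these values the four terms are 108, 20, 120 and 108, which add up to 356 μg * h/m³.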

However, there are special cases where this formula is not entirely accurate for calculating individual exposure. The most important, because of its relevance during the COVID-19 pandemic, is the use of face masks. In this case, one study [29] proposes the approximation given in Eq 4, which applies a correction to the concentration values before Eq 3 is applied. In Eq 4, c is the ambient concentration, c_mask is the corrected concentration, PF is the mask protection factor, and f_day is the fraction of the day used for sampling.

\begin{matrix} c_{mask} = \left(\frac{c}{PF} f_{day}\right) + \left(c (1 - f_{day})\right) \end{matrix}

(4)
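Eq 4 can be written as a small Python function, shown here as a sketch with invented input values. The argument checks are an addition for the example and are not part of the cited approximation.

```python
def mask_corrected_concentration(c, protection_factor, f_day):
    """Eq 4: c_mask = (c / PF) * f_day + c * (1 - f_day)."""
    if protection_factor <= 0:
        raise ValueError("protection_factor must be positive")
    if not 0.0 <= f_day <= 1.0:
        raise ValueError("f_day must lie between 0 and 1")
    return (c / protection_factor) * f_day + c * (1.0 - f_day)


# Hypothetical values: 20 ug/m3 ambient, PF of 4, half of the day covered.
corrected = mask_corrected_concentration(20.0, 4.0, 0.5)
```

The two limiting cases are a quick sanity check: with f_day = 0 the ambient concentration is returned unchanged, and with f_day = 1 it is divided by the protection factor.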

The relative risk being minimised is the mortality relative risk due to PM2.5 exposure. In other words, we seek to quantify the risk of death in the population related to the observed levels of PM2.5. This relative risk is composed of several factors. The first one is the relative risk due only to the exposure to pollutants, and the other components are the risk factors or other diseases that could increase the effect of the particles. The final relative risk is the combination of all these risks.

The relative risk assessment is performed using the AirQ+ software developed by the WHO. The program calculates the magnitude of several health effects associated with exposure to the most relevant air pollutants in a given population. AirQ+ has been validated recently in a campaign in Tarragona [30] to measure long-term exposure to ambient particulate matter PM2.5. Specifically, this tool has been used to assess the exposure-response function for PM2.5 in Valencia. The methodology consists of obtaining the historical PM2.5 data for the year 2018 from the seven air quality stations and computing the mean of all the stations. These data are used as input for AirQ+. Once the exposure-response function is obtained, the model is used to compute the relative risk (RR) related to the exposure and the activity. In addition, this mortality relative risk is modified by the risk factors of the individual (diseases, smoking, etc.) by combining all the relative risks that affect the individual. The epidemiological RR is calculated from the death rates of different conditions obtained from the Global Burden of Disease database [31].

Sequence-oriented sensitive analysis

The main goal of this work is to perform a sensitivity analysis to reduce the exposure and the health outcomes by taking action on the activities. Sequence analysis is essential for our proposed optimisation because it avoids population metrics that could be biased by the sampling process. Thus, it allows analyses to be performed for different population sectors, for instance analysing only the persons affected by cardiac issues, or even tracking and optimising the activities of a single individual [14]. In addition, classical sensitivity analysis is performed by running different simulations over the inputs to find the inputs that minimise a set of KPIs. The computational cost of this analysis can be high due to the number of input parameters to consider. Thus, in this paper we propose a new methodology to perform a sequence-oriented sensitive analysis using interactive process mining (IPM).

Process mining is a methodology that uses timestamp information to create human-understandable views that explain a sequence of processes (log). This approach’s principal advantage is avoiding the black box effect in some models in classical machine learning and data science [24]. IPM results from applying Interactive Pattern Recognition methodologies to Process Mining technologies [32]. Interactivity is an essential feature of our semi-automated process discovery, which uses information from the event log and user expertise. That is one of the differential features of our approach. As a result, the sequence-oriented sensitive analysis is composed of the following steps: i) computation of the initial KPIs in the base scenario, ii) ANOVA analysis to identify differences between activities and which activities are related to a high exposure and a high-risk, iii) generation of the process map and significance analysis, iv) proposition of new scenarios and v) computation of the KPIs for the new scenarios and comparison with the initial KPIs.

Interactive process mining is a methodology that requires an iterative human validation by domain experts through the so-called data rodeo. In this way, the domain experts can i) iteratively define a process indicator according to the sensitive analysis goals, ii) analyse and validate the process indicator and iii) be trained in using the process indicator. This process is oriented by an interactive process indicator (IPI)—in this case, IPI can be personal exposure or mortality relative risk. Once an IPI is defined, the domain expert is ready to analyse the data using an interactive process mining tool. This process is repeated until reaching the target scenarios, which should be validated by computing the KPIs again and checking for achieving the optimisation goal, in this case, reducing personal exposure and relative risk. In this work, the involved stakeholders are Libelium as domain experts—an IoT company that provides sustainability impact assessment solutions in smart cities—and data scientists from Karolinska Institutet, ITACA-Sabien, and the University of Murcia. The interactive process mining workflow is summarised in Fig 2.

The interactive process mining workflow is semi-automatic and can be summarised in the following steps. First, the CSV file is ingested into PMApp. In this step, it is possible to define filters to remove observations that are not correct; next, the CSV file is transformed into an ECSV file, the proper format for the Process Discovery algorithm. In this work, we use the PALIA algorithm, implemented by the Institute of Information and Communication Technologies (ITACA) of the Universidad Politecnica de Valencia, Valencia, Spain. This algorithm is the most appropriate one for our goals because it is based on activity-based process mining and produces explainable process maps [33]. In addition, it performs better, in terms of efficacy, than other process mining algorithms such as heuristic miner [34] or genetic process mining [35]. Once the process map has been obtained, the transition probabilities and other relevant characteristics, such as the number of traces, are calculated. The last step is semi-automatic and should be performed according to expert guidelines. It consists of computing P-values according to the statistics computed before and defining a new scenario.
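To make the notion of transition probabilities concrete, the following Python sketch counts directly-follows relations in a small set of traces and normalises them per source activity. This is a simple directly-follows computation chosen for illustration; it is not the PALIA algorithm and does not reproduce PMApp. The traces and activity names are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(traces):
    """Estimate P(next activity | current activity) from directly-follows counts."""
    counts = defaultdict(Counter)
    for trace in traces:
        # Pair every activity with the one that immediately follows it.
        for source, target in zip(trace, trace[1:]):
            counts[source][target] += 1
    return {
        source: {target: n / sum(targets.values())
                 for target, n in targets.items()}
        for source, targets in counts.items()
    }


# Hypothetical daily traces, one list of activities per citizen.
traces = [
    ["home", "foot", "services", "foot", "home"],
    ["home", "car", "industry", "car", "home"],
    ["home", "foot", "school", "foot", "home"],
]
probabilities = transition_probabilities(traces)
number_of_traces = len(traces)
```

In this toy log, two of the three citizens leave home by foot, so the estimated probability of the transition from home to foot is 2/3.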

PMApp is a tool for generating custom PM dashboards to visualise process data coming from health organisations and producing advanced process views provided by the IPIs to empower the analysis made by the health stakeholders. PMApp also enables the creation of interactive dashboards that respond to the selection of arrows and nodes by capturing GUI events. It also allows the user to create custom forms and algorithms for discovery, filters, enhancement maps, etc. In PMApp, it is possible to render maps that enhance the discovered model using colour gradients. With this feature, it is possible to generate specific maps highlighting situations that depend on a customised formulation and are represented by nodes [36]. PMApp implements the PALIA algorithm, which results in the traceability of all learning processes, so each activity is continuously associated with single events [37].

KPIs computation

The last part of the proposed methodology is the quantification of the improvement implied by the proposed scenarios based on numerical KPIs. The following metrics are proposed for this work: i) 24 hours population exposure, ii) Population Relative Risk, iii) Percentage of Risky Activities and iv) Time Spent in risky activities. These quantifiers are not fixed and can be reformulated depending on the context, but their definition is recommended to be agreed upon between the data scientist and the urban health and environment specialist.
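One possible way to compute three of these metrics from the sequence data is sketched below in Python. The exact KPI definitions are not given in the text, so the choices made here are assumptions: population exposure is taken as the mean individual exposure, the set of risky activities is supplied by the expert, and time in risky activities is averaged per citizen. The Population Relative Risk is omitted because it depends on the exposure-response function. All data values are invented.

```python
def compute_kpis(traces, risky_activities):
    """Compute illustrative KPIs from activity traces.

    traces maps a citizen id to a list of
    (activity, concentration in ug/m3, duration in hours) tuples.
    risky_activities is a set of activity names flagged by the expert.
    """
    n_citizens = len(traces)
    exposures = [sum(c * t for _, c, t in activities)
                 for activities in traces.values()]
    all_activities = [a for activities in traces.values() for a in activities]
    risky = [a for a in all_activities if a[0] in risky_activities]
    return {
        # Assumption: population exposure is the mean individual exposure.
        "population_exposure": sum(exposures) / n_citizens,
        "pct_risky_activities": 100.0 * len(risky) / len(all_activities),
        "risky_hours_per_citizen": sum(t for _, _, t in risky) / n_citizens,
    }


# Hypothetical traces for two citizens.
traces = {
    "c1": [("home", 10.0, 14.0), ("foot", 20.0, 1.0),
           ("services", 12.0, 8.0), ("running", 25.0, 1.0)],
    "c2": [("home", 10.0, 16.0), ("agriculture", 22.0, 8.0)],
}
kpis = compute_kpis(traces, risky_activities={"agriculture", "running"})
```

Passing the set of risky activities as an argument reflects the point made above: the quantifiers are not fixed and can be redefined by the data scientist and the domain specialist.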

Results

Dataset generation and initial KPIs

The model generates two output datasets, related to the input parameters and stored in CSV format. The first is the population dataset, which describes all the citizens in the domain. This information includes sex, age, epidemiological data, and the locations of the residence and the workplace. This dataset comprises 15795 individuals, close to 2% of Valencia’s population. The second is the sequences dataset, which keeps the information on all the movements and activities of a working day, using the historical data of 2018-03-01 as concentration input. The previously described KPIs are calculated over this sequence dataset, obtaining a 24-hour Population Exposure of 118 μg * h/m³. In addition, the other indicators show a Population Relative Risk equal to 1.90, a percentage of risky activities of 56% and a time spent in risky activities of around 3.50 hours. These results are plotted in Fig 3.

Fig 3 — In concrete, 24 h population exposure; mortality relative risk; percentage of risky activities and time spent in risky activities.

ANOVA analysis

Once the CSV file with the log information about the sequence data is generated, it is used to perform the ANOVA analysis. For the exposure, we obtain an F_value = 14846 that is associated with a P_value = 0. If we compare the means in a Tukey posthoc analysis [38], we observe that the activities with high individual exposure are agriculture and free-time activities. In addition, other work activities like industry, services, or school, which take up a large amount of time, are related to high exposure. These values are detailed in Fig 4, which includes the ANOVA table as well as a heatmap showing the exposure mean differences between activities.

Fig 4 — The ANOVA table is shown in addition to the heatmap with the exposure difference between activities and the plot of the Tukey test.
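For readers unfamiliar with the statistic, the F value of a one-way ANOVA can be computed by hand as the ratio of the between-group to the within-group mean squares. The Python sketch below does this for a tiny invented dataset of exposures grouped by activity; it does not reproduce the reported values, and it omits the P-value and the Tukey test, which in practice would come from a statistics package (for example `scipy.stats.f_oneway` for the F test).

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a dict of name -> values."""
    all_values = [v for values in groups.values() for v in values]
    n_total = len(all_values)
    n_groups = len(groups)
    grand_mean = sum(all_values) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        # Variation of the group mean around the grand mean.
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        # Variation of the observations around their own group mean.
        ss_within += sum((v - group_mean) ** 2 for v in values)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ms_between / ms_within


# Hypothetical exposure values per activity.
exposure_by_activity = {
    "home": [1.0, 2.0, 3.0],
    "services": [2.0, 3.0, 4.0],
    "agriculture": [5.0, 6.0, 7.0],
}
f_value = one_way_anova_f(exposure_by_activity)
```

For this toy dataset the between-group sum of squares is 26 and the within-group sum of squares is 6, giving F = (26 / 2) / (6 / 6) = 13.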

Regarding the population relative risk, the ANOVA analysis yields F_value = 739 and P_value = 0, showing that there are significant differences between the risk of activities, so it is possible to define risky and safe activities. The high-risk activities are agriculture, leisure, and running. A high correlation exists between the activities with high exposure and the activities with an increased relative risk. These results are shown in Fig 5, which includes the ANOVA table, a heatmap that shows mean differences between activities and the Tukey posthoc analysis.

Fig 5 — The ANOVA table is shown in addition to the heatmap with the mortality relative risk difference between activities and the plot of the Tukey test.

Interactive Process Mining

The Interactive Process Mining workflow provides a process map that explains all the daily activities of the population (Fig 6). The colour gradient in each node represents the time spent: the redder the node, the more time spent on the activity. This enhancement illustrates the characteristics of daily activities. In this case, the highest times, around eight hours, are associated with work activities. Thus, as personal exposure is a function of time, activities with a long duration can be interpreted as related to increased exposure. In contrast, greener activities could be associated with low exposure and risk. According to the process map shown in Fig 6, the activities with the most time spent are staying at home, the work activities (agriculture, services, industry and residence, understood as people who work in another home) and the studying activities (university, school and library). On the other hand, the activities that take little time are the movement activities (by foot, private car and metro), because they depend on the distance, which in some cases is small.
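The idea of a process map annotated with time spent can be sketched with a simple directly-follows graph. The example below is a simplification under stated assumptions: the event tuples are invented, the field order (citizen id, order, activity, duration in minutes) is chosen for the example, and the discovery algorithm of the PMApp tool is not reproduced here.

```python
from collections import Counter, defaultdict

# (citizen id, order, activity, duration in minutes); invented rows.
events = [
    (1, 0, "home", 420), (1, 1, "private car", 20),
    (1, 2, "services", 480), (1, 3, "home", 520),
    (2, 0, "home", 480), (2, 1, "foot", 15),
    (2, 2, "school", 360), (2, 3, "home", 585),
]


def build_process_map(events):
    """Return (mean minutes per activity, directly-follows edge counts)."""
    traces = defaultdict(list)
    for cid, order, activity, duration in events:
        traces[cid].append((order, activity, duration))
    durations = defaultdict(list)
    edges = Counter()
    for steps in traces.values():
        steps.sort()  # sort each trace by its order field
        for _, activity, duration in steps:
            durations[activity].append(duration)
        # Count each pair of consecutive activities as one edge traversal.
        for (_, a, _), (_, b, _) in zip(steps, steps[1:]):
            edges[(a, b)] += 1
    mean_time = {a: sum(d) / len(d) for a, d in durations.items()}
    return mean_time, edges


mean_time, edges = build_process_map(events)
print(mean_time["home"], edges[("services", "home")])
```

In a rendered map, `mean_time` would drive the node colour gradient and `edges` the arcs between nodes.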

However, this process map is not enough to assess target activities. Therefore, it is necessary to compute the process map of the traces associated with high exposure (Fig 7). When the statistical significance is calculated (nodes highlighted in yellow), the nodes that show a significant difference in the time spent in the high-exposure traces are staying at home, moving by foot and moving by private car. In addition, the running node connected to the private car is also significant. Therefore, this significance assessment reveals a set of target activities. It is important to note that this analysis only shows that a difference is significant; the difference may be caused by either a higher or a lower time spent on the activity. To clarify this, it is necessary to compare these Interactive Process Mining results with the results of the ANOVA analysis. According to this methodology, the following four scenarios are proposed: an increase in private car velocity, the improvement of the infrastructure of residential buildings, the recommendation of safer places to exercise, and the regulation of face masks in outdoor and public environments. Then, the best scenarios are combined to improve the KPIs further.
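The comparison between all traces and the high-exposure traces can be illustrated as follows. This is only a sketch: the rows are invented, the cut-off `THRESHOLD` is hypothetical, and the code reports plain differences of mean time per activity rather than the significance test applied by the tool, which is not specified in this section.

```python
from collections import defaultdict

# (citizen id, activity, duration in minutes, exposure in ug*min/m3); invented rows.
events = [
    (1, "home", 900, 2700), (1, "private car", 30, 300),
    (2, "home", 600, 1800), (2, "private car", 90, 900),
    (3, "home", 1000, 3000), (3, "foot", 20, 100),
    (4, "home", 500, 1500), (4, "foot", 120, 2400),
]


def high_exposure_citizens(events, threshold):
    """Citizens whose summed daily exposure exceeds the threshold."""
    totals = defaultdict(float)
    for cid, _, _, exposure in events:
        totals[cid] += exposure
    return {cid for cid, total in totals.items() if total > threshold}


def mean_time_by_activity(events, citizens=None):
    """Mean duration per activity, optionally restricted to some citizens."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cid, activity, duration, _ in events:
        if citizens is None or cid in citizens:
            sums[activity] += duration
            counts[activity] += 1
    return {a: sums[a] / counts[a] for a in sums}


THRESHOLD = 2900.0  # hypothetical cut-off for a "high exposure" trace
high = high_exposure_citizens(events, THRESHOLD)
overall = mean_time_by_activity(events)
subset = mean_time_by_activity(events, high)
# Positive values: more time spent in the high-exposure traces.
diff = {a: subset[a] - overall[a] for a in subset}
print(sorted(high), diff)
```

The sign of each difference indicates whether the high-exposure traces spend more or less time on the activity, which is the information a significance flag alone does not convey.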

Fig 7 — Highlighted nodes are those with a significant difference in the time spent compared with the process map in Fig 6.

Scenario 1: Increase private car velocity

The first scenario yields a small improvement in the KPIs, as shown in Fig 8. The 24-hour population exposure is reduced to 118 μg * h/m³, improving this KPI by 1%. In contrast, the population’s relative risk remains constant at 1.9. In addition, the percentage of risky activities is reduced to 54%, while the number of hours spent on these unsafe activities stays the same (3.51 h).

Scenario 2: Improving building infrastructures

The second scenario achieves a considerable improvement in the KPIs, as shown in Fig 9. The 24-hour population exposure is reduced to 98 μg * h/m³, improving this value by 18%. This effect is also observed in the population relative risk, which changes from 1.90 to 1.70, a 12% improvement. In addition, the percentage of risky activities is reduced to 54%, and the number of hours spent on these dangerous activities remains at 3.5.

Scenario 3: Recommend safe running places

The third scenario achieves a slight improvement on three of the four KPIs, as shown in the gauges (Fig 10). The 24-hour population exposure is reduced to 116 μg * h/m³, a reduction of 3%. On the other hand, the population relative risk remains at 1.90. In addition, the percentage of risky activities remains at 56%, and the number of hours spent in unsafe activities also remains at 3.50.

Scenario 4: Use face masks in outdoor and public environments

This scenario proposes regulating the use of face masks in outdoor activities and in public buildings such as universities, schools, libraries and shopping centres. Implementing these measures achieves an important improvement in all KPIs, as shown in the gauges (Fig 11). The 24-hour population exposure is reduced to 110 μg * h/m³, a reduction of 8%. On the other hand, the population’s relative risk is reduced to 1.80. Last, the percentage of risky activities decreases to 48%, and the number of hours spent in unsafe activities is reduced to 3.50.

Scenario 5: Scenario 2 + Scenario 4

The last scenario combines the two best ones (Scenario 2 + Scenario 4), applying the face mask regulations and the improvement of building infrastructures to reduce I/O ratios. This combination of measures obtains the best results, as shown in Fig 12. The 24-hour population exposure is reduced to 88 μg * h/m³, a reduction of 26%. On the other hand, the population’s relative risk decreases by 16% to 1.60. In addition, the percentage of risky activities is reduced to 46%, and the number of hours spent in unsafe activities to 3.30.

Analysis of model variability

This section analyses the variability of the KPIs when several simulations are performed. This is necessary because, when working with stochastic models, the sequences generated in each iteration are different. These analyses have been carried out in triplicate for each of the proposed scenarios (Table 1). In general, little variability can be observed, especially in Scenario 2 and Scenario 5, where the same results are obtained for the three experiments. On the other hand, the most sensitive KPI is 24 h population exposure, followed by the Percentage of Risky Activities and Time Spent in Risky Activities. Finally, the Mortality Relative Risk does not change in the different simulations.

Table 1. KPIs for different iterations of the model. Each cell lists the values of the three runs.

Scenario	Population Exposure (μg * h/m³)	Relative Risk	Risky Activities (%)	Time Spent (h)
Scenario 1	117 / 117 / 116	1.9 / 1.9 / 1.9	54 / 54 / 55	3.5 / 3.6 / 3.5
Scenario 2	96 / 96 / 96	1.7 / 1.7 / 1.7	55 / 55 / 55	3.5 / 3.5 / 3.5
Scenario 3	117 / 117 / 116	1.9 / 1.9 / 1.9	55 / 55 / 55	3.6 / 3.6 / 3.5
Scenario 4	110 / 109 / 110	1.8 / 1.8 / 1.8	48 / 47 / 48	3.4 / 3.3 / 3.3
Scenario 5	88 / 88 / 88	1.6 / 1.6 / 1.6	46 / 46 / 46	3.3 / 3.3 / 3.3
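The variability reported in Table 1 can be summarised with the range (maximum minus minimum) of each KPI over the three runs. The short sketch below uses the values of Table 1 directly; the dictionary keys and the helper name `spread` are chosen for this example only.

```python
# Values copied from Table 1 (three runs per scenario and KPI).
table1 = {
    "Scenario 1": {"exposure": [117, 117, 116], "relative_risk": [1.9, 1.9, 1.9],
                   "risky_pct": [54, 54, 55], "time_h": [3.5, 3.6, 3.5]},
    "Scenario 2": {"exposure": [96, 96, 96], "relative_risk": [1.7, 1.7, 1.7],
                   "risky_pct": [55, 55, 55], "time_h": [3.5, 3.5, 3.5]},
    "Scenario 3": {"exposure": [117, 117, 116], "relative_risk": [1.9, 1.9, 1.9],
                   "risky_pct": [55, 55, 55], "time_h": [3.6, 3.6, 3.5]},
    "Scenario 4": {"exposure": [110, 109, 110], "relative_risk": [1.8, 1.8, 1.8],
                   "risky_pct": [48, 47, 48], "time_h": [3.4, 3.3, 3.3]},
    "Scenario 5": {"exposure": [88, 88, 88], "relative_risk": [1.6, 1.6, 1.6],
                   "risky_pct": [46, 46, 46], "time_h": [3.3, 3.3, 3.3]},
}


def spread(values):
    """Range of the runs, rounded to absorb floating-point noise."""
    return round(max(values) - min(values), 2)


ranges = {s: {k: spread(v) for k, v in kpis.items()} for s, kpis in table1.items()}
# Scenarios in which every KPI is identical across the three runs.
stable = [s for s, r in ranges.items() if all(x == 0 for x in r.values())]
print(stable)
```

With only three runs per scenario, the range is a more honest summary than a standard deviation.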

Discussion

The previous results demonstrate that Interactive Process Mining can identify target activities for exposure mitigation, showing that this methodology can perform a sequence-oriented sensitive analysis. This hypothesis is strengthened because the proposed scenarios reduce both population exposure and population risk: in the best case, the improvement of building infrastructures combined with a strong face mask policy reduces exposure by 26% and the mortality relative risk by 16%. The combination of these two scenarios is strong because they act in separate ways. The improvement of building infrastructures reduces population exposure by 18% and relative risk by 12%. In contrast, the use of face masks reduces these two KPIs by only 8% and 5%, but has a great impact on the percentage of risky activities, which is reduced by 14%. Thus, it is possible to conclude that the combination of both scenarios exploits important synergies that yield better KPIs.

These results support the view that Interactive Process Mining could become a powerful environmental modelling and exposure assessment tool. It is important to note that the scenario related to residence buildings is remarkably successful because of the high time spent at home, as shown in the process map, and because home is a cut node in the graph: all the traces pass through this activity. The other scenarios are less successful because they do not affect all individuals. The time spent on risky activities and their percentage do not change much, since the time distribution of the activities (the time people spend in each one) does not change in our scenarios. These conclusions are similar to those obtained in the EXPLUME simulation [8].

This sequence-oriented sensitive analysis has several advantages over the classical methodologies. The main strength is that it works on sequences and sequence data, so experts in interactive process mining can trace the complete path of activities. The node-level results, without the edges between nodes, can also be extracted from the ANOVA and are similar, but the process mining results are more explainable. For instance, by studying the sequences it is possible to conclude that acting on building infrastructure is quite effective because it affects an activity that appears in all the sequences, and the same reasoning explains the less successful scenarios. Thus, the interactive process mining workflow is more agile, explainable, and human-oriented, avoiding the black box effect. This new methodology points to the target parameters, so it is only necessary to iterate on these few inputs, whereas the traditional approach iterates over all the parameters to find the best combination. However, this hypothesis should be tested by conducting a pilot with real data. The approach can also be extrapolated to forecasting, using chemistry transport models to generate predictions in gridded domains [16], and to other pollutants, as was done in Paris [8]. Furthermore, the most critical point is that sequence-oriented sensitive analysis is a more intelligent approach because it is based on understandable models and human criteria. However, this methodology has some potential limitations: it is possible to miss some crucial parameters in the exposure contribution, and the outcomes are susceptible to wrong human interpretations.

In this paper, the results obtained for the generated population cannot be extrapolated directly to the actual population of Valencia. Logically, since the model is based on accurate statistics, the synthetic dataset must be representative and have some correlation with the real population. In this work, we can ensure that the data is representative of the overall population. We have used the most recent statistics, and since the number of individuals in our sample is high (100856 activities and 15795 individuals), we can assume that the distribution of the parameters is similar to the real one, according to the central limit theorem. On the other hand, dynamic data is generated using the activity and movement statistics for the city of Valencia, following a methodology validated in previous work [8]. For this reason, the results presented in this paper should be interpreted as a proof of concept of the proposed methodology and mode of use. In other words, our study aims to demonstrate the validity of our method for further steps and studies.

However, this synthetic data is reliable enough to confirm our hypothesis. The first reason is observed in Table 1, where all the simulations compute nearly the same values for the four KPIs in each scenario. There is a little variation in some decimals but no significant differences. This confirms that our method is robust despite being stochastic, and that the samples generated are representative of the same population. Of course, this population could be different from the real one, highlighting the necessity of a real pilot.

In this sense, the rise of wearable smartwatches that allow fitness and sleep activity monitoring has created a cultural phenomenon called the quantified self, whereby members of the general population voluntarily wear tracking devices that continuously log their data in exchange for potential improvements in quality of life or physical performance. This paradigm can be applied to the real-time monitoring of air pollution. Such sensors have been tested in the canarin project for medical purposes [39], enabling people to change their behaviour or avoid pollution. Several studies have demonstrated that citizen engagement against air pollution is low, close to 12% [40]. Thus, the next step is to apply Interactive Process Mining to real data and obtain conclusions for the real population. The emergence of the IoT paradigm allows the creation of architectures in which it is easy to integrate all these processes (Fig 13). In addition, some pilots that combine mobile monitoring and modelling have demonstrated that they can reduce costs and improve accuracy [14].

Last, Interactive Process Mining requires a multidisciplinary approach, since the generated models and process maps must be validated by environmental experts. This has already been observed in Interactive Process Mining applied to healthcare, where the role of data rodeos in coordination with clinicians is essential [37]. A similar methodology should be applied to the sequence-oriented sensitive analysis. As a result, the sequence-oriented sensitive analysis is a paradigm that involves urban planning professionals in understanding the process until a reliable KPI is defined, one that can be computed from the data available in the system and refined through iterative data rodeo sessions. In addition, the expert should participate in the process map interpretation and scenario assessment. This last step is critical because, without a well-defined scenario, the effect on the KPIs could be missed.

Conclusion

In conclusion, this work has developed a methodology based on Interactive Process Mining to perform a sequence-oriented sensitive analysis, demonstrating that the use of Interactive Process Mining substantially improves the identification of target activities related to high exposure and allows experts to choose the best urban policies. Moreover, the KPIs improved, with a reduction of 26% in the PM2.5 exposure and a decrease of 16% in the mortality relative risk in the best scenario. Thus, the results suggest that our initial hypothesis is correct, and the sequence-oriented sensitive analysis could be a powerful tool to improve the environment in cities.

The results obtained also open up new research possibilities, which can be classified along three different aspects: i) testing the methodology with real exposure data, seeking to understand which activities are related to real exposure, and using the architectures and infrastructures of the emerging IoT paradigm; ii) integrating urban experts in the Interactive Process Mining workflow, in the definition of KPIs as well as in the suggestion of scenarios; and iii) improving the workflow by adding new tests and analyses that can complement the results. In addition, the proposed methodology allows progress along the following lines, which have been limited so far in the state of the art: i) the assessment of public health policies to reduce population exposure; ii) the improvement of knowledge about the relation between exposure and risk, enabling researchers to define accurate exposure-response models; and iii) the application of Interactive Process Mining in environmental modelling and air pollution mitigation.

Supporting information

S1 Data. Air quality data.

Gridded air quality data for the city of Valencia. The grid is divided into cells, and the PM2.5 concentration is computed for each one. It is composed of the following variables: cell_id (the identifier for each cell in the grid), PM2.5 (PM2.5 concentration in μg/m³), datetime (date in UTC format), hourId and dayId. File: ./data/aq_data/valencia_aq_gridded_v2.csv.

(CSV)

S2 Data. Population data.

Data for each individual in the population, including epidemiological and social data. It is composed of the following variables: id (citizen identifier), sex, age, obesity, diabetes, asthma, high blood pressure, pulmonar disease, heart disease, anxiety, smoke, alcohol (binary epidemiological variables; a value of 1 means the citizen has the condition), employ, transport, freeTime, residence (the identifiers of the employ, transport, freeTime and residence locations in the geoJSON file). File: ./data/population_data/sample_population_v3.csv.

(CSV)

S3 Data. Base scenario sequences.

Data with the activities performed by citizens in a day, with the corresponding locations, exposure and risk for the base population. It is composed of the following variables: id (citizen identifier), cell_id (the identifier for each cell in the grid; activities related to two or more cells are coded as the concatenation of the cell_id of all the included cells), order, Activity, Duration, DateStart, DateEnd, Sex, Age, Asthma, Diabetes, High Blood Pressure, Pulmonar Disease, Heart Disease, Anxiety, Smoke, Alcohol (binary epidemiological variables; a value of 1 means the citizen has the condition), concentration (PM2.5 concentration in μg * min/m³), level, mortality relative risk. File: ./data/sequence_data/sintetic_data_v6.csv.

(CSV)

S4 Data. Scenario 1 sequences.

Data with the activities performed by citizens in a day, with the corresponding locations, exposure and risk for the first scenario population. It is composed of the following variables: id (citizen identifier), cell_id (the identifier for each cell in the grid; activities related to two or more cells are coded as the concatenation of the cell_id of all the included cells), order, Activity, Duration, DateStart, DateEnd, Sex, Age, Asthma, Diabetes, High Blood Pressure, Pulmonar Disease, Heart Disease, Anxiety, Smoke, Alcohol (binary epidemiological variables; a value of 1 means the citizen has the condition), concentration (PM2.5 concentration in μg/m³), exposure (PM2.5 exposure in μg * min/m³), level, mortality relative risk. File: ./data/sequence_data/sintetic_data_v6_car.csv.

(CSV)

S5 Data. Scenario 2 sequences.

Data with the activities performed by citizens in a day, with the corresponding locations, exposure and risk for the second scenario population. It is composed of the following variables: id (citizen identifier), cell_id (the identifier for each cell in the grid; activities related to two or more cells are coded as the concatenation of the cell_id of all the included cells), order, Activity, Duration, DateStart, DateEnd, Sex, Age, Asthma, Diabetes, High Blood Pressure, Pulmonar Disease, Heart Disease, Anxiety, Smoke, Alcohol (binary epidemiological variables; a value of 1 means the citizen has the condition), concentration (PM2.5 concentration in μg/m³), exposure (PM2.5 exposure in μg * min/m³), level, mortality relative risk. File: ./data/sequence_data/sintetic_data_v6_residence.csv.

(CSV)

S6 Data. Scenario 3 sequences.

Data with the activities performed by citizens in a day, with the corresponding locations, exposure and risk for the third scenario population. It is composed of the following variables: id (citizen identifier), cell_id (the identifier for each cell in the grid; activities related to two or more cells are coded as the concatenation of the cell_id of all the included cells), order, Activity, Duration, DateStart, DateEnd, Sex, Age, Asthma, Diabetes, High Blood Pressure, Pulmonar Disease, Heart Disease, Anxiety, Smoke, Alcohol (binary epidemiological variables; a value of 1 means the citizen has the condition), concentration (PM2.5 concentration in μg/m³), exposure (PM2.5 exposure in μg * min/m³), level, mortality relative risk. File: ./data/sequence_data/sintetic_data_v6_running.csv.

(CSV)

S7 Data. Scenario 4 sequences.

Data with the activities performed by citizens in a day, with the corresponding locations, exposure and risk for the fourth scenario population. It is composed of the following variables: id (citizen identifier), cell_id (the identifier for each cell in the grid; activities related to two or more cells are coded as the concatenation of the cell_id of all the included cells), order, Activity, Duration, DateStart, DateEnd, Sex, Age, Asthma, Diabetes, High Blood Pressure, Pulmonar Disease, Heart Disease, Anxiety, Smoke, Alcohol (binary epidemiological variables; a value of 1 means the citizen has the condition), concentration (PM2.5 concentration in μg/m³), exposure (PM2.5 exposure in μg * min/m³), level, mortality relative risk. File: ./data/sequence_data/sintetic_data_v6_mask.csv.

(CSV)

S8 Data. Scenario 5 sequences.

Data with the activities performed by citizens in a day, with the corresponding locations, exposure and risk for the fifth scenario population. It is composed of the following variables: id (citizen identifier), cell_id (the identifier for each cell in the grid; activities related to two or more cells are coded as the concatenation of the cell_id of all the included cells), order, Activity, Duration, DateStart, DateEnd, Sex, Age, Asthma, Diabetes, High Blood Pressure, Pulmonar Disease, Heart Disease, Anxiety, Smoke, Alcohol (binary epidemiological variables; a value of 1 means the citizen has the condition), concentration (PM2.5 concentration in μg/m³), exposure (PM2.5 exposure in μg * min/m³), level, mortality relative risk. File: ./data/sequence_data/sintetic_data_v6_scenario5.csv.

(CSV)

Acknowledgments

Code availability

The source code is available in the GitHub repository (https://doi.org/10.5281/zenodo.8079155). This includes the scripts to generate the synthetic data and compute the KPIs. The PMApp tool has been developed in the context of the VALUE project and should be requested from the authors (https://valueproject.eu/).

Data Availability

All data and code files are available from the Zenodo repository (http://doi.org/10.5281/zenodo.8079155).

Funding Statement

The author EIF has received funding from Fundacion Séneca (https://fseneca.es/), grant number 21300/FPI/19. The authors have received funding from EIT Health (https://eithealth.eu/), grant number 220649. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. World Health Organization, et al. Review of evidence on health aspects of air pollution: REVIHAAP project: technical report. World Health Organization. Regional Office for Europe; 2021.
2. Ortiz A, Guerreiro C, Soares J, et al. Air quality in Europe-2020 report. European Environment Agency. 2020; p. 164–164.
3. Cohen AJ, Brauer M, Burnett R, Anderson HR, Frostad J, Estep K, et al. Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: an analysis of data from the Global Burden of Diseases Study 2015. The lancet. 2017;389(10082):1907–1918. doi: 10.1016/S0140-6736(17)30505-6
4. Pascal M, de Crouy Chanel P, Wagner V, Corso M, Tillier C, Bentayeb M, et al. The mortality impacts of fine particles in France. Science of the Total Environment. 2016;571:416–425. doi: 10.1016/j.scitotenv.2016.06.213
5. Wu PC, Cheng TJ, Kuo CP, Fu JS, Lai HC, Chiu TY, et al. Transient risk of ambient fine particulate matter on hourly cardiovascular events in Tainan City, Taiwan. PloS one. 2020;15(8):e0238082. doi: 10.1371/journal.pone.0238082
6. Klepeis NE. Modeling human exposure to air pollution. Human exposure analysis. 2006; p. 445–470. doi: 10.1201/9781420012637.ch19
7. Limaye VS, Schöpp W, Amann M. Applying integrated exposure-response functions to PM2.5 pollution in India. International Journal of Environmental Research and Public Health. 2019;16(1):60. doi: 10.3390/ijerph16010060
8. Valari M, Markakis K, Powaga E, Collignan B, Perrussel O. EXPLUME v1.0: a model for personal exposure to ambient O3 and PM2.5. Geoscientific Model Development. 2020;13(3):1075–1094. doi: 10.5194/gmd-13-1075-2020
9. Lull JJ, Bayo JL, Shirali M, Ghassemian M, Fernandez-Llatas C. Interactive process mining in iot and human behaviour modelling. Interactive process mining in healthcare. 2021; p. 217–231.
10. Sun Y, Song H, Jara AJ, Bie R. Internet of things and big data analytics for smart and connected communities. IEEE access. 2016;4:766–773. doi: 10.1109/ACCESS.2016.2529723
11. Sokhi RS, Moussiopoulos N, Baklanov A, Bartzis J, Coll I, Finardi S, et al. Advances in air quality research–current and emerging challenges. Atmospheric chemistry and physics. 2022;22(7):4615–4703. doi: 10.5194/acp-22-4615-2022
12. Hoek G, Krishnan RM, Beelen R, Peters A, Ostro B, Brunekreef B, et al. Long-term air pollution exposure and cardio-respiratory mortality: a review. Environmental health. 2013;12(1):1–16. doi: 10.1186/1476-069X-12-43
13. Sofwan NM, Mahiyuddin WRW, Latif MT, Ayub NA, Yatim ANM, Mohtar AAA, et al. Risks of exposure to ambient air pollutants on the admission of respiratory and cardiovascular diseases in Kuala Lumpur. Sustainable Cities and Society. 2021;75:103390. doi: 10.1016/j.scs.2021.103390
14. Zhang C, Hu Y, Adams MD, Bu R, Xiong Z, Liu M, et al. Distribution patterns and influencing factors of population exposure risk to particulate matters based on cell phone signaling data. Sustainable Cities and Society. 2023;89:104346. doi: 10.1016/j.scs.2022.104346
15. Fisher K, Gershuny J, Gauthier A. Multinational time use study: user’s guide and documentation. Centre for Time Use Research, University of Oxford. 2012.
16. Menut L, Bessagnet B, Khvorostyanov D, Beekmann M, Blond N, Colette A, et al. CHIMERE 2013: a model for regional atmospheric composition modelling. Geoscientific model development. 2013;6(4):981–1028. doi: 10.5194/gmd-6-981-2013
17. Miao C, Yu S, Hu Y, Bu R, Qi L, He X, et al. How the morphology of urban street canyons affects suspended particulate matter concentration at the pedestrian level: An in-situ investigation. Sustainable Cities and Society. 2020;55:102042. doi: 10.1016/j.scs.2020.102042
18. Bessagnet B, Couvidat F, Lemaire V. A statistical physics approach to perform fast highly-resolved air quality simulations–A new step towards the meta-modelling of chemistry transport models. Environmental Modelling & Software. 2019;116:100–109. doi: 10.1016/j.envsoft.2019.02.017
19. Wang F, Chen D, Cheng S, Li J, Li M, Ren Z. Identification of regional atmospheric PM10 transport pathways using HYSPLIT, MM5-CMAQ and synoptic pressure pattern analysis. Environmental Modelling & Software. 2010;25(8):927–934. doi: 10.1016/j.envsoft.2010.02.004
20. Silibello C, Calori G, Brusasca G, Giudici A, Angelino E, Fossati G, et al. Modelling of PM10 concentrations over Milano urban area using two aerosol modules. Environmental Modelling & Software. 2008;23(3):333–343. doi: 10.1016/j.envsoft.2007.04.002
21. Héroux ME, Anderson HR, Atkinson R, Brunekreef B, Cohen A, Forastiere F, et al. Quantifying the health impacts of ambient air pollutants: recommendations of a WHO/Europe project. International journal of public health. 2015;60:619–627. doi: 10.1007/s00038-015-0690-y
22. Burnett R, Chen H, Szyszkowicz M, Fann N, Hubbell B, Pope CA III, et al. Global estimates of mortality associated with long-term exposure to outdoor fine particulate matter. Proceedings of the National Academy of Sciences. 2018;115(38):9592–9597. doi: 10.1073/pnas.1803222115
23. Lelieveld J, Klingmüller K, Pozzer A, Pöschl U, Fnais M, Daiber A, et al. Cardiovascular disease burden from ambient air pollution in Europe reassessed using novel hazard ratio functions. European heart journal. 2019;40(20):1590–1596. doi: 10.1093/eurheartj/ehz135
24. Van Der Aalst W. Process mining: data science in action. Springer; 2016.
25. Seinfeld JH, Pandis SN. Properties of atmospheric aerosols. In: Atmospheric Chemistry and Physics: From Air Pollution to Climate Change. 2006.
26. Görner P, Simon X, Bémer D, Lidés G. Workplace aerosol mass concentration measurement using optical particle counters. Journal of Environmental Monitoring. 2011; p. 310–317.
27. Butler H, Daly M, Doyle A, Gillies S, Hagen S, Schaub T. The geojson format; 2016. https://www.rfc-editor.org/rfc/rfc7946
28. Guak S, Lee SG, An J, Lee H, Lee K. A model for population exposure to PM2.5: Identification of determinants for high population exposure in Seoul. Environmental Pollution. 2021;285:117406. doi: 10.1016/j.envpol.2021.117406
29. Kodros JK, O’Dell K, Samet JM, L’Orange C, Pierce JR, Volckens J. Quantifying the health benefits of face masks and respirators to mitigate exposure to severe air pollution. GeoHealth. 2021;5(9):e2021GH000482. doi: 10.1029/2021GH000482
30. Rovira J, Domingo JL, Schuhmacher M. Air quality, health impacts and burden of disease due to air pollution (PM10, PM2.5, NO2 and O3): Application of AirQ+ model to the Camp de Tarragona County (Catalonia, Spain). Science of the total environment. 2020;703:135538. doi: 10.1016/j.scitotenv.2019.135538
31. Murray CJ, Abbafati C, Abbas KM, Abbasi M, Abbasi-Kangevari M, Abd-Allah F, et al. Five insights from the global burden of disease study 2019. The Lancet. 2020; p. 1135–1159. doi: 10.1016/S0140-6736(20)31404-5
32. Fernandez-Llatas C. Interactive process mining in healthcare. Springer; 2021.
33. Fernandez-Llatas C, Valdivieso B, Traver V, Benedi JM. Using process mining for automatic support of clinical pathways design. Data mining in clinical medicine. 2015; p. 79–88. doi: 10.1007/978-1-4939-1985-7_5
34. Weijters AJJM, Ribeiro JTS. Flexible heuristics miner (FHM). In: 2011 IEEE symposium on computational intelligence and data mining (CIDM). 2011; p. 310–317. doi: 10.1109/CIDM.2011.5949453
35. de Medeiros AKA, Weijters AJMM, van der Aalst WMP. Genetic process mining: an experimental evaluation. Data Mining and Knowledge Discovery. 2007; p. 245–304.
36. Ibanez-Sanchez G, Fernandez-Llatas C, Martinez-Millana A, Celda A, Mandingorra J, Aparici-Tortajada L, et al. Toward value-based healthcare through interactive process mining in emergency rooms: the stroke case. International journal of environmental research and public health. 2019;16(10):1783. doi: 10.3390/ijerph16101783
37. Fernandez-Llatas C, Lizondo A, Monton E, Benedi JM, Traver V. Process mining methodology for health process tracking using real-time indoor location systems. Sensors. 2015;15(12):29821–29840. doi: 10.3390/s151229769
38. Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949; p. 99–114. doi: 10.2307/3001913
39. Dessimond B, Annesi-Maesano I, Pepin JL, Srairi S, Pau G. Academically produced air pollution sensors for personal exposure assessment: The canarin project. Sensors. 2021;21(5):1876. doi: 10.3390/s21051876
40. Wells EM, Dearborn DG, Jackson LW. Activity change in response to bad air quality, National Health and Nutrition Examination Survey, 2007–2010. PloS one. 2012;7(11):e50526. doi: 10.1371/journal.pone.0050526

PLoS One. doi: 10.1371/journal.pone.0290372.r001

Decision Letter 0

Sathishkumar V E

30 May 2023

PONE-D-23-03914

Sequence-Oriented Sensitive Analysis for PM2.5 exposure and risk assessment using Interactive Process Mining

PLOS ONE

Dear Dr. Illueca Fernandez,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 14 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Sathishkumar V E

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Acknowledgments Section of your manuscript:

This work has been supported by the fellowship 21300/FPI/19 funded by Fundación Séneca and co-funded by HOP Ubiquitous S.L. Región de Murcia (Spain), grant Nº 21681/EFPI/21. This activity has received funding from EIT Health (www.eithealth.eu) ID 220649, the innovation community on Health of the European Institute of Innovation and Technology (EIT), a body of the European Union, under Horizon 2020, the EU Framework Programme for Research and Innovation.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

The author EIF has received funded from Fundacion Séneca (https://fseneca.es/), grant number 21300/FPI/19

The authors have received funded from EIT Health (https://eithealth.eu/), grant number 220649

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

4. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors present an approach to identify exposure-related activities in the context of PM2.5. The approach is sequence-oriented and based, among other things, on the use of Interactive Process Mining. The paper begins with an introduction to PM2.5, the health effects and current approaches and studies PM2.5, the concentration of PM2.5 at specific locations, and the limitation of previous studies because they do not include the factor of time. The authors then explain the state of the art in the area of health impact of particle matter and its relationship with the exposure population. After that, the authors describe their procedure, which includes the generation of the synthetic data on which the evaluation will be based later, the actual sequence-oriented sensitive analysis and the calculation of the KPIs. At the end, the data set, the ANOVA analysis and the Interactive Process Mining and the results are described.

The paper is very well written and the authors have a lot of knowledge in the topic PM2.5 and the state of the art. However, the paper has some weaknesses, the correction of which would improve the paper significantly.

(1) The role of sequence in the analysis is not made clear. What are the implications of sequence and sequential consideration of activities for the overall goal? What are the advantages of sequential representation? Would color highlighting in this way, without edges between nodes lead to similar results?

(2) If I understood correctly, interactive process mining was performed. This always involves experts who evaluate the process models? It is not clear how exactly the evaluation was done. Who carried out the interactive process mining? How was it evaluated? Who derived the suggestions? How was the experiment conducted?

(3) Basics: the understanding of the paper could be significantly supported by explaining basics (either in a separate basics chapter or in an existing chapter).

Minor issues and open questions:

- Figures are partially pixelated and not readable

- The figures should be better described. Percentages at edges in process models were not clear.

- Why do activities appear multiple times in the process models (industry, for example)?

- What were the reasons that led to the selection of the PALIA algorithm?

- A figure describing the process of application and evaluation would facilitate understanding.

- Typo in "oriented sensitive analysis" -> sequention-oriented? semiautomatic -> semi-automatic? adition -> addition? Partially inconsistent British/American English. expsure

It is very interesting to bring techniques from process mining into other domains and possibly (so far) not typical application areas.

Reviewer #2: Good work done for the benefit of the environment using a new tool with potential benefits for different local systems. Very interesting document with important data, adequate statistical validation but presented in a very understandable way. Congratulations.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Aug 24;18(8):e0290372. doi: 10.1371/journal.pone.0290372.r002

Author response to Decision Letter 0

2 Jul 2023

We thank the reviewers for their valuable comments, which have contributed to improving the quality of the manuscript. The suggestions and comments have been closely followed, and revisions have been made accordingly, which we hope meet your expectations.

Please find below the comments of the reviewers, with our response inserted after each comment. Additions to the manuscript are quoted together with some surrounding text to keep the context and are highlighted in italics. Please note that figure, paragraph, and line numbers refer to the manuscript with track changes, not the original one. The response to the editorial comments is included at the end.

Reviewer #1

Q1. The role of sequence in the analysis is not made clear. What are the implications of sequence and sequential consideration of activities for the overall goal? What are the advantages of sequential representation? Would color highlighting in this way, without edges between nodes lead to similar results?

R1. We agree with the reviewer that the role of sequence in the analysis could be explained in more detail. The key point is that sequence analysis considers each activity's contribution and makes it possible to compute the exposure per citizen. This avoids relying on population metrics that could be biased by the sampling process. Such bias can also occur in sequential analysis, but there experts in interactive process mining can track the individual sequences. The analysis “without edges between nodes” can be extracted from the ANOVA results, and the two are observed to be similar, but the strength of the sequential representation is that the results are more explainable. Thus, the interactive process mining workflow is more agile.

To clarify this, the following sentences are added to the introduction.

“In our view, this limits the model considerably and leads to a loss of information because it is not possible to track how pollution affects each individual, and it is not possible to compute individual KPIs. Thus, sequential data and sequence-oriented analysis are critical to overcome these limitations. ”

The following explanation has been added to the Sequence-Oriented Sensitive Analysis subsection, second paragraph.

“The main goal of this work is to perform a sensitivity analysis to reduce the exposure and the health outcomes by taking action on the activities. In other words, the main goal is to identify which activities contribute most to population exposure. Sequence analysis is essential for our proposed optimisation because it avoids population metrics that the sampling process could bias. Thus, it allows for performing analyses for different population sectors - for instance, analysing only the persons affected by cardiac issues - or even tracking and optimising the activities of a single individual [1]. In addition, the classical sensitive analysis is performed by running different simulations over the inputs to find the input values that minimise a set of KPIs. The computational cost of this sensitive analysis could be high due to the number of input parameters to consider. Thus, we propose in this paper a new methodology to assess an oriented sensitive analysis using interactive process mining (IPM).”

Lastly, the following text was added to the third paragraph of the discussion.

“This sequence-oriented sensitive analysis presents several advantages over the classical methodologies. The main strength is the use of sequence analysis and sequence data, because experts in interactive process mining can track the individual sequences. The analysis without edges between nodes can be extracted from the ANOVA results, and the two are observed to be similar, but the strength here is that the results are more explainable. For instance, by studying the sequence track it is possible to conclude that acting on building infrastructure is quite effective because it is a pattern that appears in all the sequences. The same applies to studying less successful scenarios. Thus, the interactive process mining workflow is more agile, explainable, and human-oriented, avoiding the black box effect. This new methodology shows the target parameters, so it is only necessary to iterate on these few inputs, whereas the traditional approach iterates over all the parameters to find the best combination. However, this hypothesis should be tested by conducting a pilot with real data. This approach allows extrapolation of our method for forecasting purposes, using chemistry transport models to generate predictions in gridded domains [2], and the analysis can also be performed for other pollutants, as has been done in Paris [3]. Furthermore, the most critical point is that sequence-oriented sensitive analysis is a more intelligent approach based on understandable models and human criteria. However, this methodology has some potential limitations: it is possible to miss some crucial parameters in the exposure contribution, and the outcomes are susceptible to wrong human interpretations.”

[1] Zhang L, Guo C, Jia X, Xu H, Pan M, et al. (2018) Personal exposure measurements of school-children to fine particulate matter (PM2.5) in winter of 2013, Shanghai, China. PLOS ONE 13(4): e0193586. https://doi.org/10.1371/journal.pone.0193586

[2] Menut L, Bessagnet B, Khvorostyanov D, Beekmann M, Blond N, Colette A, et al. CHIMERE 2013: a model for regional atmospheric composition modelling. Geoscientific model development. 2013;6(4):981–1028.

[3] Valari M, Markakis K, Powaga E, Collignan B, Perrussel O. EXPLUME v1. 0: a model for personal exposure to ambient O3 and PM2.5. Geoscientific Model Development. 2020;13(3):1075–1094.
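
To make the role of the sequence in R1 more concrete, the following is a minimal sketch of how a per-citizen exposure and the contribution of each activity could be derived from an activity sequence. It is not the implementation used in the manuscript: the time-weighted mean used as the exposure, the activity names, the concentration values, and the function names are all assumptions introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical PM2.5 concentrations (µg/m³) per activity; illustrative values only,
# not taken from the manuscript or from the Valencia dataset.
CONCENTRATION = {"home": 8.0, "commute": 25.0, "work": 12.0}


def personal_exposure(sequence):
    """Time-weighted mean concentration over a sequence of (activity, hours) pairs.

    Assumption: exposure is approximated as the time-weighted average of the
    concentration associated with each activity in the sequence.
    """
    total_hours = sum(hours for _, hours in sequence)
    dose = sum(CONCENTRATION[activity] * hours for activity, hours in sequence)
    return dose / total_hours


def activity_contributions(sequence):
    """Share of the cumulative (concentration x time) attributable to each activity."""
    dose = defaultdict(float)
    for activity, hours in sequence:
        dose[activity] += CONCENTRATION[activity] * hours
    total = sum(dose.values())
    return {activity: value / total for activity, value in dose.items()}


# One hypothetical citizen-day expressed as an ordered activity sequence.
day = [("home", 8), ("commute", 1), ("work", 8), ("commute", 1), ("home", 6)]
exposure = personal_exposure(day)
shares = activity_contributions(day)
```

Because the computation is done per sequence, the same two functions could be applied to a single individual or to any subgroup of citizens, which is the property R1 attributes to sequence analysis.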

Q2. If I understood correctly, interactive process mining was performed. This always involves experts who evaluate the process models? It is not clear how exactly the evaluation was done. Who carried out the interactive process mining? How was it evaluated? Who derived the suggestions? How was the experiment conducted?

R2. We agree with the reviewer that interactive process mining always involves an evaluation by a team of experts, and this has been done in this work. The approach has been validated with urban planning experts from Libelium, an IoT company that works on sustainability impact assessment in Smart Cities. The interactive process mining has been performed by Eduardo Illueca Fernández (UM) and Carlos Fernández Llatas (ITACA-SABIEN), supervised by Fernando Seoane Martinez (KI). Antonio Jesús Jara Valera (Libelium) derived the suggestions for the target scenario as an expert in sustainability impact assessment, and the validation has been performed by computing the KPIs again and obtaining improvements.

To clarify this, the following paragraph (third paragraph) and Figure 2 have been added to the manuscript.

“Interactive process mining is a methodology that requires an iterative human validation by domain experts through the so-called data rodeo. In this way, the domain experts can i) iteratively define a process indicator according to the sensitive analysis goals, ii) analyse and validate the process indicator and iii) be trained in using the process indicator. This process is guided by an interactive process indicator (IPI) - in this case, the IPI can be personal exposure or mortality relative risk. Once an IPI is defined, the domain expert is ready to analyse the data using an interactive process mining tool. This process is repeated until the target scenarios are reached, which should be validated by computing the KPIs again and checking that the optimisation goal is achieved, in this case, reducing personal exposure and relative risk. In this work, the involved stakeholders are Libelium as domain experts - an IoT company that provides sustainability impact assessment solutions in smart cities - and data scientists from Karolinska Institutet, ITACA-Sabien, and the University of Murcia. The interactive process mining workflow is summarised in Figure 2. [4] ”

Figure 2. The first step of the IP is the preparation phase, where the simulations are performed; then, the generated data is processed through interactive process mining in constant discussion with domain experts, and the proposed scenarios are validated with domain experts once the KPIs are calculated again.

[4] Lull JJ, Bayo JL, Shirali M, Ghassemian M, Fernandez-Llatas C. Interactive process mining in IOT and human behaviour modelling. Interactive process mining in healthcare. 2021; p. 217–231
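
The validation step described in R2, recomputing the KPIs for a proposed scenario and checking that they improve on the baseline, can be sketched as follows. This is only an illustration under stated assumptions: the log-linear concentration-response form, the relative risk per 10 µg/m³, the exposure values, and every function name are hypothetical and are not taken from the manuscript.

```python
import math

# Illustrative assumption: relative risk per 10 µg/m³ increase in PM2.5.
# Placeholder value, not the coefficient used in the manuscript.
RR_PER_10 = 1.06
BETA = math.log(RR_PER_10) / 10.0


def relative_risk(exposure, reference=0.0):
    """Assumed log-linear concentration-response function: RR = exp(beta * (C - C0))."""
    return math.exp(BETA * max(exposure - reference, 0.0))


def kpis(exposures):
    """Compute the two KPIs for a list of per-citizen exposures (µg/m³)."""
    mean_exposure = sum(exposures) / len(exposures)
    return {"exposure": mean_exposure, "relative_risk": relative_risk(mean_exposure)}


def improves(baseline, scenario):
    """A scenario is accepted only if every KPI is strictly lower than the baseline."""
    return all(scenario[name] < baseline[name] for name in baseline)


# Hypothetical per-citizen exposures before and after a mitigation scenario.
baseline = kpis([12.0, 15.0, 9.0])
scenario = kpis([10.0, 12.0, 8.0])
accepted = improves(baseline, scenario)
```

In an interactive workflow this check would sit at the end of each iteration: if `accepted` is false, the experts return to the process maps and propose a different scenario.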

Q3. Basics: the understanding of the paper could be significantly supported by explaining basics (either in a separate basics chapter or in an existing chapter).

R3. We agree with the reviewer that explaining basic concepts can help readers understand the work. For this reason, we have added a first subsection in the Materials and Methods chapter, Fundamentals and Basic Concepts, that we attach hereafter.

“The atmosphere, whether in urban or remote areas, contains many suspended aerosol particles. From a physical point of view, particulate matter (PM) is a mixture of solid particles and liquid droplets found in the air. Some particles, such as dust, dirt, soot, or smoke, are large or dark enough to be seen with the naked eye. Others are so small they can only be detected using an electron microscope. This wide size range can be appreciated by considering that the mass of one 10 μm particle is equivalent to the mass of one billion 10 nm particles. Thus, working with each particle as a single entity is difficult. It is necessary to work with particle populations characterised by a cumulative size distribution, defined as the fraction of particles that are smaller than or equal to a given size. From this cumulative distribution, the concepts PM10, PM2.5, and PM1 arise, which are particles smaller than or equal to 10 μm, 2.5 μm, and 1 μm, respectively [5].

This work focuses on the PM2.5 population - also called the PM2.5 fraction - because of its impact on health. A percentage of a particle population can pass through the alveolar barrier in the lungs and reach the bloodstream. This percentage of particles is the respirable fraction responsible for adverse health effects and premature deaths. The size of these respirable particles depends on their chemical composition and the body weight of the individual. The percentage of particles smaller than 2.5 μm that can pass the alveolar barrier is close to 100 %, showing the need to focus mitigation policies on this dangerous pollutant [6].

To elaborate an effective strategy, it is necessary to measure the impact of a concentration of PM2.5 in the air. This is done by computing KPIs. Exposure is the KPI closest to concentration, and it has been previously defined as the number of particles a citizen is exposed to in an interval of time. This amount of pollutant affects the individual, and its effect can be modelled like that of a drug or toxic substance. In this sense, concentration-response functions act as a model that measures this effect on a specific outcome - generally, a pathology. There are many concentration-response functions that link PM2.5 with several health issues, but this work will focus on the effect of PM2.5 on premature mortality [7]. ”

[5] Seinfeld JH, Pandis SN. Properties of atmospheric aerosols. Atmospheric Chemistry and Physics: From Air Pollution to Climate Change. 2006.

[6] Görner P, Simon X, Bémer D, Lidés G. Workplace aerosol mass concentration measurement using optical particle counters. Journal of Environmental Monitoring. 2011; pp 310–317.

[7] Heroux ME, Anderson HR, Atkinson R, Brunekreef B, Cohen A, Forastiere F, et al. Quantifying the health impacts of ambient air pollutants: recommendations of a WHO/Europe project. International journal of public health. 2015;60:619–627.
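
The mass equivalence quoted in R3 (one 10 μm particle against one billion 10 nm particles) can be checked with a short worked calculation. The sketch assumes spherical particles of equal density, which the quoted text does not state explicitly; the symbols m, ρ, and d are introduced here only for the derivation.

```latex
% Mass of a spherical particle of diameter d and density \rho
m(d) = \rho \, \frac{\pi}{6} \, d^{3}

% Ratio of the masses of a 10 \mu m particle and a 10 nm particle
\frac{m(10\,\mu\mathrm{m})}{m(10\,\mathrm{nm})}
  = \left( \frac{10 \times 10^{-6}\,\mathrm{m}}{10 \times 10^{-9}\,\mathrm{m}} \right)^{3}
  = \left( 10^{3} \right)^{3}
  = 10^{9}
```

Under these assumptions the density and the geometric factor cancel, so the ratio depends only on the cube of the diameter ratio, which gives the one billion stated in the text.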

Q4. Figures are partially pixelated and not readable

R4. We would like to thank the reviewer for highlighting this. All the figures in the manuscript have been re-edited to guarantee high-quality resolution and checked with the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/, suggested by PLOS ONE. We also noticed that, when the figures were uploaded, the files for Figure 5 and Figure 6 were duplicated (Figure 6 was uploaded twice), and we have now corrected this.

Q5. The figures should be better described. Percentages at edges in process models were not clear.

R5. We agree with the reviewer that a more exhaustive description could be added to the figures. The manuscript has been updated with extended captions. Percentages at edges in process maps were removed to improve the readability of the figures.

Q6. What were the reasons that led to the selection of the PALIA algorithm?

R6. We agree with the reviewer that the reasons for selecting PALIA are not justified in the manuscript. In general terms, we chose the PALIA algorithm after a bibliographic analysis, in which we found that it best fits the work’s goals. Some sentences have been added to the fourth paragraph of the Sequence-Oriented Sensitive Analysis subsection to clarify this:

“In this work, we use the PALIA algorithm, implemented by the Institute of Information and Communication Technologies (ITACA) of the Universidad Politecnica de Valencia, Valencia, Spain. This algorithm is the most appropriate one for our goals, because it is based on activity-based process mining and produces explainable process maps [8]. In addition, it performs better, in terms of efficacy, than other process mining algorithms, such as heuristic miner [9] or genetic process mining [10]. ”

[8] Fernandez-Llatas C, Valdivieso B, Traver V, Benedi JM. Using process mining for automatic support of clinical pathways design. Data mining in clinical medicine. 2015; p. 79–88.

[9] Weijters AJMM, Ribeiro JTS. Flexible heuristics miner (FHM). In: 2011 IEEE symposium on computational intelligence and data mining (CIDM). 2011; pp 310–317.

[10] de Medeiros AKA, Weijters AJMM, van der Aalst WMP. Genetic process mining: an experimental evaluation. Data Min Knowl Discov. 2007; 14(2):245–304

Q7. A figure describing the process of application and evaluation would facilitate understanding.

R7. We agree with the reviewer that an additional figure will clarify the process of application and evaluation of interactive process mining. For this reason, we have added a new Figure 2 in the Sequence-Oriented Sensitive Analysis subsection, as explained in more detail in R2.

Q8. Typo in "oriented sensitive analysis" -> sequention-oriented? semiautomatic -> semi-automatic? adition -> addition? Partially inconsistent British/American English. expsure

R8. We thank the reviewer for highlighting these typos, which have been corrected to ensure consistent language in the manuscript.

Reviewer #2

This reviewer did not ask for changes in the manuscript.

Editorial comments

QE1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

RE1. We would like to thank the editor for highlighting the importance of format requirements. We have reviewed the format of our manuscript carefully and applied changes when necessary.

QE2. Thank you for stating the following in the Acknowledgments Section of your manuscript:

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

The author EIF has received funded from Fundacion Séneca (https://fseneca.es/), grant number 21300/FPI/19. The authors have received funded from EIT Health (https://eithealth.eu/), grant number 220649. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

RE2. We would like to thank the editor for noting this, and we have removed the acknowledgement section and moved the funding information to the cover letter, according to your instructions. The following lines have been added to the cover letter.

The author EIF has received funding from Fundacion Séneca (https://fseneca.es/), grant number 21300/FPI/19. The authors have received funding from EIT Health (https://eithealth.eu/), grant number 220649. The funders had no role in study design, data collection and analysis, the decision to publish, or the preparation of the manuscript

QE3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

RE3. Dear Editor, we have uploaded our data and code to a Zenodo repository with the following digital object identifier: https://doi.org/10.5281/zenodo.8079155 [11], and it is indexed in OpenAIRE. This information has been updated in the code availability section of the manuscript and in the cover letter, as you suggested in your feedback.

[11] Illueca Fernández E, Fernandez Llatas C, Jara Valera AJ, Fernández Breis JT, Seoane Martinez F. Sequence Oriented Process Mining (v1.0.0). Zenodo. 2023; https://doi.org/10.5281/zenodo.8079155

QE4. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

RE4. Dear editor, we have performed an exhaustive revision of the whole bibliography to check that there are no retracted articles. To clarify this, DOIs have been added to all the references when possible. The new articles added to the manuscript have been specified in this rebuttal letter in their corresponding answer. The following paper has been removed from the manuscript as it does not have a consistent DOI:

Olsson D, Brabäck L, Forsberg B. Air pollution exposure during pregnancy and infancy and childhood asthma. European Respiratory Journal. 2014;44(Suppl 58).

In addition, the reference “of Disease Collaborative Network GB. Global Burden of Disease Study 2019 (GBD 2019) Results. Seattle, United States: Institute for Health Metrics and Evaluation (IHME); 2020.” has been changed to the following one, which summarises the main insights of this study.

Murray CJ, Abbafati C, Abbas KM, Abbasi M, Abbasi-Kangevari M, Abd-Allah F, et al. Five insights from the global burden of disease study 2019. The Lancet. 2020; p. 1135–1159.

Yours sincerely,

Eduardo Illueca

Attachment

Submitted filename: Rebutal Letter.pdf (PDF, 256.2 KB)

PLoS One. doi: 10.1371/journal.pone.0290372.r003

Decision Letter 1

Sathishkumar V E

8 Aug 2023

Sequence-Oriented Sensitive Analysis for PM2.5 exposure and risk assessment using Interactive Process Mining

PONE-D-23-03914R1

Dear Dr. Illueca Fernandez,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sathishkumar V E

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

PLoS One. doi: 10.1371/journal.pone.0290372.r004

Acceptance letter

Sathishkumar V E

11 Aug 2023

PONE-D-23-03914R1

Sequence-Oriented Sensitive Analysis for PM2.5 exposure and risk assessment using Interactive Process Mining

Dear Dr. Illueca Fernández:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sathishkumar V E

Academic Editor

PLOS ONE

Associated Data

Supplementary Materials

S1 Data. Air quality data. (CSV, 40.4 MB)

S2 Data. Population data. (CSV, 1.1 MB)

S3 Data. Base scenario sequences. (CSV, 19.6 MB)

S4 Data. Scenario 1 sequences. (CSV, 19.5 MB)

S5 Data. Scenario 2 sequences. (CSV, 19.5 MB)

S6 Data. Scenario 3 sequences. (CSV, 19.5 MB)

S7 Data. Scenario 4 sequences. (CSV, 19.4 MB)

S8 Data. Scenario 5 sequences. (CSV, 19.5 MB)

Data Availability Statement

All data and code files are available from the Zenodo repository (http://doi.org/10.5281/zenodo.8079155).

1. World Health Organization. Review of evidence on health aspects of air pollution: REVIHAAP project: technical report. World Health Organization. Regional Office for Europe; 2021.

2. Ortiz A, Guerreiro C, Soares J, et al. Air quality in Europe-2020 report. European Environment Agency. 2020; p. 164–164.

3. Cohen AJ, Brauer M, Burnett R, Anderson HR, Frostad J, Estep K, et al. Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: an analysis of data from the Global Burden of Diseases Study 2015. The Lancet. 2017;389(10082):1907–1918. doi: 10.1016/S0140-6736(17)30505-6

4. Pascal M, de Crouy Chanel P, Wagner V, Corso M, Tillier C, Bentayeb M, et al. The mortality impacts of fine particles in France. Science of the Total Environment. 2016;571:416–425. doi: 10.1016/j.scitotenv.2016.06.213

5. Wu PC, Cheng TJ, Kuo CP, Fu JS, Lai HC, Chiu TY, et al. Transient risk of ambient fine particulate matter on hourly cardiovascular events in Tainan City, Taiwan. PloS one. 2020;15(8):e0238082. doi: 10.1371/journal.pone.0238082

6. Klepeis NE. Modeling human exposure to air pollution. Human exposure analysis. 2006; p. 445–470. doi: 10.1201/9781420012637.ch19

7. Limaye VS, Schöpp W, Amann M. Applying integrated exposure-response functions to PM2.5 pollution in India. International Journal of Environmental Research and Public Health. 2019;16(1):60. doi: 10.3390/ijerph16010060

8. Valari M, Markakis K, Powaga E, Collignan B, Perrussel O. EXPLUME v1.0: a model for personal exposure to ambient O3 and PM2.5. Geoscientific Model Development. 2020;13(3):1075–1094. doi: 10.5194/gmd-13-1075-2020

9. Lull JJ, Bayo JL, Shirali M, Ghassemian M, Fernandez-Llatas C. Interactive process mining in iot and human behaviour modelling. Interactive process mining in healthcare. 2021; p. 217–231.

10. Sun Y, Song H, Jara AJ, Bie R. Internet of things and big data analytics for smart and connected communities. IEEE access. 2016;4:766–773. doi: 10.1109/ACCESS.2016.2529723

11. Sokhi RS, Moussiopoulos N, Baklanov A, Bartzis J, Coll I, Finardi S, et al. Advances in air quality research–current and emerging challenges. Atmospheric chemistry and physics. 2022;22(7):4615–4703. doi: 10.5194/acp-22-4615-2022

12. Hoek G, Krishnan RM, Beelen R, Peters A, Ostro B, Brunekreef B, et al. Long-term air pollution exposure and cardio-respiratory mortality: a review. Environmental health. 2013;12(1):1–16. doi: 10.1186/1476-069X-12-43

13. Sofwan NM, Mahiyuddin WRW, Latif MT, Ayub NA, Yatim ANM, Mohtar AAA, et al. Risks of exposure to ambient air pollutants on the admission of respiratory and cardiovascular diseases in Kuala Lumpur. Sustainable Cities and Society. 2021;75:103390. doi: 10.1016/j.scs.2021.103390

14. Zhang C, Hu Y, Adams MD, Bu R, Xiong Z, Liu M, et al. Distribution patterns and influencing factors of population exposure risk to particulate matters based on cell phone signaling data. Sustainable Cities and Society. 2023;89:104346. doi: 10.1016/j.scs.2022.104346

15. Fisher K, Gershuny J, Gauthier A. Multinational time use study: user’s guide and documentation. Centre for Time Use Research, University of Oxford. 2012.

16. Menut L, Bessagnet B, Khvorostyanov D, Beekmann M, Blond N, Colette A, et al. CHIMERE 2013: a model for regional atmospheric composition modelling. Geoscientific model development. 2013;6(4):981–1028. doi: 10.5194/gmd-6-981-2013

17. Miao C, Yu S, Hu Y, Bu R, Qi L, He X, et al. How the morphology of urban street canyons affects suspended particulate matter concentration at the pedestrian level: An in-situ investigation. Sustainable Cities and Society. 2020;55:102042. doi: 10.1016/j.scs.2020.102042

18. Bessagnet B, Couvidat F, Lemaire V. A statistical physics approach to perform fast highly-resolved air quality simulations–A new step towards the meta-modelling of chemistry transport models. Environmental Modelling & Software. 2019;116:100–109. doi: 10.1016/j.envsoft.2019.02.017

19. Wang F, Chen D, Cheng S, Li J, Li M, Ren Z. Identification of regional atmospheric PM10 transport pathways using HYSPLIT, MM5-CMAQ and synoptic pressure pattern analysis. Environmental Modelling & Software. 2010;25(8):927–934. doi: 10.1016/j.envsoft.2010.02.004

20. Silibello C, Calori G, Brusasca G, Giudici A, Angelino E, Fossati G, et al. Modelling of PM10 concentrations over Milano urban area using two aerosol modules. Environmental Modelling & Software. 2008;23(3):333–343. doi: 10.1016/j.envsoft.2007.04.002

21. Héroux ME, Anderson HR, Atkinson R, Brunekreef B, Cohen A, Forastiere F, et al. Quantifying the health impacts of ambient air pollutants: recommendations of a WHO/Europe project. International journal of public health. 2015;60:619–627. doi: 10.1007/s00038-015-0690-y

22. Burnett R, Chen H, Szyszkowicz M, Fann N, Hubbell B, Pope CA III, et al. Global estimates of mortality associated with long-term exposure to outdoor fine particulate matter. Proceedings of the National Academy of Sciences. 2018;115(38):9592–9597. doi: 10.1073/pnas.1803222115

23. Lelieveld J, Klingmüller K, Pozzer A, Pöschl U, Fnais M, Daiber A, et al. Cardiovascular disease burden from ambient air pollution in Europe reassessed using novel hazard ratio functions. European heart journal. 2019;40(20):1590–1596. doi: 10.1093/eurheartj/ehz135

24. Van Der Aalst W. Process mining: data science in action. Springer; 2016.

25. Seinfeld JH, Pandis SN. Properties of atmospheric aerosols. Atmospheric Chemistry and Physics: From Air Pollution to Climate Change. 2006.

26. Görner P, Simon X, Bémer D, Lidés G. Workplace aerosol mass concentration measurement using optical particle counters. Journal of Environmental Monitoring. 2011; pp 310–317.

27. Butler H, Daly M, Doyle A, Gillies S, Hagen S, Schaub T. The GeoJSON format; 2016. https://www.rfc-editor.org/rfc/rfc7946

28. Guak S, Lee SG, An J, Lee H, Lee K. A model for population exposure to PM2.5: Identification of determinants for high population exposure in Seoul. Environmental Pollution. 2021;285:117406. doi: 10.1016/j.envpol.2021.117406

29. Kodros JK, O’Dell K, Samet JM, L’Orange C, Pierce JR, Volckens J. Quantifying the health benefits of face masks and respirators to mitigate exposure to severe air pollution. GeoHealth. 2021;5(9):e2021GH000482. doi: 10.1029/2021GH000482

30. Rovira J, Domingo JL, Schuhmacher M. Air quality, health impacts and burden of disease due to air pollution (PM10, PM2.5, NO2 and O3): Application of AirQ+ model to the Camp de Tarragona County (Catalonia, Spain). Science of the total environment. 2020;703:135538. doi: 10.1016/j.scitotenv.2019.135538

31. Murray CJ, Abbafati C, Abbas KM, Abbasi M, Abbasi-Kangevari M, Abd-Allah F, et al. Five insights from the Global Burden of Disease Study 2019. The Lancet. 2020; p. 1135–1159. doi: 10.1016/S0140-6736(20)31404-5

32. Fernandez-Llatas C. Interactive process mining in healthcare. Springer; 2021.

33. Fernandez-Llatas C, Valdivieso B, Traver V, Benedi JM. Using process mining for automatic support of clinical pathways design. Data mining in clinical medicine. 2015; p. 79–88. doi: 10.1007/978-1-4939-1985-7_5

34. Weijters AJJM, Ribeiro JTS. Flexible heuristics miner (FHM). In: 2011 IEEE symposium on computational intelligence and data mining (CIDM). 2011; pp 310–317. doi: 10.1109/CIDM.2011.5949453

35. de Medeiros AKA, Weijters AJMM, van der Aalst WMP. Genetic process mining: an experimental evaluation. Data Mining and Knowledge Discovery. 2007; pp 245–304.

36. Ibanez-Sanchez G, Fernandez-Llatas C, Martinez-Millana A, Celda A, Mandingorra J, Aparici-Tortajada L, et al. Toward value-based healthcare through interactive process mining in emergency rooms: the stroke case. International journal of environmental research and public health. 2019;16(10):1783. doi: 10.3390/ijerph16101783

37. Fernandez-Llatas C, Lizondo A, Monton E, Benedi JM, Traver V. Process mining methodology for health process tracking using real-time indoor location systems. Sensors. 2015;15(12):29821–29840. doi: 10.3390/s151229769

38. Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949; p. 99–114. doi: 10.2307/3001913

39. Dessimond B, Annesi-Maesano I, Pepin JL, Srairi S, Pau G. Academically produced air pollution sensors for personal exposure assessment: The canarin project. Sensors. 2021;21(5):1876. doi: 10.3390/s21051876

40. Wells EM, Dearborn DG, Jackson LW. Activity change in response to bad air quality, National Health and Nutrition Examination Survey, 2007–2010. PloS one. 2012;7(11):e50526. doi: 10.1371/journal.pone.0050526
