Abstract
This research used data mining approaches to better understand factors affecting the formation of secondary organic aerosol (SOA). Although numerous laboratory and computational studies have been completed on SOA formation, it is still challenging to determine the factors that most influence SOA formation. Experimental data were based on previous work described by Offenberg et al. (2017), where volume concentrations of SOA were measured in 139 laboratory experiments involving the oxidation of single hydrocarbons under different operating conditions. Three data mining methods were used: nearest neighbor, decision tree, and pattern mining. Both decision tree and pattern mining approaches identified similar chemical and experimental conditions that were important to SOA formation. These important factors included the number of methyl groups for the SOA precursor, the number of rings for the SOA precursor, and the presence of dinitrogen pentoxide (N2O5).
Keywords: SOA formation, chamber experiment, nearest neighbor, decision tree, pattern mining
1. Introduction
Data mining is a growing field of data analysis, with applications ranging from business and banking to medicine (Han et al., 2012). Also known as knowledge discovery from data (KDD), data mining involves a logical ordering or grouping of complex, relational data that can potentially reveal intrinsic patterns in the underlying dataset. Because data mining techniques can be scaled to the size and dimensionality of the dataset, and because these techniques can describe both categorical and numerical information, data mining is particularly well-suited for the high dimensionality and complexity that is typical of scientific fields like atmospheric chemistry. Specifically, we use data mining techniques to better understand the formation of secondary organic aerosol (SOA).
Previous research from both experimental and computational studies has noted the high level of complexity of SOA formation (e.g., as reviewed in Gentner et al., 2017; Ziemann and Atkinson, 2012; Ervens et al., 2011; Hallquist et al., 2009; Kroll and Seinfeld, 2008). In addition, existing laboratory studies on SOA formation are often limited in terms of the range of conditions studied, and mostly focus on a small number of SOA precursors (e.g., isoprene, α-pinene, toluene). A systematic laboratory study that considers all potential permutations relevant to SOA formation in the atmosphere would be infeasible. Because organic aerosol depends on location, environmental conditions, and chemical concentrations and properties of numerous potential precursors and oxidants, SOA formation in the atmosphere is layered with the complexity and embedded dependencies that could be addressed through data mining approaches.
Using data mining methods, our intent here is to use a data-driven approach to identify factors most important to SOA formation. The findings from this analysis would be beneficial to atmospheric chemistry research in two ways. First, existing studies on SOA formation are usually based on a limited range of experimental conditions, such that an assessment of a study’s broader implications is often speculative in nature. Second, results from this analysis can provide insight on potential operating conditions for future experimental studies. Experimental conditions are typically chosen based on a limited number of broad categories, including the choice of SOA precursor, the choice of oxidant, and initial conditions in the laboratory chamber. Future experimental work would benefit from an analysis that systematically assesses these factors relative to SOA formation.
The purpose of this paper is to use data mining techniques to better understand SOA formation. We use several established methods (nearest neighbor, decision trees, and pattern mining) to identify factors important to SOA formation. A total of 139 experiments on SOA formation are reported from an EPA laboratory chamber operated under controlled conditions. This analysis thus represents an extensive database of experimental studies completed over several years from a wide range of compound classes, including alkanes, branched alkanes, alcohols, aromatics, dienes, branched dienes, and terpenes.
2. Methods
2.1. Experimental description
Controlled experiments of photochemical reactions were completed in a solid walled chamber (Edney et al., 2005). The 14.5-m3 chamber was TFE Teflon coated and operated under steady-state conditions. Results from 139 experiments included the following compounds or compound classes: isoprene, monoterpenes, sesquiterpenes, aromatic compounds, n-alkanes, and oxygenated compounds. A detailed listing of precursor hydrocarbons used and associated experimental conditions is given elsewhere (Offenberg et al., 2017). Hydrocarbon precursors were either injected from high pressure cylinders of neat compounds in air via mass flow controllers, from air passed through an impinger consisting of neat liquid, or from direct injection of neat liquid via a syringe pump. Experiments used one of the following oxidants: nitrogen oxides (NOx), hydrogen peroxide (H2O2), ozone (O3), or dinitrogen pentoxide (N2O5). Nitrogen oxides were added to the chamber via pressurized cylinders (as nitrogen oxide in nitrogen). Hydrogen peroxide was added by injection of H2O2 solution into a heated bulb flushed with approximately 3 L/min of clean air. Ozone was injected into the chamber by flowing clean dry air over an ozone-generating UV lamp. Dinitrogen pentoxide was added by passing air through the headspace of a chilled impinger containing solid N2O5 crystals. Steady-state concentrations of NOx in the chamber ranged from 59 to 632 ppbv for that oxidant condition, while those for O3 ranged from 125 to 200 ppbv. Typical values of the initial oxidant concentrations ranged from 2.5 to 10 ppmv for H2O2 and from 0.02 to 0.58 ppmv for N2O5.
A gas chromatograph with flame ionization detector (GC-FID, Hewlett-Packard Model 5890 GC) was used to measure concentrations of precursor hydrocarbon during experiments. A scanning mobility particle sizer (SMPS) was used to measure aerosol size distributions (model 3071A; TSI, Inc., Shoreview, MN). The following operating conditions were used for the SMPS: sample flow of 0.2 L/min; sheath flow of 2 L/min; and scans from 16 to 982 nm. All experiments used ammonium sulfate as seed aerosol from a 10 mg/L aqueous solution of ammonium sulfate (model 9032; TSI, Inc., Shoreview, MN) that was nebulized. As previously reported by Offenberg et al. (2017), the resulting distributions of seed aerosols had a volume mode diameter that was generally between 100 and 400 nm.
2.2. Data analysis
A total of 16 variables, including 15 inputs and 1 output, were considered for this analysis (Table 1). All 16 variables were reported for the 139 experiments used for this research. These same 16 variables were also reported previously (Offenberg et al., 2017); we have assumed the same variable list to maintain consistency between the two studies. The variables shown in Table 1 included binary values, continuous values, and integers. Four of the input variables were binary. These included the presence (or absence) of specific oxidants injected into the laboratory chamber during an experiment: NOx, H2O2, O3, or N2O5. Four of the input variables were real-valued continuous data. These included the relative humidity in the chamber (%), the volumetric residence time of the chamber (hr), the concentration of reacted hydrocarbon (ΔHC, ppmC), and the molecular weight of the SOA precursor (g/mol). An additional continuous variable, the initial NOx concentration in the chamber, was used as a binary variable (present/not present) to be consistent with previous work as described in Offenberg et al. (2017). Seven of the input variables were integers. These included number values describing the following chemical characteristics of the precursor hydrocarbon: the number of carbon atoms, the number of internal double bonds, the number of external double bonds, the number of oxygen atoms, the number of rings, the number of aromatic rings, and the number of methyl groups. One additional integer variable, the number of ethyl groups, was present in only a small number of experiments (n = 3). For this reason, the number of methyl groups and ethyl groups were combined as one aggregate value and will hereafter be referred to as the number of methyl groups. An additional variable, the latent heat of vaporization, was measured for all 139 experiments and is discussed extensively by Offenberg et al. (2017).
Table 1.
Summary of variables used in data mining analyses.
| Variable | Type | Range | Low categoryᵃ | High categoryᵃ |
|---|---|---|---|---|
| NOx present | Binary | 0,1 | off | on |
| H2O2 present | Binary | 0,1 | off | on |
| O3 present | Binary | 0,1 | off | on |
| N2O5 present | Binary | 0,1 | off | on |
| Relative humidity (%) | real | 2-50 | <30 | ≥30 |
| Residence time (hr) | real | 3.4-10.1 | <4.2 | ≥4.2 |
| ΔHC (ppmC) | real | 0.1-19.7 | <2.2 | ≥2.2 |
| Molecular weight (g/mol) | real | 54.09-282.46 | <128.2 | ≥128.2 |
| Volume concentration of aerosol (nL/m3) | real | 2.2-382.1 | <64 | ≥64 |
| number of carbon atoms | integer | 4-18 | <10 | ≥10 |
| number of internal double bonds | integer | 0-5 | <2 | ≥2 |
| number of external double bonds | integer | 0-2 | =0 | ≥1 |
| number of oxygen atoms | integer | 0-2 | =0 | ≥1 |
| number of rings | integer | 0-3 | =0 | ≥1 |
| number of aromatic rings | integer | 0-2 | =0 | ≥1 |
| number of methyl groups | integer | 0-4 | <2 | ≥2 |
ᵃ “High” and “low” categories were determined based on the median value of each variable.
We selected SOA volume as the output variable and not SOA mass (or yield) for two reasons. First, mass determinations for this study are based on SMPS measurements, meaning that converting volume to SOA mass requires multiple assumptions (e.g., spherical geometry and unit density). Our intent here is to apply data mining methods with minimal user-driven correction factors. Second, our previous analysis of enthalpy was determined from the SOA volumes (Offenberg et al., 2017), meaning that both analyses can be reported on a consistent basis. Measured wall losses for this experimental system were approximately 0.06 hr−1. Because of the relatively low rate of wall losses, coupled with the fact that experiments were completed under steady-state flow conditions, uncorrected particle volumes will be used for the remainder of this paper. This approach is consistent with our intent to have minimal user-driven correction factors and is also consistent with our previous analysis on heats of vaporization (Offenberg et al., 2017).
The input variables from Table 1 are also summarized in Figures S1-S8 (Supporting Information) in the form of frequency plots. One observation from these frequency plots is the non-uniform distribution of input values for most of the variables. While a uniform sampling approach may be preferred for some statistical methods, it is not necessary for the analyses described herein. All data mining methods in this paper are based on the relative amounts of a given variable, meaning that oversampling from a particular input variable will not bias resulting analyses.
The variables listed in Table 1 are in general poorly correlated with one another. Most variable pairs had r² values less than 0.5. The few exceptions to this pattern were for variable pairs that are implicitly related, such as the number of carbons and molecular weight (r² = 0.93). In addition, the variables in Table 1 are not numerically consistent, both in terms of their ranges and mathematical types. For several of these variables, the range of possible values covers multiple orders of magnitude. These numerical discrepancies directly impact the interpretation of factors affecting SOA formation. For example, separate categories for hydrocarbon precursors having nearly identical carbon numbers are unlikely to be physically meaningful. Furthermore, using some combination of the discretely-valued inputs in Table 1 is unlikely to accurately produce a regressed continuous output variable such as aerosol mass. For these reasons, a more consistent way of describing the variables in Table 1 is needed.
For the purposes of this analysis, we converted all variables to binary categorical values using the “low” and “high” designations shown in Table 1. Thus, the dataset used for all three data mining methods consisted of 15 binary inputs (low or high) and one binary output (low or high) for all 139 experiments. For each experiment, variables were classified as either “low” or “high” based on the median value for that variable. This allows for all variables to be combined in a consistent way. Alternate formulations of the dataset, such as some combination of binary and continuous variables, would likely result in severe under- or over-fitting of the measured data. Lastly, we note that data mining algorithms are often formulated to better predict a category variable such as a decision or classification. Thus, a wider array of data mining techniques can be used by uniformly applying this binary classifier for all variables.
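The median-based conversion described above can be illustrated with a minimal Python sketch. This is not the SAS code used in the study; the function name `binarize`, the optional `cutoff` argument, and the sample values are hypothetical. The sketch assumes that values equal to the cutoff fall in the "high" category, which matches the "≥" entries in Table 1; variables with an "=0" low category would need an explicit cutoff of 1.

```python
from statistics import median

def binarize(values, cutoff=None):
    """Label each value 'high' if it is >= cutoff, else 'low'.

    If no cutoff is given, the median of the values is used
    (assumption: ties at the median are labeled 'high').
    """
    if cutoff is None:
        cutoff = median(values)
    return ["high" if v >= cutoff else "low" for v in values]

# Hypothetical residence times (hr); these are not values from the study.
residence_time = [3.4, 4.0, 4.2, 5.1, 10.1]
labels = binarize(residence_time)

# Hypothetical ring counts with an explicit cutoff (low: =0, high: >=1).
ring_labels = binarize([0, 0, 1, 3], cutoff=1)
```

Passing the cutoff explicitly keeps the categories tied to the thresholds reported in Table 1 rather than to the median of whatever subset of data is supplied.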
2.3. Data mining methods
In this section, we describe the data mining methods used in this research: nearest neighbor, decision trees, and pattern mining. All three methods involved supervised training in the analysis, meaning that the output variable (high or low SOA volume) was set based on the experimental data. Each method used the 15 input variables listed in Table 1 along with the categorical output of low or high concentration of SOA volume. All analyses were coded in SAS version 9.4 (SAS Institute Inc., Cary, NC). The nearest neighbor analysis (described below) was written in matrix form using the Interactive Matrix Language (PROC IML) in SAS version 9.4.
2.3.1. Nearest neighbor
A nearest neighbor method calculates the distance between a given data point and the remaining points in a dataset. The distance for this method is determined by using a similarity distance summed across all input variables. Nearest neighbor methods do not require pre-existing assumptions on the underlying data and the results can be used to inform more complex analyses. Because the variables used in this analysis were all converted to binary values (high or low), the distance measure in this case is a similarity index. The distance between two data points for a given variable is then either similar or dissimilar. Across all variables, the similarity index can be written mathematically as:
$$d(X, Y) = \sum_{i=1}^{m} \delta(x_i, y_i) \tag{1}$$
where d represents a distance measure between data points X and Y, δ is a similarity index (valued 0 or 1) between the ith variables of those data points, and m is the number of variables. A dissimilarity index for cluster analysis of category variables is given by Huang (1998) and is analogous to Equation 1. Summing across all data points gives the distance of data point X from the entire dataset. In this case, data are measured by their proximity to either low or high SOA volumes. More detailed descriptions of nearest neighbor approaches are given elsewhere (Huang, 1998; Bishop, 2005; Han et al., 2012).
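A minimal Python sketch of this distance calculation is given below. It is not the PROC IML code used in the study. It assumes that δ equals 1 when two binary values differ and 0 when they match (the dissimilarity form of Huang, 1998), and it averages the distances to each group rather than summing them so that group size does not affect the comparison. The function names and the sample data are hypothetical.

```python
def mismatch_distance(x, y):
    """Sum of delta over all variables, with delta = 1 where values differ."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def mean_distance_to_group(x, group):
    """Average distance from point x to every point in a group."""
    return sum(mismatch_distance(x, g) for g in group) / len(group)

def classify(x, low_group, high_group):
    """Assign x to whichever SOA group is nearer (ties go to 'low')."""
    d_low = mean_distance_to_group(x, low_group)
    d_high = mean_distance_to_group(x, high_group)
    return "high" if d_high < d_low else "low"

# Hypothetical binary inputs (1 = high, 0 = low) for four variables.
low_group = [(0, 0, 0, 0), (0, 1, 0, 0)]
high_group = [(1, 1, 1, 0), (1, 0, 1, 1)]
query = (1, 1, 1, 1)
result = classify(query, low_group, high_group)
```

In this example the query point differs from each high SOA point in one variable and from the low SOA points in three or four variables, so it is assigned to the high group.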
2.3.2. Decision trees
Decision trees are configurations of branches and nodes used to separate data through a series of decisions (Han et al., 2012). Decision trees can illustrate the hierarchical nature of input variables as related to an outcome. These decisions can be described using specified categories of outcomes, which for this analysis results in category outcomes of either low or high SOA concentration. Branches emanating from a node are based on specific criteria used to segregate the data. The resulting flowchart of decisions illustrates important variables that factor into determining an outcome. Decision trees can be generated without pre-existing knowledge of the data structure and without initial estimates of model parameters, making them well suited to exploratory data analysis.
Each node involves a test condition that is used to apportion the data. Each variable is measured based on its ability to isolate the high (or low) SOA experiments. We use a performance measure called the precision to determine the best variable for a node. The precision is defined (Han et al., 2012) as the ratio of true positives to all positives (true positives and false positives). The precision then gives a direct indication of how well the input variables can predict high (or low) SOA volumes. The variable having the highest precision is used for a given node and the data are then segregated according to the median values listed in Table 1. Branches emanating from a node are used for subsequent nodes if they contain at least 10% of the measured data; otherwise those data are classified as a terminal node and that part of the tree is then complete.
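The node selection rule can be sketched in Python as follows. This is an illustrative sketch rather than the SAS implementation used in the study. It assumes, for simplicity, that an input in the "high" category predicts high SOA (and "low" predicts low SOA), which does not cover cases where a low-valued input separates high SOA experiments. The function names, variable names, and sample data are hypothetical.

```python
def precision(inputs, outputs, target="high"):
    """True positives divided by all positives for one input variable.

    A 'positive' is an experiment whose input category equals the target.
    """
    predicted = [o for i, o in zip(inputs, outputs) if i == target]
    if not predicted:
        return 0.0
    true_positives = sum(1 for o in predicted if o == target)
    return true_positives / len(predicted)

def best_split(table, outputs, target="high"):
    """Return the variable name with the highest precision."""
    return max(table, key=lambda name: precision(table[name], outputs, target))

def is_terminal(branch_size, total, min_fraction=0.10):
    """A branch with fewer than 10% of all experiments is not split again."""
    return branch_size < min_fraction * total

# Hypothetical binary data for five experiments.
table = {
    "methyls": ["high", "high", "low", "low", "high"],
    "rh": ["high", "low", "high", "low", "low"],
}
outputs = ["high", "high", "low", "low", "low"]
chosen = best_split(table, outputs)
```

Here "methyls" has a precision of 2/3 and "rh" has a precision of 1/2, so "methyls" would be chosen for the node.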
2.3.3. Pattern mining
Pattern mining is a method of identifying combinations of variables that occur together frequently. These combinations are especially important when their occurrences lead to a specific outcome. A group of variables that coincide frequently is called an “itemset.” For example, a 2-itemset refers to two variables that appear together frequently. Because large numbers of variables lead to an exponential increase in the number of combinations, two performance measures called support and confidence are used to screen for combinations of importance. The support and confidence of a variable combination are defined as (Han et al., 2012):
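$$\mathrm{support}(A \Rightarrow B) = P(A \cap B) \tag{2}$$

$$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) \tag{3}$$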
where P is the probability, and A and B are specific variables. Thus, support represents the joint probability of a group of variables while confidence represents its conditional probability. Also noteworthy is that support and confidence are independent of sample size, given that they are both defined as probabilities. For this study, minimum thresholds of 5% support and 30% confidence were used. More detailed descriptions of pattern mining methods are given elsewhere (e.g., Han et al., 2012).
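A minimal Python sketch of the support and confidence calculations is shown below. It is not the SAS code used in the study. It treats support as the fraction of experiments in which all listed conditions hold and confidence as the fraction of experiments meeting condition A that also meet condition B. The function names, the dictionary layout, and the sample data are hypothetical.

```python
def count_matches(rows, items):
    """Number of rows in which every (variable, value) pair in items holds."""
    return sum(1 for r in rows if all(r.get(k) == v for k, v in items.items()))

def support(rows, a, b):
    """Joint probability that conditions a and b both hold."""
    return count_matches(rows, {**a, **b}) / len(rows)

def confidence(rows, a, b):
    """Conditional probability of b given a."""
    n_a = count_matches(rows, a)
    if n_a == 0:
        return 0.0
    return count_matches(rows, {**a, **b}) / n_a

# Hypothetical experiments; these are not data from the study.
rows = [
    {"N2O5": "on", "SOA": "high"},
    {"N2O5": "on", "SOA": "high"},
    {"N2O5": "off", "SOA": "high"},
    {"N2O5": "off", "SOA": "low"},
    {"N2O5": "off", "SOA": "low"},
]
a, b = {"N2O5": "on"}, {"SOA": "high"}
s, c = support(rows, a, b), confidence(rows, a, b)

# Thresholds stated in the text: 5% support and 30% confidence.
keep = s >= 0.05 and c >= 0.30
```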
For datasets with high dimensionality, pattern mining can be effective at revealing important interdependencies among an array of potential variable combinations. Because confidence is based on conditional probabilities, it is expected to better identify important factors affecting SOA formation. Specifically, an important consideration here is that some variables are prone to a high number of false negatives. One example is the presence of N2O5, which was only used in a few experiments and yet consistently resulted in high SOA volumes. Conversely, most experiments in this study were completed without N2O5, meaning that false negatives (high SOA experiments without N2O5) occurred frequently for this condition. Measures of confidence are less likely to penalize this artifact of the dataset as compared with the support measure, which is only based on their joint probability.
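The contrast between the two measures can be made concrete with a short worked example. Let A denote the presence of N2O5 (15 of the 139 experiments) and B denote high SOA volume, and suppose that k of those 15 experiments produced high SOA. The value k = 13 used below is hypothetical and chosen only for illustration.

$$\mathrm{support} = P(A \cap B) = \frac{k}{139} \le \frac{15}{139} \approx 0.108, \qquad \mathrm{confidence} = P(B \mid A) = \frac{k}{15}$$

$$k = 13: \quad \mathrm{support} = \frac{13}{139} \approx 0.094, \qquad \mathrm{confidence} = \frac{13}{15} \approx 0.87$$

Support is capped near 0.11 regardless of how reliably N2O5 leads to high SOA, because the oxidant was used in few experiments, whereas confidence depends only on the outcomes within those 15 experiments.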
3. Results and Discussion
The overall objective of this paper is to use data mining techniques to better understand factors important to SOA formation. Among these factors, we examine common considerations with respect to experimental design, including the chemical characteristics of the precursor, the choice of oxidant, and the initial conditions in the experimental chamber. The complex nature of these experimental considerations is shown graphically in Figure 1, where volume concentrations of SOA formed (nL/m3) are shown as a function of three continuous variables (molecular weight, concentration of reacted hydrocarbon, and residence time). Results similar to Figure 1 can be seen for the remaining variables considered in this study and are shown in Figures S9-S11 (Supporting Information). Even when separating SOA volumes by oxidant, associations between SOA volumes and input variables are unclear. As suggested from Figure 1, oxidant choice alone is generally a poor indicator of high SOA. For example, 41% of the high SOA experiments used NOx as an oxidant, an identical percentage (41%) to the proportion of low SOA experiments using NOx. Similarly, 39% of the high SOA experiments used H2O2 as an oxidant, similar to the proportion of low SOA experiments (34%). The main exception to this pattern is for N2O5, an oxidant that is discussed in more detail below. Thus, the hierarchical nature of data mining approaches is expected to be useful in better understanding factors contributing to SOA formation.
Figure 1.
Volume concentrations of SOA (nL/m3) as a function of the following experimental conditions: (a) molecular weight of the SOA precursor (g/mol), (b) concentration of reacted hydrocarbon (ppmC), and (c) residence time in the chamber (hr).
3.1. Nearest neighbor
An example nearest neighbor result is shown in Figure 2, where concentrations of SOA volumes are shown as a function of either the number of rings or the number of methyl groups. Each experiment is classified by a similarity distance that is nearest to either low or high SOA volumes. While the nearest neighbor distances were based on the binary classifications of low or high volumes, actual values of SOA volumes are plotted in Figure 2 to better visualize the data.
Figure 2.

Nearest neighbor results for categories of high and low SOA volume as a function of (a) the number of rings and (b) the number of methyl groups.
The data in Figure 2 show some separation by input variables, though in both cases the data are poorly separated by the output variable. For the number of rings, all experiments classified as low volume used hydrocarbon precursors with either zero or one ring. Conversely, the experiments classified as high volume were more distributed among the range of input values (0-3 rings). For the number of methyl groups, most of the low methyl experiments (<2 methyl groups) are classified as a low SOA experiment. However, the high methyl group shows poor separation in Figure 2, a result consistent with the high number of rings.
For the output variable in Figure 2, SOA concentrations are poorly separated. In fact, only three of the highest six experiments in terms of SOA volume were classified as high aerosol. Mean values of the high SOA category (88.8 nL/m3) shown in Figure 2 are higher than the low SOA category (65.1 nL/m3), though this difference is not statistically significant given the overall high standard deviations of the two categories (>60 nL/m3). Given the limitations of the nearest neighbor approach for these data, more complex hierarchical methods are needed to better quantify differences among input variables.
3.2. Decision trees
Figure 3 shows a decision tree constructed using high SOA volumes as the outcome variable. The left branch leaving each node indicates a low SOA volume grouping while the right branch indicates a high-volume grouping. In general, the high-valued input variables are used to separate the high SOA experiments and the low-valued variables are used to separate the low SOA experiments. The only exception to this pattern is for low concentrations of reacted hydrocarbon, which is used to separate the high SOA experiments on the left-hand side of Figure 3.
Figure 3.

Decision tree using high SOA volumes as the outcome variable.
The first node in Figure 3 (the number of methyl groups) was selected based on the highest precision value. The association of a high number of methyl groups with high SOA volumes is consistent with previously reported experimental results. While the tendency for a methylated analogue to produce more SOA than the parent compound is expected (e.g., toluene versus benzene), the splitting criterion in this case is two or more methyl groups (Table 1). Thus, the high SOA group from this node includes several biogenic compounds (e.g., α-pinene, β-pinene, β-caryophyllene, and limonene). Conversely, the category with a low number of methyl groups includes several compounds that typically produce lower SOA yields (e.g., straight-chained alkanes, isoprene).
For the high methyl group, the data are further separated in the next node by the presence of N2O5, a compound that is a known strong oxidant in experiments of SOA formation. Because of the low number of experiments that included N2O5 in the initial conditions (n = 15), the high SOA data from this node become a terminal node. Caution should be exercised when interpreting the results stemming from the N2O5 experiments, due in large part to this relatively small number of experiments. The resulting subset of experiments consisted of only seven different SOA precursors: α-pinene, β-pinene, β-caryophyllene, d-limonene, isoprene, naphthalene, and oleic acid. These results clearly indicate the importance of N2O5 as an indicator of SOA formation, though a wider range of chemical precursors and experimental conditions would assist in supporting this finding.
The results in Figure 3 are similar to findings reported by Offenberg et al. (2017) in that both methyl groups and N2O5 were included in a neural network model formulated with the same experimental data. The previous study involved enthalpies of vaporization and not SOA formation as the output variable, and as such the results are not directly comparable. However, the relevance of both methyl groups and N2O5 to these studies indicates in general the importance of these variables to aerosol processes.
Figure 4 shows the decision tree constructed using low SOA volumes as the outcome variable. This tree has the same general configuration as in Figure 3, where the left branch indicates a low volume grouping and the right branch indicates a high-volume grouping. The only difference is that Figure 4 is constructed using low volumes as the decision outcome, meaning the choice of variable for each node is based on the precision in predicting low volumes. Because the number of false positives and false negatives in predicting SOA volumes will differ among each of the variables, this will result in different decision trees.
Figure 4.

Decision tree using low SOA volumes as the outcome variable.
The highest precision for the first node in Figure 4 is for the residence time, where low residence times (<4.2 hr) are associated with low SOA concentrations. These areas of low SOA concentrations are for conditions of low carbon number (<10) and high relative humidity (≥30%). The only area of high SOA concentration on the left side of Figure 4 is for the combination of high carbon number (≥10) and low relative humidity (<30%).
Because the residence time affects numerous other factors in SOA formation (e.g., time for chemicals to react, concentration of reacted hydrocarbon, steady-state concentrations in the chamber), this variable can clearly have a complex interpretation. While shorter reaction times might suggest lower SOA volumes, this may simply be an artifact of experiments involving precursors that readily form SOA. In addition, the relatively narrow range of residence times (generally between 4 and 6 hr) means that small differences in input variables could lead to misclassifying an experiment as either high or low SOA.
For the high residence time group (≥4.2 hr), the data are further separated in the next node using the number of rings. The branch for no rings represents the low SOA group, and includes n-alkanes, isoprene, and 1,3-butadiene. The high ring group is further separated in the next node using the number of methyl groups. This grouping (high number of rings, high number of methyl groups) includes many of the same reactive biogenic compounds discussed above in the high SOA groups in Figure 3 (e.g., α-pinene, β-pinene, β-caryophyllene, and limonene).
For the low residence time grouping in Figure 4, the data are further separated in the next node using the number of carbon atoms. This low carbon group includes isoprene, 1,3-butadiene, benzene, toluene, ethylbenzene, xylenes, trimethylbenzenes, octane, and nonane. Thus, the low carbon group includes many of the same straight-chain alkanes as the no ring group described above.
Several broad observations can be made from the two decision trees in Figures 3 and 4. First, all aspects of experimental design involving SOA formation (characteristics of the precursor hydrocarbon, choice of oxidant, and operating conditions) are included in the decision trees. To be sure, the number of nodes (i.e., variables) grows exponentially with the number of layers, such that even the three layers shown here could include as many as seven different nodes per tree. However, a key interpretation from Figures 3 and 4 is in the hierarchy of variables, where the variables that best separate high and low SOA volumes (based on the criteria discussed in Section 2.3.2) are at the top of each decision tree. Thus, the most important of these nodes include characteristics of the precursor hydrocarbon (methyl groups, carbon atoms, rings), choice of oxidant (N2O5), and operating conditions (residence time).
Second, chemical characteristics of the precursor hydrocarbon appear throughout the decision trees. Of these, both the size of the molecule (e.g., molecular weight) and its structure are important to SOA formation. Another observation among chemical characteristics stems from the definition of the high methyl group as having two or more methyl groups. In this case, the high methyl group (e.g., α-pinene, β-pinene, β-caryophyllene, and limonene) largely overlaps with the high number of rings, the main difference being that the high rings group also includes aromatic rings.
Last, two of the most frequently studied variables in terms of SOA formation, relative humidity and NOx concentration, are relatively unimportant based on the decision trees in Figures 3 and 4. For relative humidity, this finding is largely consistent with existing studies, where the humidity has generally been found to have little to no effect on SOA yields. For initial NOx concentrations, where existing studies have often reported high NOx concentrations leading to lower SOA yields, results are not consistent with those shown in Figures 3 and 4. One possible reason for this discrepancy could be the low number of experiments in this study having relatively high NOx concentrations (n=19 with initial NOx concentration > 300 ppb).
3.3. Pattern mining
Figure 5 shows values of confidence for a 3-itemset analysis involving all 15 variables (i.e., 30 binary values). In this case, two variables are shown in Figure 5 while the third variable is the category value for high SOA volumes. The graph is symmetric about the main diagonal, much like a plot (or table) from a correlation matrix. The highest values in Figure 5 indicate conditions most associated with each other and with high SOA volumes. Associations for a variable can be seen by following that variable down a column (or across a row). Variables with mostly high confidence values include the presence of N2O5, high residence times, high number of methyl groups, high number of carbons, high number of rings, and low concentrations of reacted hydrocarbons.
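The structure of such a pairwise confidence matrix can be sketched in Python as follows. This is an illustrative sketch and not the SAS code used to produce Figure 5. It assumes that the support of a 3-itemset is the fraction of all experiments in which both input conditions and the outcome hold, and that confidence is the fraction of experiments meeting both input conditions that also show the outcome. The function name, dictionary layout, and sample data are hypothetical.

```python
from itertools import combinations

def confidence_matrix(rows, outcome=("SOA", "high"),
                      min_support=0.05, min_confidence=0.30):
    """Confidence of {a, b} => outcome for every pair of input categories."""
    variables = sorted(k for k in rows[0] if k != outcome[0])
    # Each variable contributes two binary values (low and high).
    items = [(v, c) for v in variables for c in ("low", "high")]
    n = len(rows)
    matrix = {}
    for a, b in combinations(items, 2):
        if a[0] == b[0]:
            continue  # low and high of one variable cannot co-occur
        both = [r for r in rows if r[a[0]] == a[1] and r[b[0]] == b[1]]
        if not both:
            continue
        hits = sum(1 for r in both if r[outcome[0]] == outcome[1])
        supp, conf = hits / n, hits / len(both)
        if supp >= min_support and conf >= min_confidence:
            # Store both orderings so the matrix is symmetric.
            matrix[(a, b)] = matrix[(b, a)] = conf
    return matrix

# Hypothetical experiments with two input variables.
rows = [
    {"methyls": "high", "rings": "high", "SOA": "high"},
    {"methyls": "high", "rings": "high", "SOA": "high"},
    {"methyls": "high", "rings": "low", "SOA": "low"},
    {"methyls": "low", "rings": "low", "SOA": "low"},
]
m = confidence_matrix(rows)
```

Pairs that fail either threshold are left out of the dictionary, which corresponds to the blank squares in a plot of this kind.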
Figure 5.
Pattern mining results for a 3-itemset analysis of all input variables for the high SOA volume outcome. Values of the confidence are shown. Blank (white) squares indicate variable combinations that did not meet minimum thresholds for either support or confidence. Abbreviations used: DeltaHC = hydrocarbon consumed, MolWt = molecular weight, RH = relative humidity, ResTime = residence time, # ExtDB = number of external double bonds, # IntDB = number of internal double bonds, # aromatics = number of aromatic rings, # carbons = number of carbon atoms, # methyls = number of methyl groups, # oxygens = number of oxygen atoms, # rings = number of rings.
In fact, the highest value in Figure 5 (confidence = 1.0) is for the combination of high residence times and the presence of N2O5, which for a value of 1.0 means that all experiments with high residence times and N2O5 produced high SOA concentrations. The next highest value of confidence was for the combination of N2O5 and a high number of external double bonds (confidence = 0.88), followed by the combination of a high number of methyl groups and a high concentration of reacted hydrocarbon (0.85). These results are also similar to those shown previously for the decision trees (Figures 3 and 4), where a high number of methyl groups, high residence times and the presence of N2O5 were all associated with high SOA experiments.
Figure 6 shows the corresponding values of confidence when the third variable is the category value for low SOA volumes. Many of the same variables important to high SOA volumes are also important to low SOA volumes. The main difference is that the low-valued inputs are generally associated with the low SOA volumes, such as a low number of methyl groups or a low number of rings. Other variables associated with the low SOA experiments include high concentrations of reacted hydrocarbon and low molecular weights (and the closely related low carbon number).
Figure 6.
Pattern mining results for a 3-itemset analysis of all input variables for the low SOA volume outcome. Values of the confidence are shown. Blank (white) squares indicate variable combinations that did not meet minimum thresholds for either support or confidence. Abbreviations used: DeltaHC = hydrocarbon consumed, MolWt = molecular weight, RH = relative humidity, ResTime = residence time, # ExtDB = number of external double bonds, # IntDB = number of internal double bonds, # aromatics = number of aromatic rings, # carbons = number of carbon atoms, # methyls = number of methyl groups, # oxygens = number of oxygen atoms, # rings = number of rings.
The highest value in Figure 6 (confidence = 0.91) is for the combination of high concentrations of reacted hydrocarbon and a high number of external double bonds. The next highest values all include low number of carbons, coupled with either a low number of aromatic rings (confidence = 0.89), a low number of rings (0.89), or a low number of internal double bonds (0.89).
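The layout of Figures 5 and 6 (a symmetric grid of confidence values, with blank squares where thresholds are not met) can be sketched as follows. This is an illustrative outline under assumptions, not the implementation used for this paper: the function name, the threshold values, the item labels, and the toy experiments are all hypothetical.

```python
from itertools import combinations


def confidence_matrix(experiments, items, outcome,
                      min_support=0.1, min_confidence=0.6):
    """Build a symmetric table of confidence values for {a, b} -> outcome.

    Cells are left as None (a blank square) when the pair never occurs or
    when the rule misses either threshold. Threshold defaults are placeholders.
    """
    n_total = len(experiments)
    matrix = {a: {b: None for b in items} for a in items}
    for a, b in combinations(items, 2):
        n_pair = sum(1 for e in experiments if a in e and b in e)
        if n_pair == 0:
            continue  # the pair never occurs together
        n_rule = sum(1 for e in experiments if a in e and b in e and outcome in e)
        support = n_rule / n_total
        confidence = n_rule / n_pair
        if support >= min_support and confidence >= min_confidence:
            # The pair {a, b} is unordered, so the table is symmetric
            matrix[a][b] = matrix[b][a] = confidence
    return matrix


# Hypothetical toy data (NOT the experiments from this study)
toy_rows = [
    {"N2O5", "ResTime=high", "Methyls=high", "SOA=high"},
    {"N2O5", "ResTime=high", "SOA=high"},
    {"N2O5", "Methyls=high", "SOA=low"},
    {"ResTime=high", "Methyls=high", "SOA=low"},
    {"Methyls=high", "SOA=low"},
    {"ResTime=high", "Methyls=high", "SOA=high"},
]
toy_items = ["N2O5", "ResTime=high", "Methyls=high"]

matrix = confidence_matrix(toy_rows, toy_items, "SOA=high")
for a in toy_items:
    print(a, [matrix[a][b] for b in toy_items])
```

Because each rule is built from an unordered pair of input items, the value for (a, b) equals the value for (b, a), which gives the symmetry about the main diagonal noted for Figure 5.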
From these pattern mining results, several broad observations can be identified in terms of factors important to SOA formation. First, results from Figures 5 and 6 are similar to those from the decision trees, where the high-valued inputs are generally associated with high SOA and the low-valued inputs with low SOA. Second, several variables identified in Figures 5 and 6 were also important to the decision trees, including the importance of chemical characteristics like the number of methyl groups and the number of rings. Similarly, the presence of N2O5 was the most important oxidant in terms of SOA production. Last, several variables consistently generate high confidence values for high SOA (Figure 5), including N2O5, high residence times, high number of methyl groups, and high number of rings. When these best indicators were used in combination with each other (e.g., N2O5 and high residence times), they generally produced the highest confidence values for the high SOA condition. Thus, the specific combination of variables is an important consideration in assessing the potential for SOA formation. For example, branched alkanes have generally been shown to form little or no SOA, and only exhibit one factor (methyl groups) indicative of SOA formation. Conversely, terpenes are generally associated with high SOA formation, and exhibit multiple factors (e.g., methyl groups and rings).
3.4. Comparison to previous studies
Numerous studies have reported SOA formation from laboratory chambers. Most of these studies focused on a single precursor hydrocarbon. A few studies reported SOA yields from multiple compounds that were each tested individually, although these studies were generally limited to one compound class (e.g., terpenes). For example, Griffin et al. (1999) reported SOA yields from 14 biogenic hydrocarbons. Reported mass yields were highest for sesquiterpenes such as β-caryophyllene and α-humulene (yields ranging from 17-67%), followed by cyclic diolefins such as d-limonene (2-3%), and bicyclic olefins such as Δ3-carene, α-pinene, and β-pinene (2-15%). Hoffmann et al. (1997) reported comparable SOA yields from a similar set of biogenic compounds, where results also included the open chain hydrocarbon linalool (approximately 5%). SOA yields from a similar set of biogenic compounds were also reported by Lee et al. (2006a and b). Numerous other studies have reported aerosol yields from a single SOA precursor. For example, more recent studies have reported aerosol yields from α-pinene (e.g., Liu and Hopke, 2014; Wang et al., 2014; Liu et al., 2013; Eddingsaas et al., 2012; Wang et al., 2011). For these studies, ranges of aerosol yields from α-pinene were 15-2% (Liu and Hopke, 2014), 6.7-38.7% (Wang et al., 2014), 4.1-11.9% (Liu et al., 2013), 7.7-36.7% (Eddingsaas et al., 2012), and 12.1-8.2% (Wang et al., 2011).
The seven biogenic compounds listed above were also used in this analysis and thus provide a preliminary indication of the influence of chemical structure on SOA formation. All seven of these compounds have characteristics indicative of high SOA volumes (see Table 1), including a high number of carbon atoms, a high number of methyl groups, and high molecular weights. Among the highest yields are those for α-humulene, which is the only compound (of the seven listed above) having a high number of internal double bonds. Conversely, among the lowest yields are those for linalool and Δ3-carene, the only two compounds (of the seven listed above) having a low number of rings. Thus, for the subset described herein, the results in this paper are broadly consistent with these previously reported laboratory findings.
The effects of relative humidity and initial NOx concentration on SOA formation have been reported from laboratory studies. For relative humidity, studies have generally reported that the humidity had little or no effect on SOA yields. This has been reported for various compounds, including m-xylene (Cocker et al., 2001b), 1,3,5-trimethylbenzene (Cocker et al., 2001b), toluene (Edney et al., 2000), and α-pinene (Cocker et al., 2001a; Kristensen et al., 2014). One exception to this pattern was reported by Healy et al. (2009) for p-xylene, though that work covered a wider range of relative humidities (5-75%) than the current study (2-50%). Healy et al. (2009) also reported fewer experiments (n=8) than the present study (n=139), which is one possible explanation for the differences between the two studies. In addition, a relatively small number of experiments in this study used xylenes as a precursor (7 for m-xylene, 1 for o-xylene, and 1 for p-xylene). Thus, previously reported findings are broadly consistent with the data mining results in this paper, where differences in the relative humidity (when RH<50%) are generally not strong indicators of SOA formation in either the pattern mining or the decision tree results.
For the initial NOx concentration, previous studies have generally shown that high NOx concentrations lead to a significant decrease in SOA yields. This finding has been reported for m-xylene (Song et al., 2005; Ng et al., 2007a; Song et al., 2007), toluene (Ng et al., 2007a), benzene (Ng et al., 2007a), and α-pinene (Presto et al., 2005; Ng et al., 2007b; Eddingsaas et al., 2012). This general pattern is reversed for the two sesquiterpenes longifolene and aromadendrene (Ng et al., 2007b), where yields are highest for the high initial NOx condition. The effect of NOx concentration on SOA yields from isoprene is more complex, as described in experimental work by Kroll et al. (2006). In that work, SOA yields from isoprene decreased with increasing NOx at higher concentrations (>200 ppb NOx), while the opposite pattern was observed at lower NOx concentrations.
As shown for both decision trees and pattern mining, NOx concentrations were not identified as having an important effect on SOA volumes. As discussed above in Section 3.2, one possible reason is that most experiments used in this analysis were completed at NOx concentrations less than 300 ppb. Another explanation is that previous studies have typically considered a single hydrocarbon precursor in isolation, comparing low NOx versus high NOx under otherwise similar operating conditions. The present study considers a total of 15 different variables, where a variety of factors (chemical characteristics, choice of oxidant, and initial conditions in the chamber) contributed to the classification of high or low SOA.
4. Conclusions
Despite numerous experimental studies on SOA formation, none have used data mining techniques to better understand factors that most influence SOA formation. The formation of SOA is dependent on a wide array of potentially important factors (properties of the SOA precursor, use of oxidants, and initial conditions of the experiment), and data mining methods can be a powerful tool to better understand the complexity and high dimensionality intrinsic to these experiments. A broad range of experimental conditions were considered for this analysis, and as such the findings should be of interest for future experimental studies on SOA formation. This study focused on describing experimentally determined aerosol yields. However, the results described herein could also lend insight into predicting yields for unknown compounds, particularly for a prescribed set of experimental and chemical conditions. Major observations stemming from this work include the following:
Of the three data mining methods evaluated, both decision trees and pattern mining were most effective at identifying factors associated with SOA formation. The two methods identified a similar set of chemical and experimental conditions important to SOA formation.
All three broad considerations in experimental design (chemical properties of the precursor, choice of oxidant, and initial conditions in the chamber) were identified as factors contributing to SOA production. Both the number of methyl groups and the number of rings were consistently identified as being associated with the high SOA experiments.
Among the four choices of oxidants (NOx, H2O2, O3, or N2O5), the presence of N2O5 was most associated with the high SOA experiments. A relatively small number of experiments included the presence of N2O5 (n=15), meaning that additional experimental research could be useful in validating this finding.
Combinations of variables led to the highest associations with SOA formation, suggesting the importance of interdependencies among variables. One example is for experiments having high residence times and the presence of N2O5, which had the highest confidence value from the pattern mining results (confidence = 1.0).
Acknowledgments
The U.S. Environmental Protection Agency through its Office of Research and Development funded and collaborated in the research described here under Contract EP-C-15-008 to Jacobs Technology. The manuscript has been subjected to internal review and has been cleared for publication. The views expressed in this article are those of the authors, and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency. Mention of organization names and trademarks does not constitute endorsement.
References
- Bishop CM, 2006. Pattern Recognition and Machine Learning, Springer, New York, New York.
- Cocker DR, Clegg SL, Flagan RC, Seinfeld JH, 2001a. The effect of water on gas-particle partitioning of secondary organic aerosol. Part I: α-pinene/ozone system. Atmospheric Environment 35, 6049–6072.
- Cocker DR, Mader BT, Kalberer M, Flagan RC, Seinfeld JH, 2001b. The effect of water on gas-particle partitioning of secondary organic aerosol: II. m-xylene and 1,3,5-trimethylbenzene photooxidation systems. Atmospheric Environment 35, 6073–6085.
- Eddingsaas NC, Loza CL, Yee LD, Chan M, Schilling KA, Chhabra PS, Seinfeld JH, Wennberg PO, 2012. α-pinene photooxidation under controlled chemical conditions - Part 2: SOA yield and composition in low- and high-NOx environments. Atmospheric Chemistry and Physics 12, 7413–7427.
- Edney EO, Driscoll DJ, Speer RE, Weathers WS, Kleindienst TE, Li W, Smith DF, 2000. Impact of aerosol liquid water on secondary organic aerosol yields of irradiated toluene/propylene/NOx/(NH4)2SO4/air mixtures. Atmospheric Environment 34, 3907–3919.
- Edney EO, Kleindienst TE, Jaoui M, Lewandowski M, Offenberg JH, Wang W, Claeys M, 2005. Formation of 2-methyl tetrols and 2-methylglyceric acid in secondary organic aerosol from laboratory irradiated isoprene/NOx/SO2/air mixtures and their detection in ambient PM2.5 samples collected in the eastern United States. Atmospheric Environment 39, 5281–5289.
- Ervens B, Turpin BJ, Weber RJ, 2011. Secondary organic aerosol formation in cloud droplets and aqueous particles (aqSOA): a review of laboratory, field and model studies. Atmos. Chem. Phys., 11, 11069–11102, doi: 10.5194/acp-11-11069-2011.
- Gentner DR, Jathar SH, Gordon TD, Bahreini R, Day DA, Haddad IE, Hayes PL, Pieber SM, Platt SM, de Gouw J, Goldstein AH, Harley RA, Jimenez JL, Prévôt ASH, Robinson AL, 2017. Review of urban secondary organic aerosol formation from gasoline and diesel motor vehicle emissions. Environ. Sci. Technol., 51, 1074–1093, doi: 10.1021/acs.est.6b04509.
- Griffin RJ, Cocker DR, Flagan RC, Seinfeld JH, 1999. Organic aerosol formation from the oxidation of biogenic hydrocarbons. Journal of Geophysical Research: Atmospheres 104, 3555–3567.
- Hallquist M, Wenger JC, Baltensperger U, Rudich Y, Simpson D, Claeys M, Dommen J, Donahue NM, George C, Goldstein AH, Hamilton JF, Herrmann H, Hoffmann T, Iinuma Y, Jang M, Jenkin ME, Jimenez JL, Kiendler-Scharr A, Maenhaut W, McFiggans G, Mentel TF, Monod A, Prevot ASH, Seinfeld JH, Surratt JD, Szmigielski R, Wildt J, 2009. The formation, properties and impact of secondary organic aerosol: current and emerging issues. Atmospheric Chemistry and Physics 9, 5155–5236.
- Han JH, Kamber M, Pei J, 2012. Data Mining: Concepts and Techniques, Morgan Kaufmann, Waltham, Massachusetts.
- Healy RM, Temime B, Kuprovskyte K, Wenger JC, 2009. Effect of relative humidity on gas/particle partitioning and aerosol mass yield in the photooxidation of p-xylene. Environmental Science and Technology 43, 1884–1889.
- Hoffmann T, Odum JR, Bowman F, Collins D, Klockow D, Flagan RC, Seinfeld JH, 1997. Formation of organic aerosols from the oxidation of biogenic hydrocarbons. Journal of Atmospheric Chemistry 26, 189–222.
- Huang Z, 1998. Extensions to the k-Means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2, 283–304.
- Kristensen K, Cui T, Zhang H, Gold A, Glasius M, Surratt JD, 2014. Dimers in α-pinene secondary organic aerosol: Effect of hydroxyl radical, ozone, relative humidity and aerosol acidity. Atmospheric Chemistry and Physics 14, 4201–4218.
- Kroll JH, Ng NL, Murphy SM, Flagan RC, Seinfeld JH, 2006. Secondary organic aerosol formation from isoprene photooxidation. Environmental Science and Technology 40, 1869–1877.
- Kroll JH, Seinfeld JH, 2008. Chemistry of secondary organic aerosol: Formation and evolution of low-volatility organics in the atmosphere. Atmospheric Environment 42, 3593–3624.
- Lee A, Goldstein AH, Keywood MD, Gao S, Varutbangkul V, Bahreini R, Ng NL, Flagan RC, Seinfeld JH, 2006a. Gas-phase products and secondary aerosol yields from the ozonolysis of ten different terpenes. Journal of Geophysical Research: Atmospheres 111, D07302, doi: 10.1029/2005JD006437.
- Lee A, Goldstein AH, Kroll JH, Ng NL, Varutbangkul V, Flagan RC, Seinfeld JH, 2006b. Gas-phase products and secondary aerosol yields from the photooxidation of 16 different terpenes. Journal of Geophysical Research: Atmospheres 111, D17305, doi: 10.1029/2006JD007050.
- Liu C, Chu B, Liu Y, Ma Q, Ma J, He H, Li J, Hao J, 2013. Effect of mineral dust on secondary organic aerosol yield and aerosol size in α-pinene/NOx photo-oxidation. Atmos. Environ., 77, 781–789, doi: 10.1016/j.atmosenv.2013.05.064.
- Liu Y, Hopke PK, 2014. A chamber study of secondary organic aerosol formed by ozonolysis of α-pinene in the presence of nitric oxide. J. Atmos. Chem., 71, 21–32, doi: 10.1007/s10874-014-9278-9.
- Ng NL, Kroll JH, Chan AWH, Chhabra PS, Flagan RC, Seinfeld JH, 2007a. Secondary organic aerosol formation from m-xylene, toluene, and benzene. Atmospheric Chemistry and Physics 7, 3909–3922.
- Ng NL, Chhabra PS, Chan AWH, Surratt JD, Kroll JH, Kwan AJ, McCabe DC, Wennberg PO, Sorooshian A, Murphy SM, Dalleska NF, Flagan RC, Seinfeld JH, 2007b. Effect of NOx level on secondary organic aerosol (SOA) formation from the photooxidation of terpenes. Atmospheric Chemistry and Physics 7, 5159–5174.
- Offenberg JH, Lewandowski M, Kleindienst TE, Docherty KS, Jaoui M, Krug J, Riedel TP, Olson DA, 2017. Predicting thermal behavior of secondary organic aerosols. Environmental Science and Technology 51, 9911–9919.
- Presto AA, Hartz KEH, Donahue NM, 2005. Secondary organic aerosol production from terpene ozonolysis. 2. Effect of NOx concentration. Environmental Science and Technology 39, 7046–7054.
- Song C, Na K, Cocker DR, 2005. Impact of the hydrocarbon to NOx ratio on secondary organic aerosol formation. Environmental Science and Technology 39, 3143–3149.
- Song C, Na K, Warren B, Malloy Q, Cocker DR, 2007. Secondary organic aerosol formation from m-xylene in the absence of NOx. Environmental Science and Technology 41, 7409–7416.
- Wang J, Doussin JF, Perrier S, Perraudin E, Katrib Y, Pangui E, Picquet-Varrault B, 2011. Design of a new multi-phase experimental simulation chamber for atmospheric photosmog, aerosol and cloud chemistry research. Atmos. Meas. Tech., 4, 2465–2494, doi: 10.5194/amt-4-2465-2011.
- Wang X, Liu T, Bernard F, Ding X, Wen S, Zhang Y, Zhang Z, He Q, Lü S, Chen J, Saunders S, Yu J, 2014. Design and characterization of a smog chamber for studying gas-phase chemical mechanisms and aerosol formation. Atmos. Meas. Tech., 7, 301–313, doi: 10.5194/amt-7-301-2014.
- Ziemann PJ, Atkinson R, 2012. Kinetics, products, and mechanisms of secondary organic aerosol formation. Chem. Soc. Rev., 41, 6582–6605.