Author manuscript; available in PMC: 2020 Dec 16.
Published in final edited form as: IEEE Access. 2020 Aug 14;8:147738–147755. doi: 10.1109/ACCESS.2020.3013108

Signal Processing Methods to Interpret Polychlorinated Biphenyls in Airborne Samples

RYAN A MCCARTHY 1, ANANYA SEN GUPTA 1, BERNICE KUBICEK 1, ANDREW M AWAD 2, ANDRES MARTINEZ 2, RACHEL F MAREK 2, KERI C HORNBUCKLE 2
PMCID: PMC7742762  NIHMSID: NIHMS1621481  PMID: 33335823

Abstract

The main contribution of this interdisciplinary work is a robust computational framework to autonomously discover and quantify previously unknown associations between well-known (target) and potentially unknown (non-target) toxic industrial air pollutants. In this work, the variability of polychlorinated biphenyl (PCB) data is evaluated using a combination of statistical, signal processing, and graph-based informatics techniques to interpret the raw instrument signal from gas chromatography-mass spectrometry (GC/MS/MS) data sets. Specifically, minimum mean-squared error techniques from the adaptive signal processing literature are extended to detect and separate coeluted (overlapped) peaks in the raw instrument signal. A graph-based visualization is provided that bridges two complementary approaches to quantitative pollution studies: (i) peak-cognizant target analysis, which limits data analysis to a few well-known compounds, and (ii) chemometric analysis, a large-scale statistical analysis that is agnostic of specific compounds. Further, peak fitting techniques based on L2 error minimization are employed to autonomously calculate the amount of each PCB present with a normalized mean square error of −18.4851 dB. Graph-based visualizations of associations between known and unknown compounds are developed through principal component analysis, and both fuzzy c-means (FCM) and k-means clustering techniques are implemented and compared. The efficiency of these methods is compared, using 150 air samples analyzed for individual PCBs with GC/MS/MS, against traditional target-only techniques that perform analysis across only the known (target) PCBs. Parameter optimization techniques are employed to evaluate the relative contribution of PCB signals against ten potential source signals representing legacy signatures from historical manufacture of Aroclors and modern sources of PCBs produced as byproducts of pigment and polymer manufacturing. Aroclors 1232, 1254, 1016, and 1221 as well as non-Aroclor 3,3′-dichlorobiphenyl (PCB 11) were found in many of the samples as unique source signals that describe PCB mixtures in air samples collected from Chicago, IL.

Keywords: Identifying Sources, Interpreting GC/MS/MS, PCBs, Signal Processing

I. INTRODUCTION

RECENT years have seen a surge in interdisciplinary research [1]–[6] combining signal processing and related analytical techniques for interpretation of environmental data sets [2]–[4], [6]–[14]. Statistical techniques [3], [6]–[13] have been employed successfully in interpretation of polychlorinated biphenyl (PCB) data. PCBs are a set of 209 bioaccumulating, persistent, and toxic compounds that are widely found in the environment worldwide. The 209 possible PCB compounds are referred to as congeners and belong to ten sets of PCB homolog isomers, each with the same molecular mass. The relative concentrations of the PCB congeners measured in active air samples vary as a function of proximity to sources, their specific physicochemical properties, meteorological conditions, and historical use [15], [16]. Although not intentionally produced today, PCBs are byproducts of certain manufacturing processes, are still present throughout the environment, and pose multiple health risks to those exposed to them [17]–[20].

PCBs have many different sources which are affected by microbial and environmental processes that change the relative mass of each congener. In a formative study from 1997, Frame analyzed commercial PCB mixtures called Aroclors for their specific PCB content [21]. Each Aroclor has a unique signal which can be used to determine the product or process producing these PCBs. This identification is useful for exposure management and remediation. This work hypothesizes that the evaluation of the raw chromatographic signal will uncover information about the environment that would not be detected from target analysis of the individual PCBs. Specifically, minimum mean-squared error techniques are extended in combination with adaptive signal processing which has been proposed in recent literature [22]–[28] to detect, separate, and analyze coeluted peaks within the raw signal to bridge the gap between peak-cognizant target analysis and statistical chemometric analysis.

This paper is divided into six sections. The remainder of this section describes the instruments used, current state-of-the-art methods, and key contributions of this work in associating dominant peaks with hidden peaks within the signal as well as with sampling locations. Section II describes the collection and extraction of sample data. Section III presents the peak fitting procedures for the signals, and Section IV describes the association techniques implemented within this work. Finally, Sections V and VI discuss and compare the results produced from this work and present conclusions.

A. INSTRUMENT DESCRIPTION AND PREPARATION OF THE RAW INSTRUMENT SIGNAL

The instrumentation used for generating the data used in this work identifies PCBs through gas chromatography-mass spectrometry (GC/MS/MS) in multiple reaction monitoring mode (MRM). This provides selective signal separations of all the known (target) PCBs by their homolog isomer for each sample [29], [30]. The selective signal separations in this work are exploited by fitting curves to the peaks to identify target PCBs within each sample, detailed in Section III. Direct peak interpretation is difficult due to retention time shifts of target PCBs on the chromatograms and non-linearity from sample to sample. To overcome this challenge, PCBs are typically measured in environmental samples by comparing the signal of a calibration standard solution run through GC/MS/MS in MRM mode of target PCB content against that of the prepared environmental sample [31], [32]. The raw instrument signal after this pre-processing is an information-rich signal representative of PCBs in the environment at the time the sample was collected.

B. CURRENT STATE-OF-THE-ART METHODS

Signal processing techniques employing constrained optimization for peak extraction have been exhaustively researched in the beamforming and sonar localization literature [33]–[37] as well as in other applications, e.g. real-time brain activity and heart rate monitoring [38], [39]. Signal processing techniques have also been employed for raw instrumental signal interpretation to provide better analysis in determining environmental pollutants [40], [41]. However, despite these recent computational advances, peak-cognizant raw signal interpretation beyond target compounds remains an open challenge, particularly for studying toxic air pollutants such as PCBs. Further discussion is provided below.

C. TARGET ANALYSIS VS. CHEMOMETRIC ANALYSIS

The raw instrument signal from gas-chromatographic and mass spectrometric instruments carries a wealth of information on the composition of complex mixtures [42]–[47]. However, most chemical analysis and expert interpretation of the raw signal is target-based (e.g. [42] and references in [44]), i.e., focused only on the contribution of target compounds whose chemical properties are well known and which occupy specific positions in the retention time of the instrument signal. Target analysis, while extremely important and relevant for interpreting the dominant or known part of the instrument signal, provides limited opportunity to exploit the full informational power that sophisticated analytical hardware can offer. For example, hundreds, if not thousands, of non-target compounds that manifest as unknown peaks within the raw instrument signal can provide hitherto unforeseen knowledge of environmental pollutants within passive air samples.

Raw signal analysis itself is not new. The rich and growing field of chemometrics [48]–[50] already provides many statistical techniques to analyze the peaks within the raw instrument signal on a large scale. However, purely statistical methods are compound-agnostic and, as such, provide insight only into the aggregate behavior of the raw signal, e.g. dominant trends in a principal component analysis (PCA) [48]. Aggregate studies are useful to understand broader trends but, as yet, are not designed to detect compound-specific information, particularly from unknown toxic contaminants that can be buried in the larger statistical behavior of the raw signal (e.g. hidden against more dominant targets, or aligned along less dominant PCA components). This is an important distinction from the currently available techniques in both environmental chemistry and statistical methods: currently, no technique exists that can discover and disentangle the signature of highly toxic yet unknown contaminants, which chemists do not look for and which peak-agnostic multivariate chemometric analysis fails to detect. There is, therefore, a compelling need to bridge the gap between purely target-driven methods, as pursued by chemists, that provide in-depth knowledge of a few well-known compounds, and purely statistical methods, which are compound-agnostic.

The GC/MS/MS MRM raw signal interpretation routinely excludes many of the non-targeted analytes found in the samples thus eliminating key connections between non-target analytes and target PCBs in an environmental sample. In this work, non-target analytes are defined as chemicals that have gone through the GC/MS/MS and appear as peaks in the chromatograms in more than 50% of the samples but are not in the calibration solutions. A more comprehensive data interpretation can be achieved without such filtering; however, these non-target analytes are traditionally ignored to improve detection and identification of target PCBs within each sample.

Current state-of-the-art analysis of PCBs using GC/MS/MS focuses solely on target PCBs and ignores other potential co-indicators of PCB sources. Many algorithms have been proposed to aid in modeling GC/MS/MS signals and in calculating the contribution of sources. Listed next are approaches from the current art that are well known and have offered important advancements in the calculation of PCBs and sources. Discussion of how the proposed computational techniques complement and potentially enhance the current art in raw signal interpretation can be found throughout the manuscript.

1). MODELING GC/MS/MS

Recent computational techniques proposed include algorithms that model raw gas chromatographic signals, e.g. PARAFAC and PARAFAC2 [7], [8], [14], [51]. PARAFAC uses an N-way PCA decomposition method, assuming low-rank N-linearity, which breaks down the array into sets of scores and loadings that are mainly unique estimates of the underlying peaks in the data. Further, using an alternating least squares algorithm, the model fits the curve of the analyte in the total ion chromatogram (TIC) data set [14], [51]. While PARAFAC can locate peaks in a sample, it struggles to detect the retention time shifts of the peaks from sample to sample and can overlook analytes that are small peaks in the data [7], [51]. PARAFAC2, however, does not require low-rank linearity and can allow deviation in the data. This decomposition similarly breaks down the array into a set of scores and loadings which are unique estimates of the underlying data and uses similar procedures as PARAFAC [7], [8], [51]. Although PARAFAC2 fixes the retention time shift problem by using a time loading matrix for each sample and a one-component model, it is still challenged in determining the number of components required for the peak, identifying PCBs, and finding other recurring chemicals within the chromatographic data sets [8], [51]. The motivation in this work is to complement such modeling approaches and employ robust signal processing techniques that glean as much peak information as possible against the ambient noise in the raw signal. This enables robust joint and compound-cognizant interpretation of target and non-target peaks from the raw signal.

2). IDENTIFYING SOURCES

Analysis to identify sources can be done using linear regression models [9]–[13] or positive matrix factorization (PMF) [10]–[13], [52]–[55]. These algorithms attempt to solve for the percentages of the various sources through linear equations. Although these algorithms pose unique solutions to the problem, they are limited in their estimations. Linear regression models are limited to linear relationships and are sensitive to outlier data when calculating mixture percentages, and PMF requires source weights that can influence the resulting percentages.

D. BACKGROUND MOTIVATION

PCBs are frequently detected in different environmental compartments such as air, water and sediment, and even in human serum, and provide a well-established basis to compare different samples. The interpretation of target PCBs and sources can be greatly enhanced by incorporating non-target analytes found within the GC/MS/MS topography. While recent chromatography interpretations of GC/MS/MS data sets have proposed various methods to locate target PCBs, approaches to finding non-target analytes are rare. Moreover, contributions of various analytical techniques to target PCBs have partial limitations and would benefit from deeper analysis.

E. KEY CONTRIBUTIONS

The scope of this work is to automate and enhance target-centric raw signal processing such that the end result is a compound-cognizant graph-based peak profile. The value of the work lies in automating the process to avoid human confirmation bias in peak selection, while also allowing human interpretation using the peak-cognizant graph visualization as well as the statistical clustering analysis presented in this work. This approach connects peak-specific interpretation, as is commonly done in traditional target analysis and related peak-mapping efforts [42]–[47], to purely chemometric interpretation [48]–[50]. Furthermore, as noted in Section V-G and the related discussion of Table 1, the technique is capable of isolating non-target peaks that may coelute or elute in close proximity to target peaks. When these techniques are applied to target-centric raw signals, such as those presented here, the contribution of non-targets can be isolated and quantified. Such non-target identification can detect chemical threats from toxic contaminants, e.g. those introduced into the environment by hostile agents or as byproducts of unknown sources, which would otherwise remain undiagnosed in routine target and regulatory analysis. This is particularly applicable to toxins that are similar in chemical composition and retention time to known targets, and hence will be captured in the raw GC-MS signal, but only as non-targets which may coelute or elute in close proximity.

TABLE 1.

Cross-comparison between target PCB retention times found autonomously through the described technique and expert-validated target PCB retention times provided from previous work [29] as ground truth. Full definitions of the terms are provided in Section V-A.

Results               Manual positive   Manual negative
Autonomous positive   23411             2519
Autonomous negative   454               16

More specifically, the aim of this work is to propose and test novel combinations of techniques: peak fitting of PCB chromatograms; principal component analysis (PCA) with both k-means and fuzzy c-means clustering; L2 minimization to analyze sources and mixtures of contaminants; and a more topological examination of the signals to develop deeper analysis of the data. An important distinction between the technique presented here and other methods is the potential to discover hidden peaks within the samples by automating (peak-cognizant) detection and interpretation while preserving the identity of target peaks within the signal. In this case, the peaks within the MRM data sets are fitted and target peaks are autonomously identified based on the calibrations and their retention times. Once this is done, hidden peaks are identified from the remaining peaks within the MRM or TIC signals. The objective is to extend the scope of target PCB analysis to include detection of non-target peaks within the various samples and to better identify sources of PCBs at the sample locations. While target peaks dominate the GC/MS/MS signal, the unutilized contribution of non-target peaks can also be employed to distinguish related samples. Employing these techniques and methods will significantly enhance the already well-established knowledge of PCBs and sources through deeper analysis of GC/MS/MS data. Further, these techniques can relate compound-cognizant target analysis with compound-agnostic and purely statistical approaches to create an in-depth dictionary of underlying information hidden within the signals.

II. DESCRIPTION OF DATA

A. DATA

The data set originated from 150 air samples collected with active high-volume air samplers (Hi-Vols) deployed across the Chicago metropolitan area from 2007 to 2009 [29]. The data for this paper were generated by instrument analysis as follows: sample extracts were analyzed using a GC/MS/MS (Agilent 7000 Triple Quad with Agilent 7890A GC and Agilent 7693 autosampler equipped with a Supelco SPB-Octyl capillary column) in multiple-reaction monitoring (MRM) mode [18], [20]. Analytical quality control included surrogate standard recoveries, replicates, laboratory and field blanks, and standard reference material. The MRMs produce twelve chromatograms for each sample, representing the chromatographic signal for different mass transition ions (10 transitions for unlabeled PCBs and 2 for mass-labeled PCB standards). Further, one total ion chromatogram (TIC) is obtained for each sample, representing the combined MRM signals. The same MRM and TIC were obtained for the calibration solutions containing 209 PCB congeners. Theoretically, each peak within the TIC and MRM signal corresponds to a PCB congener or, to a lesser extent, to a non-target chemical found in the sample. Peak heights (relative intensities) were used as the total amount of the PCB congener or chemical found. The heights of the peaks in the calibration solutions were used for calculating the mass of each congener through computation of the relative response factor (RRF) [29].

B. SOFTWARE

The algorithms and analysis in this work were developed with MATLAB R2018a (The MathWorks, Inc., USA). The following toolboxes were installed with MATLAB R2018a to implement the algorithms: Fuzzy Logic Toolbox, Optimization Toolbox, and Statistics and Machine Learning Toolbox. The MRM and TIC chromatographic signals were first manually adjusted for appropriate and consistent baseline using Agilent’s MassHunter software (Version B.06.00, ©Agilent Technologies, Inc).

III. FITTING SIGNALS PROCEDURE

The total ion chromatogram (TIC) signal obtained for each sample represents the linear superposition of the B combined multiple reaction monitoring mode (MRM) signals. This is expressed as:

$$T[x] = \sum_{i=1}^{B} H_i[x] \tag{1}$$

Where T[x] is the TIC and Hi[x] is the ith MRM signal, both at the xth time instance. The MRM raw signal specifically isolates individual groups and therefore has higher-precision compound-specific information, though it does not convey the total contribution of all the chemicals captured in the TIC signal. Therefore, to autonomously capture the individual contributions of different contaminants within a Hi-Vol sample, it is imperative to analyze the available MRM raw signals for different compound groups. This enables derivation of the individual peak heights, corresponding to individual PCB congeners (target peaks), as well as significant non-target peaks that contribute towards the aggregated TIC signal. This section describes the procedure used to analyze the raw MRM signals, in the order in which the steps are applied.

A. SHIFTING MRM SAMPLE SIGNALS

Changes in the GC/MS/MS column temperature, dimension, or carrier gas linear velocity through the GC/MS/MS columns cause retention time shifts which can impact the analysis of identifying PCB congeners in the sample signals [41]. To alleviate this issue and aid in identifying PCB congeners, the signals are shifted to align the peaks in the MRM sample signals. The peaks are aligned by identifying the standards, the largest peaks in the signal, within the samples and adjusting the retention times to the standards in the calibration samples. The adjusted TIC signal, T[x], is expressed as:

$$T[x] = \sum_{i=1}^{B} H_i[x - f] \tag{2}$$

Where f is the difference in retention time from each MRM’s respective standard to the calibration’s standard.
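The alignment and superposition in (1) and (2) can be illustrated with a minimal NumPy sketch. It assumes that the standard is the largest peak in each MRM trace (as stated above), that the shift f is an integer number of samples, and that samples shifted in from outside the record are zero. The function names `shift_signal` and `aligned_tic` and the synthetic Gaussian traces are hypothetical and are not taken from the original MATLAB implementation.

```python
import numpy as np

def shift_signal(h, f):
    """Return h[x - f]: delay h by f samples (f may be negative), zero-padded."""
    out = np.zeros_like(h)
    if f > 0:
        out[f:] = h[:-f]
    elif f < 0:
        out[:f] = h[-f:]
    else:
        out[:] = h
    return out

def aligned_tic(mrm_signals, calib_idx):
    """Align each MRM trace on its largest peak, then sum the traces as in Eq. (2)."""
    aligned = []
    for h, c in zip(mrm_signals, calib_idx):
        f = c - int(np.argmax(h))  # offset from the sample standard to the calibration standard
        aligned.append(shift_signal(h, f))
    return np.sum(aligned, axis=0)

# Synthetic example: two traces whose standards sit at indices 48 and 53,
# while the calibration standard sits at index 50 for both.
x = np.arange(100)
h1 = np.exp(-0.5 * ((x - 48) / 2.0) ** 2)
h2 = np.exp(-0.5 * ((x - 53) / 2.0) ** 2)
tic = aligned_tic([h1, h2], [50, 50])
```

After alignment both standards coincide at the calibration index, so their heights add in the reconstructed TIC.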

B. DETERMINING PEAK MAXIMA AND LOCAL MINIMA

To remove the noise from each MRM sample signal, the maximum signal value, smax, is determined in each of the twelve MRM chromatograms. The background noise floor is selected as τ·smax, and any part of the raw signal with amplitude s ≥ τ·smax is used for peak detection (detailed in Fig. 1). The background noise threshold is selected empirically as τ = 5 × 10^−5. The threshold is chosen by examining the previously analyzed MRM data and determining an average baseline value; this corresponds to a maximum signal-to-noise ratio (SNR) of ∼43 dB for the maximum value of the raw signal, i.e., treating smax as the signal amplitude. This choice of maximum SNR allows robust detection of any peaks, target or non-target, that fall within 43 dB of the highest peak within the raw signal. After the noise removal step, the pseudo-code given in Fig. 1 Step 1 is used to determine PCB peaks, where x is the index corresponding to the retention time on the MRM. A peak spread threshold of 20 indices is used to ensure determination of a distinct peak.

FIGURE 1.

Pseudo-code for peak detection with the raw signal.
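Because the pseudo-code of Fig. 1 is not reproduced in the text, the following NumPy sketch is only one plausible reading of the thresholding and peak-maximum step. It assumes that a peak is a local maximum at or above the noise floor τ·smax that is also the largest value within ±20 indices. The function name `detect_peaks` and the synthetic signal are hypothetical. The last line checks the quoted SNR figure, 10·log10(1/τ) ≈ 43 dB.

```python
import numpy as np

def detect_peaks(s, tau=5e-5, spread=20):
    """Indices of local maxima with s >= tau*max(s) that dominate +/- spread samples."""
    s = np.asarray(s, dtype=float)
    floor = tau * s.max()  # background noise floor
    peaks = []
    for x in range(1, len(s) - 1):
        if s[x] < floor:
            continue  # below the noise floor: ignored
        lo, hi = max(0, x - spread), min(len(s), x + spread + 1)
        is_local_max = s[x] > s[x - 1] and s[x] >= s[x + 1]
        if is_local_max and s[x] == s[lo:hi].max():
            peaks.append(x)
    return peaks

# Synthetic trace: a dominant peak and a second peak 1000 times smaller.
x = np.arange(300)
s = 1000.0 * np.exp(-0.5 * ((x - 80) / 5.0) ** 2) + np.exp(-0.5 * ((x - 200) / 5.0) ** 2)
peaks = detect_peaks(s)

# SNR implied by the threshold, treating s_max as the signal amplitude.
snr_db = 10.0 * np.log10(1.0 / 5e-5)
```

The small peak is 30 dB below the dominant one, well inside the 43 dB window, so both are retained.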

C. APPLYING A CURVE FIT FOR PEAKS

To obtain the minima, local minimum heights were identified from the original MRM sample signals for the twelve different compound groups [29] by finding the smallest relative intensity in the signal relative to nearby points, as shown in Fig. 1 Step 2. A threshold of 70 indices is implemented to minimize the capture of local minima within larger peaks.

To eliminate further noise in the peaks and evaluate the contribution of each peak to the signal, H[x], a best-fit cosine curve is applied to the signal peaks (example shown in Fig. 2). A cosine model is used since it provided a better goodness of fit than other popular peak shapes such as Gaussian models. To calculate the portion of each peak, the maxima and minima are first determined using the pseudo-code and specific steps detailed in Fig. 1. There are two cases when modeling peaks within the raw signal, described below in detail.

FIGURE 2.

Example of cosine peak used for modeling peaks. α adjusts the height of the cosine peak, β varies the width of the left side of the peak and ζ varies the width of the right side of the peak.

1). Case 1: Single Peak, Non-Coeluted Peaks

In this scenario, a peak is identified at xmax between two minima at xmin1 and xmin2, where xmin1 < xmax < xmin2, using the pseudo-code depicted in Fig. 1. Because peaks within the signals can be asymmetrical, each half of the peak is fitted separately. The signal between xmin1 and xmin2 is defined as s[x]. Half of the peak, rβ[x], is fitted as:

$$r_\beta[x] = \alpha\left(\frac{\cos(\beta\theta) + 1}{2}\right) \tag{3}$$

Where θ = −π, ..., 0, α varies the height of the peak and is initialized to s[xmax], and β varies the width of the half peak and is initialized to the distance between s[xmin1] and s[xmax]. The other half of the peak, rζ[x], is fitted using:

$$r_\zeta[x] = \alpha\left(\frac{\cos(\zeta\theta) + 1}{2}\right) \tag{4}$$

Where θ = 0,...,π, α is the same as (3), and ζ determines the width of the half peak and is initialized to the distance between s[xmax] and s[xmin2]. The modeled peak, R[x], is written as:

$$R[x] = r_\beta[x] + r_\zeta[x] \tag{5}$$

Equation (5) is optimized by minimizing the objective function for the best fit curve, P[x], as:

$$P[x] = \sum_{i=1}^{n} \left\lVert s[x] - R_i[x] \right\rVert_2^2 \tag{6}$$
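A minimal sketch of the single-peak fit is given below. It departs from the parameterization in (3) and (4) in one labeled way: instead of scaling θ by β and ζ, each half of the raised cosine is described directly by a half-width in indices (`w_left`, `w_right`), which is an assumption made for readability. For a fixed pair of half-widths the height α has a closed-form L2 optimum, so the sketch grid-searches the widths rather than using an iterative optimizer. The function names are hypothetical.

```python
import numpy as np

def cosine_peak(x, x_max, alpha, w_left, w_right):
    """Asymmetric raised-cosine peak: height alpha at x_max, zero beyond each half-width."""
    x = np.asarray(x, dtype=float)
    w = np.where(x < x_max, w_left, w_right)   # pick the half-width for each side
    u = np.clip((x - x_max) / w, -1.0, 1.0)    # normalized distance from the apex
    return alpha * (np.cos(np.pi * u) + 1.0) / 2.0

def fit_single_peak(x, s, x_max, widths):
    """Grid-search both half-widths; for each pair, alpha is the closed-form L2 optimum."""
    best = None
    for wl in widths:
        for wr in widths:
            shape = cosine_peak(x, x_max, 1.0, wl, wr)
            alpha = float(shape @ s) / float(shape @ shape)
            err = float(np.sum((s - alpha * shape) ** 2))  # squared L2 error, as in Eq. (6)
            if best is None or err < best[0]:
                best = (err, alpha, wl, wr)
    return best

# Synthetic asymmetric peak with known parameters.
x = np.arange(61)
s = cosine_peak(x, 25, 3.0, 10, 18)
err, alpha, w_left, w_right = fit_single_peak(x, s, 25, np.arange(5, 26))
```

On noise-free synthetic data the search recovers the generating height and half-widths; on measured data the residual error would indicate the goodness of fit.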

2). Case 2: Multiple Peaks: Coeluted Peaks

When there are multiple peaks around the same retention time identified using the pseudo-code depicted in Fig. 1, it is possible to separate them using a similar technique as before. Identifying ϱ peaks between two minima indicates a coelution of peaks within the signal. To determine the contributions of each peak to s[x], each peak is fitted using (3)-(5). There are three cases in which the peak fitting is initialized:

  1. Peak lies between a local minimum and another peak, i.e. xmin1 < xmax1 < xmax2, the β in (3) is initialized as described in case 1 and ζ in (4) as the distance between xmax1 and xmax2.

  2. Peak lies between a peak and a local minimum, i.e. xmax2 < xmax3 < xmin2, the ζ in (4) is initialized as described in case 1 and β in (3) as the distance between xmax2 and xmax3.

  3. Peak between two peaks, i.e. xmax1 < xmax2 < xmax3, the β is initialized as the distance between xmax1 and xmax2 and ζ as the distance between xmax2 and xmax3 in (3) and (4) respectively.

To optimize the fitted curve, P[x], and contribution of each peak to s[x], use the following:

$$P[x] = \sum_{i=1}^{n} \left\lVert s_i[x] - \sum_{j=1}^{\varrho} R_{ij}[x] \right\rVert_2^2 \tag{7}$$

where j denotes the jth fitted peak in the raw signal. Equation (7) thus optimizes the overall fit of each of the peaks to the coeluted signal, s[x], and determines each peak’s contribution.
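The joint fit in (7) can be sketched in a simplified form. The sketch assumes that the peak positions and half-widths are held fixed at their initial values, so that only the heights remain free; the problem is then linear and is solved with `numpy.linalg.lstsq`. The full procedure described above also optimizes the widths, which this sketch does not attempt. The names `cosine_shape` and `fit_coeluted_heights` and the half-width parameterization are hypothetical.

```python
import numpy as np

def cosine_shape(x, x_max, w_left, w_right):
    """Unit-height asymmetric raised-cosine peak centered at x_max."""
    x = np.asarray(x, dtype=float)
    w = np.where(x < x_max, w_left, w_right)
    u = np.clip((x - x_max) / w, -1.0, 1.0)
    return (np.cos(np.pi * u) + 1.0) / 2.0

def fit_coeluted_heights(x, s, peaks):
    """peaks: list of (x_max, w_left, w_right). Returns one least-squares height per peak."""
    basis = np.column_stack([cosine_shape(x, *p) for p in peaks])
    heights, *_ = np.linalg.lstsq(basis, s, rcond=None)
    return heights

# Two overlapping peaks with apexes 12 indices apart.
x = np.arange(80)
peaks = [(30, 12, 10), (42, 10, 14)]
s = 5.0 * cosine_shape(x, *peaks[0]) + 2.0 * cosine_shape(x, *peaks[1])
heights = fit_coeluted_heights(x, s, peaks)
```

Each recovered height is the contribution of one peak to the coeluted signal s[x].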

D. DISCOVERING COELUTED PEAKS

Because of the properties of the chromatographic column used in the GC/MS/MS, some PCBs that pass through the GC/MS/MS will reach the detector at the same retention time and create further coelution of peaks. To better identify these coeluted peaks, the N fitted curves, P[x], detailed in Section III-C are subtracted from the original raw signal, H[x], to obtain a new signal, A[x], containing hidden peaks. This is done using:

$$A[x] = H[x] - \sum_{j=1}^{N} P_j[x] \tag{8}$$

where j denotes the jth fitted curve in the raw signal. If there are hidden non-target peaks, they can be identified in the new signal A[x]. Sections III-A through III-D are repeated using A[x] as the new raw signal to determine whether any peaks remained unseparated (coeluted) within the signal. The procedure is shown in Fig. 3. To demonstrate how well the model fits, the twelve modeled MRM signals of one sample were summed to produce the TIC sample signal shown in Fig. 4.
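The residual step in (8) amounts to a subtraction followed by a second pass of peak detection. The sketch below illustrates the subtraction only. Clipping the residual at zero is an added assumption (it prevents slightly over-fitted regions from producing negative intensities) and is not stated in the text. The function name and the Gaussian test peaks are hypothetical.

```python
import numpy as np

def residual_signal(h, fitted_peaks):
    """A[x] = H[x] - sum_j P_j[x], clipped at zero (the clipping is an assumption)."""
    a = np.asarray(h, dtype=float) - np.sum(fitted_peaks, axis=0)
    return np.clip(a, 0.0, None)

# A dominant peak at index 50 with a small peak at index 58 riding on its tail.
x = np.arange(120)
big = 10.0 * np.exp(-0.5 * ((x - 50) / 4.0) ** 2)
small = 1.0 * np.exp(-0.5 * ((x - 58) / 3.0) ** 2)
h = big + small

# Suppose only the dominant peak was fitted in the first pass.
a = residual_signal(h, [big])
hidden_idx = int(np.argmax(a))
```

Once the dominant fitted curve is removed, the apex of the smaller peak becomes the largest feature of A[x] and can be picked up by repeating the detection and fitting steps.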

FIGURE 3.

Procedure for fitting peaks within the signal. Once all the peaks have been fitted, the fitted curves are subtracted from the raw signal and the resulting peaks in the new signal are fitted (denoted in the green box). Once all the peaks and coeluted peaks have been determined, target peaks are determined from the calibration signals. Once the target peaks have been identified, non-target peaks that occur in more than 50% of the sample signals are found.

FIGURE 4.

Fit of the TIC data set. The blue line is the TIC signal T[x] and the red line is the linear summation of the fitted cosine peaks in each MRM sample.

IV. AUTOMATIC DETECTION AND INTERPRETATION OF RAW INSTRUMENT SIGNAL

The peak model presented in Section III is used to autonomously detect hundreds of target and non-target peaks from the raw instrument signal and to interpret their associations using a combination of statistical and geometric clustering techniques.

A. DETERMINING PCB CONGENER

A PCB was identified in the MRM sample if the peak’s retention time was closest to a PCB’s peak retention time in the same MRM calibration solution’s chromatogram (within a 0.07-minute range). After a target PCB was determined, the PCB’s modeled peak was stored in the constructed TIC signal. For each target PCB found in each of the MRM sample signals, the fitted peak was shifted back to its original position to match the original MRM sample signal.
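The matching rule can be sketched as a nearest-neighbor search over calibration retention times. The sketch reads "within a 0.07-minute range" as an absolute difference of at most 0.07 minutes, which is an assumption. The function name, the retention times, and the congener labels are hypothetical illustration values.

```python
import numpy as np

def match_targets(sample_rt, calib_rt, calib_names, tol=0.07):
    """Map each sample peak to the nearest calibration congener, or None if beyond tol minutes."""
    calib_rt = np.asarray(calib_rt, dtype=float)
    labels = []
    for rt in sample_rt:
        j = int(np.argmin(np.abs(calib_rt - rt)))  # closest calibration peak
        labels.append(calib_names[j] if abs(calib_rt[j] - rt) <= tol else None)
    return labels

# Hypothetical retention times (minutes) within one MRM transition.
calib_rt = [12.40, 13.10, 15.85]
calib_names = ["congener A", "congener B", "congener C"]
labels = match_targets([12.43, 13.30, 15.80], calib_rt, calib_names)
```

Peaks that return `None` are not assigned to any target PCB and remain candidates for the non-target analysis.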

B. DETERMINING NON-TARGET ANALYTES

Using the peak model from Sections III-A through III-D, non-target analytes in the MRM and TIC data sets are then analyzed. After the target PCBs are determined, a new signal is reconstructed for the MRM data sets using only these target PCBs. This signal is subtracted from the original data sets, theoretically uncovering peaks that are not identified as PCBs. After applying this approach to all the data sets, the newly found peaks are compared to each other to determine non-target peaks. The non-target peaks are determined by first finding the difference in retention time between each uncovered peak and the peak of the closest target PCB. Then, this difference is used to search all 150 samples for similar peaks. If a peak is present in more than 50% of the samples, it is labeled as a potential non-target analyte within the data. This method is applied to both MRM and TIC data sets.
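The recurrence test can be sketched with the standard library alone. The text does not state how two retention-time offsets are judged to be the same; the sketch assumes that offsets are binned with a tolerance of 0.07 minutes, which is a hypothetical choice borrowed from the target-matching window. The function name and the example data are also hypothetical.

```python
from collections import Counter

def recurring_nontargets(samples, tol=0.07, min_fraction=0.5):
    """samples: one list per sample of (nearest_target, rt_offset_minutes) per uncovered peak.
    Returns the (target, binned offset) keys present in more than min_fraction of samples."""
    counts = Counter()
    for peaks in samples:
        # A set, so each key is counted at most once per sample.
        keys = {(target, round(offset / tol)) for target, offset in peaks}
        counts.update(keys)
    return sorted(k for k, c in counts.items() if c > min_fraction * len(samples))

# Four hypothetical samples: a peak about 0.2 min after "congener A" recurs in three of them.
samples = [
    [("congener A", 0.20)],
    [("congener A", 0.21), ("congener B", -0.50)],
    [("congener A", 0.19)],
    [],
]
nontargets = recurring_nontargets(samples)
```

Only the recurring offset survives the 50% rule; the peak seen in a single sample is discarded.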

C. PRINCIPAL COMPONENT ANALYSIS (PCA)

After determining the PCB peak height values found in each sample and normalizing using the internal standard heights of both the calibration solutions and samples, principal component analysis (PCA) is implemented to find relationships within the peak heights of the data. In this case, PCA is particularly useful to determine correlations from similar instances of peaks within the samples. The data matrix V, comprising all the identified target peak heights for each sample, can be described by the product of the scores matrix G and the transpose, T, of the loading matrix L, with an added residual matrix E.

$$V = G L^{T} + E \tag{9}$$

Equation (9) produces the principal components which are a linear combination of the original variables. The principal components are ordered according to the amount of variance explained in V, i.e. principal component 1 represents the dominant variation while principal component 2 represents the second most. For this work, the scores matrix G is used for clustering and is plotted in the first two or three principal components to visualize the associations within the data.
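The decomposition in (9) can be sketched with a singular value decomposition. The sketch assumes that V is arranged as samples by variables and is mean-centered column-wise before decomposition; neither the centering nor the function name `pca_scores` is stated in the text. The random matrix is placeholder data, not PCB measurements.

```python
import numpy as np

def pca_scores(V, n_components=2):
    """Mean-center V (samples x variables); return scores G, loadings L, explained-variance ratios."""
    Vc = V - V.mean(axis=0)
    U, S, Wt = np.linalg.svd(Vc, full_matrices=False)
    G = U[:, :n_components] * S[:n_components]   # scores, ordered by explained variance
    L = Wt[:n_components].T                      # loadings (one column per component)
    explained = (S ** 2) / np.sum(S ** 2)
    return G, L, explained[:n_components]

# Placeholder data: 6 samples, 3 variables.
rng = np.random.default_rng(0)
V = rng.random((6, 3))
G, L, ratio = pca_scores(V, n_components=3)
```

With all components retained the residual E vanishes and G·Lᵀ reproduces the centered data; keeping only the first two or three columns of G gives the coordinates used for plotting and clustering.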

D. K-MEANS CLUSTERING

The k-means algorithm is an unsupervised machine learning algorithm that partitions a data set within an N-dimensional space into k clusters. These clusters are created by using k centroids to assign the data to each cluster. The choice of k was determined based on empirical observations and the elbow method (see Appendix B). The k-means objective function that is minimized is shown in (10).

$$J_M = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\lVert g_i - c_j \right\rVert^2 \tag{10}$$

Where k is the number of clusters, n is the number of data-points, gi is the ith data-point in the scores matrix G, and cj is the centroid of cluster j. Each data-point is assigned to a cluster using:

$$c^{(i)} = \arg\min_{j} \left\lVert g_i - c_j \right\rVert^2 \tag{11}$$

The jth centroid location, cj, is updated using:

$$c_j := \frac{\sum_{i=1}^{n} \delta\left(c^{(i)} = j\right) g_i}{\sum_{i=1}^{n} \delta\left(c^{(i)} = j\right)} \tag{12}$$

Where δ = 1 if gi belongs to the jth cluster and 0 otherwise. Equation (12) is iterated until Equation (10) converges to a local or global minimum. Because of the high variability between runs of the k-means clustering, this technique is run multiple times to find an average centroid location to consistently cluster the data and determine associations (Fig. 5). From the resulting clusters, PCB congeners or sample locations are associated with each other in multiple principal components to find connections between the clustering of PCBs or locations and potential sources.
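The assignment and update steps in (11) and (12) can be sketched as a single k-means run. The sketch assumes that centroids are initialized from randomly chosen data-points and that an empty cluster keeps its previous centroid; both are common conventions rather than details given in the text. Averaging centroids over repeated runs, as described above, is not implemented here. The function name and the toy data are hypothetical.

```python
import numpy as np

def kmeans(G, k, n_iter=100, seed=0):
    """One k-means run on the rows of G; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = G[rng.choice(len(G), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(G[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                              # assignment step, Eq. (11)
        new = np.array([
            G[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])                                                     # centroid update, Eq. (12)
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Toy scores: two well-separated groups of three points.
G = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(G, k=2)
```

Cluster indices are arbitrary between runs, so any averaging of centroids across runs would first need to match clusters between runs.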

FIGURE 5.

Procedure of PCA, k-means and c-means clustering.

E. FUZZY C-MEANS CLUSTERING (FCM)

The fuzzy c-means (FCM) algorithm is used in conjunction with the k-means algorithm to provide validation to the clustering parameters chosen. The FCM objective function minimized is shown in (13).

$$J_M = \sum_{i=1}^{n} \sum_{j=1}^{k} \mu_{i,j}^{m} \left\lVert g_i - c_j \right\rVert^2 \tag{13}$$

Where n is the total number of data-points, k is the total number of clusters, μi,j is the calculated degree of membership of the ith data-point to the jth cluster, m is the fuzzy partition matrix exponent, gi is the ith data-point, and cj is the jth centroid location. The choice of the fuzzy partition matrix exponent is explained further in Appendix A. Similar to k-means, this is an iterative algorithm, and it starts by assigning random degrees of membership to the data-points. The jth centroid, cj, is calculated and updated with Equation (14).

$$c_j := \frac{\sum_{i=1}^{n} \mu_{i,j}^{m}\, g_i}{\sum_{i=1}^{n} \mu_{i,j}^{m}} \qquad (14)$$

Where the same variable definitions hold. After centroid location calculation the degree of membership matrix, μi,j, is calculated and updated using Equation (15).

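$$\mu_{i,j} = \frac{1}{\sum_{l=1}^{k} \left( \frac{\lVert g_i - c_j \rVert}{\lVert g_i - c_l \rVert} \right)^{\frac{2}{m-1}}} \qquad (15)$$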

The FCM is run iteratively until a maximum number of iterations has been reached or the objective function improves by less than a specified threshold, whichever occurs first. Due to the variability in resultant centroid locations, the FCM algorithm is performed multiple times and an overall average of the centroid locations is used for association determination (Fig. 5). After calculating the average centroid locations, the final degree of membership is calculated from Equation (15), and the maximum degree of membership is used to assign a cluster number to each data-point.
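The FCM updates of Equations (13)-(15) can be sketched as follows. This is a hedged illustration, not the code used in this work: the function name `fuzzy_c_means`, the optional `U0` argument for supplying initial memberships, the stopping tolerance, and the toy data are assumptions introduced here. Distances are floored at a small value to avoid division by zero when a point coincides with a centroid.

```python
import numpy as np

def fuzzy_c_means(G, k, m=1.2, max_iter=200, tol=1e-9, seed=0, U0=None):
    """FCM on G (n points x N dimensions); returns centroids C and memberships U."""
    rng = np.random.default_rng(seed)
    # Random initial memberships unless U0 is supplied; each row sums to 1.
    U = rng.random((len(G), k)) if U0 is None else np.asarray(U0, dtype=float)
    U = U / U.sum(axis=1, keepdims=True)
    previous = np.inf
    for _ in range(max_iter):
        W = U ** m
        # Eq. (14): membership-weighted centroid update.
        C = (W.T @ G) / W.sum(axis=0)[:, None]
        # Distance from every point to every centroid, floored to avoid 0/0.
        dist = np.maximum(
            np.linalg.norm(G[:, None, :] - C[None, :, :], axis=2), 1e-12)
        # Eq. (15): mu_ij = 1 / sum_l (dist_ij / dist_il) ** (2 / (m - 1)).
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)
        # Eq. (13): stop once the objective improves by less than tol.
        objective = float(((U ** m) * dist ** 2).sum())
        if abs(previous - objective) < tol:
            break
        previous = objective
    return C, U

# Toy data with two well-separated groups (illustrative values only).
G = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
# Fixed initial memberships so that the example is reproducible.
U0 = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.3, 0.7], [0.1, 0.9]]
C, U = fuzzy_c_means(G, k=2, m=1.2, U0=U0)
labels = U.argmax(axis=1)  # hard assignment from the maximum membership
```

With m = 1.2 the exponent in Equation (15) is 10, so memberships in this example are nearly hard and the hard labels follow the two visible groups.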

F. ACCURATELY DETERMINING SOURCES

Different geographical locations can have different sources of PCBs; thus, it is imperative to identify and localize these sources using different techniques. Previously, sources have been identified on other data sets using either linear regression models [9]–[13] or PMF [10]–[13], [52]–[55]. In this work, sources are identified by correcting the mass of each PCB congener using the calibration solutions, RRF, and surrogate standard recoveries for each sample, then finding the mass fraction of the PCB congeners in each sample. The percentages of Aroclor sources, γ, of PCBs are identified using an L2 minimization technique that considers PCB mass fractions and minimizes the error, ϵ. The percentages of Aroclor sources are calculated using:

$$\epsilon = \min_{\gamma} \left( (D\gamma - b)^{2} \right) \qquad (16)$$

Where D is the mass fraction contribution of PCBs in each Aroclor mixture (found in [21]), with each column of D representing one factor (source), γ is the vector of source contributions being calculated (restricted to γ ≥ 0), and b is the mass fraction of the PCBs found in a sample. When the non-negativity constraint is not active, the minimizer of Equation (16) can also be written in closed form as:

$$\gamma = (D^{T} D)^{-1} D^{T} b \qquad (17)$$

This technique is implemented with MATLAB’s fmincon function to estimate the source percentages. Although reference [21] provides feasible sources, non-Aroclor mixtures and other potential mixtures that it does not consider are also of interest to this study. Referring to references [17]–[20], new factors are added to D to calculate the percentages of each mixture. To ensure random mixtures are also being considered, a Monte Carlo approach is employed to calculate an average percentage of each mixture used.
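The constrained minimization of Equation (16) can be illustrated in Python. The sketch below is not the fmincon setup used in this work; it swaps in plain projected gradient descent, which handles the γ ≥ 0 restriction by clipping negative entries after each step. The function name, the iteration count, and the matrix `D` of mass fractions are hypothetical and chosen only so that the answer is known in advance.

```python
import numpy as np

def nonnegative_least_squares(D, b, n_iter=2000):
    """Projected gradient descent for min ||D @ gamma - b||^2 with gamma >= 0."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / largest eigenvalue of D^T D
    gamma = np.zeros(D.shape[1])
    for _ in range(n_iter):
        gradient = D.T @ (D @ gamma - b)
        gamma = np.maximum(gamma - step * gradient, 0.0)  # enforce gamma >= 0
    return gamma

# Hypothetical mass fractions: 4 congeners (rows) in 2 source mixtures (columns).
D = np.array([[0.5, 0.1], [0.3, 0.2], [0.1, 0.3], [0.1, 0.4]])
b = 0.7 * D[:, 0] + 0.3 * D[:, 1]  # synthetic sample built from known shares
gamma = nonnegative_least_squares(D, b)
percentages = 100.0 * gamma / gamma.sum()
# Unconstrained closed form of Eq. (17), for comparison.
gamma_closed = np.linalg.solve(D.T @ D, D.T @ b)
```

Because the synthetic sample is an exact non-negative mixture, the constrained and closed-form solutions agree here; they can differ when the unconstrained solution would contain negative entries.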

G. RELATIVE RETENTION TIME PROXIMITY

Relative retention time proximity is a useful criterion for associating peaks that elute closely and may therefore be useful for making associations within and beyond PCA clusters. A node-edge graph visualizes relationships between PCBs with respect to their retention times that are harder to determine by just examining the TIC or MRM signal. This relation can be useful in identifying closely eluted PCBs and identifying peaks as PCB targets. Fig. 10 is created using the average retention time location of each PCB congener across all samples. The relative proximity of each target PCB within the TIC signal is found by implementing a nearest neighbor technique. This step allows for a better understanding of the location of PCB targets within the TIC signal. These graphs are shown to demonstrate the co-occurrence associations between PCBs and not as knowledge graph informatics as in [56].
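A nearest neighbor linkage on average retention times can be sketched as below. The function name, the compound labels, and the retention times are hypothetical placeholders, and the sketch assumes at least two compounds; it only illustrates how each node obtains an edge to the compound that elutes closest to it.

```python
def nearest_neighbor_edges(retention_times):
    """Map each compound to the compound with the closest retention time."""
    ordered = sorted(retention_times, key=retention_times.get)
    edges = {}
    for idx, name in enumerate(ordered):
        # After sorting, the nearest neighbor is one of the adjacent entries.
        neighbors = ordered[max(idx - 1, 0):idx] + ordered[idx + 1:idx + 2]
        edges[name] = min(
            neighbors,
            key=lambda other: abs(retention_times[other] - retention_times[name]),
        )
    return edges

# Hypothetical average retention times in minutes (placeholder labels).
times = {"PCB_a": 10.00, "PCB_b": 10.05, "PCB_c": 12.40, "PCB_d": 12.55}
edges = nearest_neighbor_edges(times)
```

Coloring each node by its cluster then shows whether closely eluting compounds fall in the same cluster or in different ones.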

FIGURE 10.

Graph of detected target PCBs within the GC/MS/MS data sets. The nodes represent PCBs with colors corresponding to their cluster color, and the edges represent a connection to another PCB that has retention time closest to it (a). The retention times used are the average retention time found throughout each of the data sets. Found within (a) are two cases: 1) where the PCBs all have the same cluster color (b) and 2) where the PCBs are all closest to each other but are within different clusters (c).

V. RESULTS AND DISCUSSION

The fitted peaks described in detail throughout Section III were used for the clustering analysis performed throughout the results and discussion in Section V. The peak fitting procedure had an average normalized mean square error of −18.4851 dB relative to the signal.
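For readers less familiar with decibel-scaled errors, the sketch below shows one common definition of the normalized mean square error, namely the residual energy divided by the signal energy, expressed in dB. Whether this matches the exact normalization used for the value above is an assumption; the function name and the toy signal are illustrative only.

```python
import math

def nmse_db(signal, fitted):
    """Normalized mean square error in dB: 10*log10(sum (s - f)^2 / sum s^2)."""
    error = sum((s - f) ** 2 for s, f in zip(signal, fitted))
    energy = sum(s ** 2 for s in signal)
    return 10.0 * math.log10(error / energy)

# Toy peak and an imperfect fit (illustrative values only).
signal = [0.0, 1.0, 4.0, 1.0, 0.0]
fitted = [0.0, 1.1, 3.8, 0.9, 0.0]
value = nmse_db(signal, fitted)

# Converting a dB value back to a linear energy ratio.
ratio = 10.0 ** (-18.4851 / 10.0)
```

Under this definition, −18.4851 dB corresponds to a residual energy of roughly 1.4% of the signal energy.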

A. CROSS-COMPARISON BETWEEN AUTONOMOUSLY DETECTED TARGET PEAKS AND EXPERT-ANNOTATED TARGET PEAKS

This work aims to automate the time-consuming process of identifying target PCB peaks and mapping them to known target compounds. Therefore, it is imperative to provide quantitative comparisons between what our algorithm finds and expert-annotated ground truths validated over the same data. The ground truth for retention times for each of the 209 target PCBs considered in this work is based on existing calibration standards documented in [29]. Therefore, based on the documented retention times for each target peak in the standards, any identified peak can be mapped to a specific target PCB. Additionally, for each sample, ground truths are further established based on manually executed expert validation of whether or not a peak was visually identified at the stipulated retention time. Depending on the PCB composition of the sample, a particular peak may (or may not) be present in the raw signal, as all PCBs may not be present in detectable concentrations in every sample.

Table 1 compares the target PCBs identified with this algorithmic technique to what was manually identified in [29]. A raw signal peak identified manually or using our autonomous technique based on Equations (3)–(7) is labeled “positive” if a PCB is also identified within ±0.07 minutes of the listed retention time in the calibration standard, where 0.07 minutes is the sampling time period of the raw signal. Otherwise, we assign the label “negative” to a detected peak that cannot be mapped to a target PCB in the standard, either because the peak falls outside the ±0.07 minute range or because it is an undocumented non-target peak. Similarly, the labels “manual” and “autonomous” are respectively assigned for peaks detected by an expert, as detailed in [29], or autonomously identified using the signal processing techniques detailed in this work. Specifically, the four possible labels are described below:

  1. Manual negative: A PCB peak is identified in the standard but the manual inspection cannot find the peak in the sample;

  2. Manual positive: A PCB peak is identified in the standard and the manual inspection can find the peak in the sample;

  3. Autonomous negative: A target PCB peak is identified in the standard but the algorithm does not detect the peak in the raw signal for the sample;

  4. Autonomous positive: A target PCB peak is identified in the standard and the algorithm also detects the peak in the raw signal for the sample.

Each entry of Table 1 shows the number of raw signal peaks that meet the criteria for the corresponding row and column labels. For example, the entry “Manual positive/Autonomous positive” indicates that 23411 raw signal peaks were autonomously identified across the full dataset which also matched with the ground truth of manually validated target PCB peaks.
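The labeling and tabulation described above can be sketched as follows. The function names, target labels, and retention times are hypothetical and are not values from [29]; the sketch only shows how a ±0.07 minute window turns detected peak times into positive or negative labels per target and how the four manual/autonomous combinations are counted.

```python
from collections import Counter

def detected_targets(peak_times, standard_times, tolerance=0.07):
    """Return {target: True/False}: is any peak within the tolerance (minutes)?"""
    return {
        target: any(abs(t - t_ref) <= tolerance for t in peak_times)
        for target, t_ref in standard_times.items()
    }

def cross_tabulate(manual, autonomous):
    """Count (manual, autonomous) outcome pairs over all targets."""
    word = {True: "positive", False: "negative"}
    return Counter((word[manual[t]], word[autonomous[t]]) for t in manual)

# Hypothetical retention times in minutes.
standard = {"target_1": 10.00, "target_2": 12.40, "target_3": 15.20}
manual = detected_targets([10.02], standard)
autonomous = detected_targets([10.03, 15.18, 20.00], standard)
table = cross_tabulate(manual, autonomous)
```

In this toy case the peak at 20.00 minutes matches no listed target and is therefore not counted against any target in the table.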

The results of Table 1 can be summarized as follows: 98.1% of manually determined target peaks could also be detected as target PCB peaks by our autonomous method. On the other hand, 90.3% of autonomously detected target peaks, i.e., peaks that occurred within ±0.07 minutes of a listed target PCB in the calibration standard, could be matched with target peaks that were manually determined over the raw signal sample. Therefore, the proposed algorithm discovered 2519 peaks over the whole data set that corresponded to a target PCB based on the standard but were missed in manual inspection. We also observe that 16 raw signal peaks, listed as target PCBs in the standard sample, were found neither by manual inspection nor by the proposed autonomous method. Therefore, Table 1 provides validation that our method autonomously identifies 98.1% of the manually validated target peaks from the raw signal, while identifying extra target peaks missed by manual detection. Any peak-allocation error in the technique is attributed to residual baseline noise in the MRM data sets or to retention time shifts larger than can be accounted for by the raw signal sampling interval. These autonomously identified peaks, which are mapped into specific target PCBs, provide the basis for automated peak-cognizant interpretation that most autonomous chemometric studies cannot offer. Reproducing this rich peak-cognizant information manually, annotated as specific target PCBs and visualized based on their relative proximity as in Fig. 10, would be overwhelmingly expensive in expert personnel time and subject to human bias.
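As a consistency check, the 90.3% figure can be reproduced from the two counts quoted in the text, assuming that the 2519 peaks form the manual negative/autonomous positive entry of Table 1:

$$\frac{23411}{23411 + 2519} = \frac{23411}{25930} \approx 0.903$$

The 98.1% figure follows from the analogous ratio taken over manually positive peaks, using the manual positive/autonomous negative entry of Table 1 in the denominator.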

For the PCA visualization, the peak heights were determined for each PCB, normalized based on their surrogate standards, and standardized from sample to sample. The scree plots are shown in Fig. 6, and plots of the data in the principal components are shown in Fig. 7 and Fig. 8. To better visualize the clusters, the first two principal components are plotted in Fig. 7 and the first three principal components are plotted in Fig. 8.
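A minimal PCA sketch is given below to make the scores matrix and the scree-plot quantities concrete. It assumes that standardization means z-scoring each column, which may differ from the exact preprocessing used here; the function names and the small matrix `X` are illustrative, and columns with zero variance would need to be removed first.

```python
import numpy as np

def pca(X):
    """PCA of X (rows = observations) after z-scoring each column."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                        # scores matrix G
    explained = s ** 2 / np.sum(s ** 2)   # variance fraction per component
    return scores, Vt.T, explained

def components_needed(explained, threshold=0.95):
    """Smallest number of leading components whose variance reaches threshold."""
    cumulative = np.cumsum(explained)
    return int(min(np.searchsorted(cumulative, threshold) + 1, len(explained)))

# Two strongly correlated toy variables (illustrative values only).
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]])
scores, loadings, explained = pca(X)
n_keep = components_needed(explained)
```

The `explained` vector is what a scree plot displays, and `components_needed` mirrors the rule of keeping the components that account for roughly 95% of the variance.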

FIGURE 6.

Scree plots of target PCB (a) and sample location (b) data. (a) The majority of the variance explained is found within the first 3 principal components. (b) The majority of the variance explained is found within the first 5 principal components. For this work, only the principal components that contribute to ∼95% of the variance explained are considered in clustering.

FIGURE 7.

First 2 principal components of the target PCB data are plotted. (a) k-means clustering of target PCB data with 7 clusters. (b) FCM clustering of target PCB data with 7 clusters and fuzzy exponent, m, as 1.2. Both clustering techniques offer unique solutions to the data; however, FCM clusters the larger group of data into separate clusters, i.e., cluster 1 and cluster 2, and clusters two outlier points that are far apart together, i.e., cluster 7.

FIGURE 8.

First 3 principal components of the sample location data are plotted. (a) k-means clustering of sample location data with 5 clusters. (b) FCM clustering of sample location data with 5 clusters and fuzzy exponent, m, as 1.2. Both clustering techniques offer unique solutions to the data; however, FCM clusters two outlier points into the larger groups, i.e., cluster 5, and cluster 2 separates closely related data.

B. COMPARISON OF K-MEANS AND FCM CLUSTERING

To associate the data presented in this work and cluster either PCBs or sampling locations, both k-means and FCM clustering are examined. FCM is implemented with the fuzzy partition exponent, m, in Equation (13) set to 1.2 (discussed in Appendix A). The optimal number of centroids for PCB data and sampling locations is 7 and 5, respectively (discussed in Appendix B). Although FCM considers all points in the data for each centroid to make a soft decision, the technique is computationally slower. Further, the clusters created using FCM do not partition the points into distinct clusters, and FCM can have difficulty finding optimal associations. As seen in Fig. 7 and Fig. 8, FCM clusters the larger group of points differently and includes outlier points within the relatively closer groups of points as compared to k-means clustering. In Fig. 7 (a), the k-means clustering technique clusters the larger group of data together, while in Fig. 7 (b), FCM splits the larger group into separate clusters. Although this provides more separation of data, the FCM technique clusters outliers together with the larger group and can provide misleading associations. In Fig. 8 (a), the k-means technique clusters outlier data as their own clusters and clusters the larger group of data into 3 separate clusters. In Fig. 8 (b), the FCM technique clusters the larger group of data into separate clusters and associates outlier data within the larger groups, providing incorrect associations of the data. From the above analysis, k-means clustering is chosen and plotted in the results for the rest of this work.

C. PCA OF TARGET PCBS

This section focuses on the target PCBs and their contribution across the sample clusters. PCA and clustering of the PCBs across all the samples are plotted in Fig. 7 (a), where the first two principal components make up roughly 50% of the variance explained. This plot shows clusters of potential significance across the samples that could be related to sources. Applying this concept, major PCBs that contribute as outliers to overall PCB profiles are identified, such as PCB 3 and PCB 5, which form their own separate clusters. Further, when non-target analytes are identified, this method is used to make associations with Aroclors, aiding in fingerprinting the pollutants.

D. PCA OF GEOGRAPHICAL SAMPLING LOCATIONS

Clustering of geographical sampling locations is based on target PCBs and is plotted in Fig. 8 (a), where the first three principal components account for 45% of the variance explained. Further, the color of the cluster is plotted for each of the samples on a map to give a better topological view, as seen in Fig. 9. Although seasonal clustering was not a particular focus of this work, it was noticed that all the winter samples clustered together independent of location (Cluster 2), while all other seasons were scattered throughout the various clusters. With samples of the same season that were collected around the same time, this method can produce associations between various locations to identify sources based on the proximity of their geographical sampling locations.

FIGURE 9.

The Chicago land area map labeled with sample clusters. The cluster coloring comes from the PCA plot in Fig. 8 (a).

E. RETENTION TIME PROXIMITY

Cluster analysis identifies two distinct groups of target PCB retention time proximity shown in Fig. 10. Fig. 10 (b) shows a subgraph that only contains PCBs that cluster in the same group and Fig. 10 (c) shows PCBs that clustered in completely different groups. This provides two different ways of interpreting the data. In the former, it means that the general proximity of the PCBs in retention time may not be coeluting as much or that these PCBs together show up with the same relative intensity throughout all the samples. The latter case demonstrates that either the PCBs are coeluting and that some of the actual concentration may be in another PCB or that these PCBs appear close to one another and further analysis of peaks that show up around the larger PCB can be based on the other PCB. Although these are two different ways of analyzing this, it provides a better understanding of the locations of the PCBs in the chromatogram and identification of PCBs in the TIC data more topologically.

F. DETERMINING SOURCES

The percentage of Aroclor-only contribution to each location was determined using Equation (16) and is displayed in Fig. 11. Focusing on the two largest percentages of Aroclors in each location, Aroclors 1232, 1254, 1016, and 1221 were highly present (Fig. 11 (a)). Further, with added uncertainty to Equation (16), there is a large contribution of 3,3′-dichlorobiphenyl (PCB 11) and Aroclors 1232, 1254, 1016, and 1221, as shown in Fig. 11 (b). Aroclors 1254, 1016, and 1221 were produced and sold by Monsanto for use in products such as capacitors, adhesives, and rubbers [15]. These Aroclors can be primarily seen as products used in construction throughout Chicago; however, Aroclor 1232 was not expected to be an important source. Aroclor 1232 was sold in small quantities compared to other Aroclors and is only found in a few products such as hydraulic fluids or adhesives that could have similar contributions as Aroclors 1221 and 1254 [15], [16]. Environmental weathering by microbial dechlorination, environmental distillation, and atmospheric reaction may have changed the PCB mixtures present in Chicago air to resemble Aroclor 1232. In addition to detection of Aroclors, PCB 11 is present in each sample. PCB 11 is a current byproduct of pigment manufacturing and was not present in the Aroclor mixtures sold, and now banned, more than 40 years ago [17], [30]. It has a lower molecular weight and is more volatile than most of the PCBs present in Aroclors, which could explain its large relative abundance. Because paint is applied to surfaces in thin coats, PCB 11 likely volatilizes into the air more efficiently than heavier PCB congeners. This is also seen in [19].

FIGURE 11.

Two highest percentages of Aroclors in the geographical locations are counted excluding random other mixture. (a) Top 2 Aroclor only percentages are considered from each sample using target PCBs peak heights and Equation (9) to determine percentages (Appendix C). (b) Top 2 Aroclor only percentages are considered from each sample using target PCBs peak heights and Equation (9) with contributions of each PCB to each Aroclor considered from [8] to determine percentages (Appendix D). Further, different mixtures of PCB only and random mixtures of PCBs are considered. In both (a) and (b), Aroclor 1016 is one of the most present Aroclors across all sampling locations.

G. NON-TARGET PCB PEAKS

Although no non-target peaks of significant size were identified within the MRM data, there were hidden non-target peaks that coeluted with target PCBs in the TIC data set. Fig. 12 shows representative non-target peaks that appeared in the TIC signal after subtracting the modeled TIC signal, with color coding corresponding to cluster colors in Fig. 13. The non-target peaks thus isolated are significant from a data analysis perspective. This is because the TIC signal is generated from archived samples that are traditionally filtered to screen out chemicals that are not PCBs. Such target-selective chemical filtering is a standard laboratory protocol in a majority of environmental studies, and most public-domain data archives assume such target-selective filtering has been successfully performed. However, based on the raw TIC signal analysis from such a representative data archive, it appears that some non-target chemicals still pass through as undetected contaminants. These contaminants, while potentially closely associated with target analytes, coelute with the target compounds as hidden TIC peaks, and normally would not be detected or accounted for in any target-based or statistical analysis. Therefore, while the motivation for raw signal analysis is to find target and non-target compounds, from a purely traditional target-oriented perspective, such hidden non-target contaminants discovered in the TIC signal are significant for two reasons: (i) to test using GC/MS/MS raw signal analysis whether the laboratory protocols can indeed screen out most non-target analytes in the samples as typically desired in target-oriented studies, and (ii) to validate whether computational techniques, such as those proposed here, can indeed discover hidden non-target analytes that coelute with target PCBs.

FIGURE 12.

Plot of one of the sample TIC data sets after the subtraction of target PCB peaks. A few of the potentially found non-target peaks within the TIC data set are pictured and are indicated by the stars at the location they were found. The coloring is based on PCA and k-means clustering.

FIGURE 13.

(a) First 3 principal components plotted for target PCBs and non-target analytes found within the TIC. Clustering analysis performed with k-means clustering with 6 clusters. (b) Scree plot of target PCBs and non-target analytes found in TIC signal data. Over 50% of the variance explained is explained within the first 3 principal components.

H. PCA OF TARGET PCBS AND NON-TARGET ANALYTES

For this section, both the target PCBs identified before and the non-target analytes identified in the TIC data set were used. Examples of a few non-target analytes identified can be seen in Fig. 12. PCA and clustering of PCBs and non-target analytes are plotted in Fig. 13. The first three principal components make up roughly 60% of the variance explained. This plot shows clustering of target PCBs and non-target analytes across the samples that could impact source discoveries. Although a large number of the non-target analytes clustered as outliers, some were clustered with target PCBs such as PCB 5 and PCB 1. This is a significant finding, as these non-targets, which closely associate with target PCBs, would otherwise never be included in traditional contaminant studies. Although this technique was implemented for PCBs and non-target analytes, the same approach can be used for specific sample locations. The idea behind this technique is to find non-target analytes and target PCBs that associate together to better identify sources of PCBs in the sample locations. This approach was only implemented for non-target analytes in the TIC data set because there were more significant peaks in the TIC than in the MRM data sets.

VI. CONCLUSION

This work proposes a novel combination of various computational techniques to automate peak-cognizant detection and interpretation of PCBs found within GC/MS/MS data sets. Specifically, peak modeling and L2 error minimization techniques are employed to autonomously detect target and previously undetected non-target peaks from the raw instrument signal. Then, a combination of PCA and k-means clustering techniques are employed to isolate groups of PCB congeners that are potentially associated with each other to demonstrate how they manifest in the environment. Individual contributions of Aroclors across a diverse portfolio of Chicago air samples are isolated. Utilizing these techniques and concepts can aid in discovering and interpreting all the information inherent within the GC/MS/MS signal. This type of comprehensive and quantitative analysis is valuable to environmental science in two significant ways:

  1. By design, these developed techniques are not biased towards target contaminants, which are typically employed in traditional GC/MS/MS interpretation, and therefore, can be used to discover unknown contaminants that might prove critical to air pollution studies.

  2. These novel techniques are peak-based, and therefore compound-cognizant, unlike purely statistical chemometric methods [7], [8]. This approach allows interpretation of large-scale statistical results based on PCA and k-means clustering at the level of individual compounds. Therefore, the methods can bridge the gap between compound-agnostic statistical interpretation and compound-specific (target-based) studies [7]–[13], [17]–[21], [29], [52]–[55].

In summary, the major science return of these techniques is connecting target compounds (known PCB congeners) with potentially significant but previously unknown non-target compounds and allowing comprehensive automated (peak-cognizant) analysis of raw GC-MS (and combinations thereof) signals across large data repositories. While this work reports the findings across 150 active air samples, this technique has the potential to be applied across much larger scales of data and repositories.

FIGURE 16.

Percentage of each Aroclor contributing to the PCB values found for each location. Only Aroclors are considered in [8].

ACKNOWLEDGMENT

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This research was partially funded by the University of Iowa Center for Health Effects of Environmental Contamination (CHEEC), University of Iowa and Iowa Superfund Research Program (NIEHS P42 ES 013661), and the National Science Foundation (1808463).

Biography

RYAN A. MCCARTHY earned a dual B.A. degree in Engineering Physics and Scandinavian Studies from Augustana College, Rock Island, IL in 2017. He is currently pursuing a Ph.D. in Electrical and Computer Science Engineering at the University of Iowa, Iowa City, IA. From 2014 to 2015 he worked in the Research and Development department at Miner Enterprises Inc. In 2016 he was an Engineer at Crawford Company. Since 2017, he has been a research assistant with the University of Iowa Electrical and Computer Science Engineering department. His research interests include applications of signal processing, underwater acoustics, and data analysis. Mr. McCarthy became a member of Sigma Pi Sigma in 2016 and a member of Phi Beta Kappa in 2017. He was named to the Hampshire Society of the National Football Foundation in 2017.

ANANYA SEN GUPTA (M) is an Associate Editor of IEEE Access, a guest editor of IEEE Journal of Oceanic Engineering Special Issue in “Underwater Acoustic Propagation Physics and Signal Processing Techniques for Shallow Water Acoustic Communications” and a Technical Committee member in IEEE Ocean Engineering Society. She received her MS (2001) and PhD (2006) from University of Illinois at Urbana-Champaign.

From 2008 to 2012 she was a Postdoctoral Scholar and Researcher at Woods Hole Oceanographic Institution working in undersea signal processing and petroleum forensics. Since 2013, she has been an Assistant Professor with the Electrical and Computer Engineering Department, University of Iowa, Iowa City. Dr. Sen Gupta’s research interests lie in the nexus of signal processing, pattern recognition and knowledge discovery, with emphasis on applications to environmental chemistry, underwater acoustics and space plasma physics. She seeks to develop geometric computational techniques that enable sophisticated representation, localization, tracking, and classification of raw instrument signals generated by diverse environmental contaminants, laboratory conditions and natural phenomena. Her algorithms have been applied to shallow water acoustic communications, fingerprinting oil spills, sonar target recognition in high-clutter coastal environments as well as tracking high-energy plasmospheric events on Earth and Mars.

Dr. Sen Gupta currently leads an interdisciplinary research team of graduate and undergraduate students, several of whom have received multiple student research awards under her mentorship. Her research for her Iowa EPSCoR project was recently featured in the ISGC 2015–2016 STIMULI EPsCOR report distributed to Congress. She has also received a teaching award in 2015 and three mentor awards from the Iowa Space Grant Consortium in 2016 and 2017.

BERNICE KUBICEK earned a dual B.S. in Mechanical Engineering and Electrical Engineering from the Milwaukee School of Engineering, Milwaukee, WI in May of 2019. She is currently studying for her Ph.D. in Electrical and Computer Engineering at the University of Iowa, Iowa City, IA. In 2017 she worked as a Process Engineer Intern at Hentzen Coatings Inc., in Milwaukee, WI. From 2018 to 2019, she worked as an Electrical Design Engineering Intern at Astronautics Corporation of America in Milwaukee, WI. Bernice has been a research assistant at the University of Iowa Electrical and Computer Engineering department since June of 2019. Her research interests include active sonar, various feature extraction techniques, and applications of signal processing.

ANDREW M. AWAD received his M.S. in Environmental Science, B.S. in Physics and Astronomy, and B.A. in Music from the University of Iowa. He is a Project Engineer with cGMP Consulting. He is currently working with a large bio-pharmaceutical manufacturer, leading projects to track production processes and streamline data collection. He has 5 years of scientific research experience, with an emphasis on the analysis of PCBs and their chemical byproducts in environmental air, soil, and sediment samples.

ANDRES MARTINEZ is an Assistant Research Engineer at IIHR-Hydroscience Engineering and Adjunct Assistant Professor in the Department of Civil & Environmental Engineering at the University of Iowa. He is a graduate of the University of Iowa (Ph.D., Environmental Engineering), Imperial College, England (M.S., Environmental Technology) and Pontificia Universidad Catolica de Valparaiso, Chile (B.S., Biochemical Engineering). He has nearly 10 years of scientific research experience, during which he has developed expertise in the areas of field sampling, development of analytical method and passive sampling devices, and analysis of organic compounds such as PCB in Complex Environmental matrices, environmental modeling, and data analysis. He has more than 20 peer review papers in high impact scientific journals.

RACHEL F. MAREK received a B.A. in chemistry from Grinnell College and a Ph.D. in environmental engineering from the University of Iowa where she was a US Department of Education GAANN Fellow. As a graduate student in 2013 she won the C. Ellen Gonter Graduate Student Paper Award from the American Chemical Society. She is an Assistant Research Scientist at IIHR-Hydroscience & Engineering at the University of Iowa and a researcher with the Iowa Superfund Research Program. Her research includes sources and fate of environmental contaminants such as PCBs, siloxanes, and pesticides and their breakdown products in abiotic and biotic environmental matrices and whether people, especially children, are exposed to these harmful chemicals. Dr. Marek is a member of the American Chemical Society, the Society for Environmental Toxicology and Chemistry, and the International Association for Great Lakes Research.

KERI HORNBUCKLE is the Donald E. Bently Professor of Engineering in the Department of Civil & Environmental Engineering and Research Engineer at IIHR-Hydroscience and Engineering at the University of Iowa. She is a graduate of the University of Minnesota (Ph.D., environmental engineering and science), and Grinnell College (B.A., chemistry). She is the Director of the Iowa Superfund Research Program, a research center funded by the National Institute for Environmental Health Sciences, an Associate Editor of the American Chemical Society journal, Environmental Science and Technology, and an expert on the sources and transport of polychlorinated biphenyls (PCBs), synthetic fragrances, perfluorinated compounds, siloxanes, current use and legacy pesticides, and other persistent organic pollutants.

VII. APPENDIX A

A. FCM CLUSTERING CERTAINTY AND FUZZY PARTITION EXPONENT

Equations (13)–(15) depend primarily on the fuzzy partition exponent, m, to update the cost function and centroid location within each cluster. The fuzzy partition exponent dictates how fuzzy the results will be and can often skew the results determined by the relative inter-distance of the data-points. While choosing a random m may yield results, further observations on the exponent m are plotted in Fig. 14. An important observation is that increasing m causes the uncertainty of points within a cluster to increase. Further, the number of clusters impacts the cluster certainty because the clusters will be close together, making it difficult to distinguish the optimal cluster for a point to belong to. For this work, a smaller m value is implemented to ensure clustering of neighboring data.
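The effect of m on cluster certainty can be seen directly from Equation (15). The sketch below evaluates the memberships of a single point from its distances to the centroids; the function name and the distances are illustrative assumptions, not values from the data.

```python
def memberships(distances, m):
    """Eq. (15) for one data point, given its distances to each centroid."""
    power = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_l) ** power for d_l in distances)
        for d_j in distances
    ]

# A point that is twice as far from the second centroid as from the first.
distances = [1.0, 2.0]
sharp = memberships(distances, m=1.2)   # close to a hard assignment
medium = memberships(distances, m=2.0)  # [0.8, 0.2]
fuzzy = memberships(distances, m=3.0)   # memberships move toward each other
```

As m grows, the largest membership shrinks toward 1/k, which is the decrease in average cluster certainty described above.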

FIGURE 14.

Average degree of membership of points to a cluster as a function of cluster number and fuzzy partition matrix exponent, m. (a) shows c-means clustering for the sample locations while (b) shows the clustering for target PCBs. As the fuzzy partition matrix exponent, m, increases, the average cluster certainty decreases (average cluster uncertainty increases) since there are more options for a data-point to belong to.

VIII. APPENDIX B

A. DETERMINING K CLUSTERS

For practical purposes, the sum of Euclidean distances from each point to the centroid of its cluster is computed and plotted to determine the optimal choice of k clusters. Mathematically, this sum is expressed as:

$$d = \sum_{i=1}^{k} \sum_{g_j \in K_i} \left\lVert c_i - g_j \right\rVert_2^2 \tag{18}$$

where g_j denotes a point in the scores matrix G, K_i is the i-th of the k clusters in the set K, and c_i is the center of cluster i. Summing d across multiple replications with the same number of clusters measures the variability of the points within each cluster and describes how compact the clusters are. To confirm the number of clusters used, d is plotted for different numbers of clusters. Fig. 15 shows the optimal choice of clusters for the scores matrix, G, of both the PCB and the sample location data. The choice of clusters is based on empirical observations and the slope of the plot in Fig. 15.

FIGURE 15.

Plot of the sum of Euclidean distances of points within a cluster to their respective centroid vs. the number of clusters, k, using the iterative process depicted in Fig. 5. The sum of Euclidean distances was averaged by implementing k-means and fuzzy c-means clustering 20 unique times for each k.
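The procedure behind Fig. 15 can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: it uses plain k-means only (no fuzzy c-means), reads Equation (18) as a sum of squared 2-norms, and runs on hypothetical two-dimensional points instead of the scores matrix G. All function names are illustrative.

```python
import math
import random

def assign(points, centers):
    """Group points by nearest center (Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def centroid(cluster):
    """Coordinate-wise mean of a non-empty cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd iterations started from k randomly chosen data points."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        updated = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)

def within_cluster_sum(centers, clusters):
    """d of Equation (18), read as squared distances of points to their own center."""
    return sum(math.dist(c, p) ** 2 for c, cl in zip(centers, clusters) for p in cl)

def averaged_d(points, k, runs=20, seed=0):
    """Average d over several random restarts for a given k."""
    rng = random.Random(seed)
    return sum(within_cluster_sum(*kmeans(points, k, rng)) for _ in range(runs)) / runs

# Hypothetical toy data: three well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0),
          (0.0, 10.0), (0.0, 11.0), (1.0, 10.0), (1.0, 11.0)]
d_by_k = {k: averaged_d(points, k) for k in range(1, 6)}
for k, d in d_by_k.items():
    print(k, round(d, 3))
```

Plotting d_by_k against k and looking for the point where the slope flattens corresponds to the empirical choice of k described above. Averaging over restarts matters because a single k-means run can settle in a poor local minimum.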

IX. APPENDIX C

A. AROCLOR PERCENTAGE PRESENCE

The contribution of each Aroclor is determined for each sampling location using Equation (17). The mass fraction contribution matrix, D, is determined by normalizing the PCBs found in the samples to the calibration data using internal standards. The relative response factor (RRF) of each PCB is determined from the peak heights normalized to the calibration data. The percentages of Aroclor-only contributions for each sampling location are determined using [5]. The two most prevalent Aroclors found across all sampling locations are plotted in Fig. 11 and discussed further in Section V-E.
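The normalization step can be illustrated with the conventional internal-standard calculation. This is a sketch under the assumption that the RRF follows the usual definition (analyte-to-internal-standard response ratio divided by the corresponding mass ratio in the calibration standard); the exact procedure used to build D is not reproduced here. All peak heights, masses, and congener labels below are hypothetical.

```python
def relative_response_factor(h_analyte_cal, h_is_cal, mass_analyte_cal, mass_is_cal):
    """RRF from a calibration run: response ratio divided by mass ratio."""
    return (h_analyte_cal / h_is_cal) * (mass_is_cal / mass_analyte_cal)

def sample_mass(h_analyte, h_is, mass_is_spiked, rrf):
    """Analyte mass in a sample from its peak height relative to the internal standard."""
    return (h_analyte / h_is) * mass_is_spiked / rrf

def mass_fractions(masses):
    """Normalize per-congener masses so that they sum to one."""
    total = sum(masses.values())
    return {name: m / total for name, m in masses.items()}

# Hypothetical calibration: analyte peak 200 for 10 ng, internal standard peak 100 for 5 ng.
rrf = relative_response_factor(200.0, 100.0, 10.0, 5.0)

# Hypothetical sample peaks, each measured against an internal-standard peak of 100 (5 ng spiked).
masses = {
    "congener A": sample_mass(300.0, 100.0, 5.0, rrf),
    "congener B": sample_mass(100.0, 100.0, 5.0, rrf),
}
fractions = mass_fractions(masses)
print(rrf, masses, fractions)
```

A single RRF is reused for both congeners only to keep the example short; in practice each PCB has its own RRF from the calibration data.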

X. APPENDIX D

A. MIXTURE PERCENTAGE PRESENCE

The contribution of each Aroclor is determined for each sampling location using Equation (17). The mass fraction contribution matrix, D, is determined by normalizing the PCBs found in the samples to the calibration data using internal standards. The relative response factor (RRF) of each PCB is determined from the peak heights normalized to the calibration data. The percentages of Aroclor and other mixture contributions are then determined for each sampling location. While Aroclor-only contribution values can be found in [5], contributions from other PCB-only sources and from random mixtures of varying PCBs are also considered to determine the mixture percentages. The two most prevalent mixtures found across all sampling locations are plotted in Fig. 11 and discussed further in Section V-E.
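As a generic illustration of how percentage contributions can be estimated, the sketch below fits a sample profile as a non-negative combination of source profiles. It is a stand-in and not Equation (17): the multiplicative-update least-squares solver, the function names, and the two four-congener source profiles are all assumptions made for the example.

```python
def nonnegative_fit(sources, sample, iterations=500, eps=1e-12):
    """Minimize ||sum_j x_j * sources[j] - sample||^2 with x_j >= 0
    using multiplicative updates (requires non-negative inputs)."""
    n = len(sources)
    gram = [[sum(a * b for a, b in zip(sources[i], sources[j])) for j in range(n)]
            for i in range(n)]
    proj = [sum(a * s for a, s in zip(sources[i], sample)) for i in range(n)]
    x = [1.0] * n
    for _ in range(iterations):
        # Each weight is rescaled by the ratio of target to current projection.
        x = [x[i] * proj[i] / (sum(gram[i][j] * x[j] for j in range(n)) + eps)
             for i in range(n)]
    return x

def percent_contributions(weights):
    """Express non-negative weights as percentages of their total."""
    total = sum(weights)
    return [100.0 * w / total for w in weights]

# Hypothetical mass-fraction profiles of two sources over four congeners.
source_a = [0.6, 0.3, 0.1, 0.0]
source_b = [0.0, 0.1, 0.3, 0.6]
# Synthetic sample built as 70 % of source A and 30 % of source B.
sample = [0.7 * a + 0.3 * b for a, b in zip(source_a, source_b)]

weights = nonnegative_fit([source_a, source_b], sample)
percents = percent_contributions(weights)
print([round(p, 2) for p in percents])
```

The non-negativity constraint reflects that a source cannot contribute a negative amount of PCB mass; with real data the fit residual would also need to be inspected before interpreting the percentages.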

XI. APPENDIX E

A. NOMENCLATURES

Acronym Definition

PCB Polychlorinated Biphenyl
Congener One of the 209 well-defined chemical PCB compounds
Aroclor Mixture of well-known PCB congeners with unique signals
RRF Relative Response Factor
GC/MS/MS Gas Chromatography-Mass Spectrometry
PCA Principal Component Analysis
TIC Total Ion Chromatogram
MRM Multiple Reaction Monitoring
Hi-Vols High-Volume Air Samplers
SNR Signal-to-Noise Ratio
FCM Fuzzy c-means Clustering
PMF Positive Matrix Factorization

XII. APPENDIX F

A. MRM MASS TRANSITIONS (M/Z)

Cl Homolog Precursor Ion Product Ion

1 188.0 152.0
2 222.0 152.0
3 258.0 186.0
4 291.9 222.0
5 325.0 255.9
6 359.8 289.9
7 393.8 323.9
8 429.7 259.8
9 463.7 393.8
10 497.7 427.9

REFERENCES

  • [1] Lee J, Park TH, Kang HS, and Lim S, "Miniaturized gas chromatography module with micro posts embedded MEMS column for the separation of exhaled breath gas mixtures," 2016 IEEE SENSORS, vol. 37, no. 2, pp. 1–3, Oct. 2016.
  • [2] Gupta AS, Meyer B, and Overton E, "Quantifying weathering profiles of environmental contaminants from marine and coastal oil spills using signal processing techniques," OCEANS 2018 MTS/IEEE Charleston, pp. 1–4, Apr. 2018.
  • [3] Skarysz A, Alkhalifah Y, Darnley K, Eddleston M, Hu Y, McLaren DB, Nailon WH, Salman D, Sykora M, Thomas CLP, and Soltoggio A, "Using Capillary Gas Chromatography to Determine Polychlorinated Biphenyls (PCBs) in Electrical Insulating Liquids," 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, Jul. 2018.
  • [4] Zubir NSA, Abas MA, Ismail N, Ali NAM, Rahiman MHF, Mun NK, Taib MN, and Saiful NT, "Pattern classifier of chemical compounds in different qualities of agarwood oil parameter using scale conjugate gradient algorithm in MLP," 2017 IEEE 13th International Colloquium on Signal Processing & its Applications (CSPA), pp. 18–22, Mar. 2017.
  • [5] Fu Y, Wan X, Zhang X, Fang G, and Yi J, "Side Peak Interference Mitigation in FM-Based Passive Radar Via Detection Identification," IEEE Transactions on Aerospace and Electronic Systems, vol. 53, no. 2, pp. 778–788, Apr. 2017.
  • [6] Johnson GW, Ehrlich R, Full W, and Ramos S, "Principal Components Analysis and Receptor Models in Environmental Forensics," Introduction to Environmental Forensics (Third Edition), USA, Acad. Press, 2015, ch. 18, pp. 609–653.
  • [7] Amigo JM, Skov T, Coello J, Maspoch S, and Bro R, "Solving GC-MS Problems with Parafac2," Trends in Analytical Chemistry, vol. 27, no. 8, pp. 714–725, Sep. 2008.
  • [8] Amigo JM, Popielarz MJ, Callejon RM, Morales ML, Troncoso AM, Peterson MA, and Toldam-Anderson TB, "Comprehensive Analysis of Chromatographic Data by Using Parafac2 and Principal Components Analysis," Journal of Chromatography A, vol. 1217, no. 26, pp. 4422–4429, Jun. 2010.
  • [9] Corbella R, Rodriguez-Delgado MA, and Garcia Montelongo FJ, "Contribution to the Identification and Quantitation of Aroclor Mixtures by Least Squares Analysis of Gas Chromatographic Data," Journal of Chromatographic Science, vol. 36, no. 7, pp. 372–378, Jul. 1998.
  • [10] Risso F, Magherini A, Ottonelli M, Magi E, Lottici S, Maggiolo S, Garbarino M, and Narizzano R, "A Comprehensive Approach to Actual Polychlorinated Biphenyls Environmental Contamination," Environmental Science and Pollution Research, vol. 23, no. 9, pp. 8770–8780, May 2016.
  • [11] Zhang M and Harrington PB, "Automated Pipeline for Classifying Aroclors in Soil by Gas Chromatography/Mass Spectrometry Using Modulo Compressed Two-Way Data Objects," Talanta, vol. 117, pp. 438–491, Dec. 2013.
  • [12] Ma CY and Bayne CK, "Differentiation of Aroclors Using Linear Discrimination for Environmental-Samples Analyzed by Electron-Capture Negative-Ion Chemical Ionization Mass-Spectrometry," Analytical Chemistry, vol. 65, no. 6, pp. 772–777, Mar. 1993.
  • [13] Karcher SC, Small MJ, and Vanbriesen JM, "Statistical Method to Evaluate the Occurrence of PCB Transformations in River Sediments with Application to Hudson River Data," Environmental Science and Technology, vol. 38, no. 24, pp. 6760–6766, Dec. 2004.
  • [14] Bro R, "PARAFAC. Tutorial and Applications," Chemometrics and Intelligent Laboratory Systems, vol. 38, no. 2, pp. 149–171, Oct. 1997.
  • [15] Faroon O, Syracuse Research Corporation, and Olson J, "Toxicology Profile for Polychlorinated Biphenyls (PCBs)," U.S. Department of Health and Human Services, Nov. 2000. https://stacks.cdc.gov/view/cdc/6480/cdc_6480_DSI.pdf
  • [16] Kopp T, "PCBs in the United States Industrial Use and Environmental Distribution," EPA, contract 6801–3259, Addr. Washington D.C. USA, Feb. 1976. https://nepis.epa.gov/Exe/ZyPURL.cgi?Dockey=20001275.TXT
  • [17] Hu D and Hornbuckle KC, "Inadvertent Polychlorinated Biphenyls in Commercial Paint Pigments," Environmental Science and Technology, vol. 44, no. 8, pp. 2822–2827, Apr. 2010.
  • [18] Herkert NJ, Jahnke JC, and Hornbuckle KC, "Emissions of Tetrachlorobiphenyls (PCBs 47, 51, and 68) from Polymer Resin on Kitchen Cabinets as a Non-Aroclor Source to Residential Air," Environmental Science and Technology, vol. 52, no. 9, pp. 5154–5160, Apr. 2018.
  • [19] Hites RA, "Atmospheric Concentrations of PCB-11 Near the Great Lakes Have Not Decreased Since 2004," Environmental Science and Technology Letters, vol. 5, no. 3, pp. 131–135, Feb. 2018.
  • [20] Marek RF, Thorne PS, Herkert NJ, Awad AM, and Hornbuckle KC, "Airborne PCBs and OH-PCBs Inside and Outside Urban and Rural U.S. Schools," Environmental Science and Technology, vol. 51, no. 14, pp. 7853–7860, Jun. 2017.
  • [21] Frame GM, "A Collaborative Study of 209 PCB Congeners and 6 Aroclors on 20 Different HRGC Columns 2. Semi-Quantitative Aroclor Congener Distributions," Fresenius Journal of Analytical Chemistry, vol. 357, no. 6, pp. 714–722, Mar. 1997.
  • [22] Morgan MA and McDaniel BW, "Transient electromagnetic scattering: data acquisition and signal processing," IEEE Transactions on Instrumentation and Measurement, vol. 37, no. 2, Jun. 1988.
  • [23] Peabody K, Husain A, Tang MH, and Macek Z, "Digital Signal Processing Aids Cholesterol Plaque Detection," 1995 International Conference on Acoustics, Speech, and Signal Processing, May 1995. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=480446
  • [24] Madhow U, "MMSE Interference Suppression for Timing Acquisition and Demodulation in Direct-Sequence CDMA Systems," IEEE Transactions on Communications, vol. 46, no. 8, pp. 1065–1075, Aug. 1998.
  • [25] Mirbagheri A, "Linear MMSE Receivers for Interference Suppression & Multipath Diversity Combining in Long-Code DS-CDMA Systems," 2003.
  • [26] Uhlich S and Yang B, "MMSE estimation in a linear signal model with ellipsoidal constraints," 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3249–3252, May 2009.
  • [27] Simsa J, "Linear Adaptive Blind MMSE Detectors for DS-CDMA Signals," 2007 17th International Conference Radioelektronika, Jun. 2007.
  • [28] Li H and Djuric PM, "MMSE Estimation of Nonlinear Parameters of Multiple Linear/Quadratic Chirps," IEEE Transactions on Signal Processing, vol. 46, no. 3, pp. 796–800, Mar. 1998.
  • [29] Hu D, Lehmler H, Martinez A, Martinez K, and Hornbuckle KC, "Atmospheric PCB Congeners Across Chicago," Atmospheric Environment, vol. 44, no. 12, pp. 1550–1557, Apr. 2010.
  • [30] Awad AM, Martinez A, Marek RF, and Hornbuckle KC, "Occurrence and Distribution of Two Hydroxylated Polychlorinated Biphenyl Congeners in Chicago Air," Environmental Science and Technology Letters, vol. 3, no. 2, pp. 47–51, Jan. 2016.
  • [31] U.S. EPA, "Method 8082A (SW-846): Polychlorinated Biphenyls (PCBs) by Gas Chromatography, Revision 1," 2007. https://www.epa.gov/sites/production/files/2015-12/documents/8082a.pdf
  • [32] U.S. EPA, "Method 1668b Chlorinated Congeners in Water, Soil, Sediment, Biosolids, and Tissue by HRGC/HRMS," 2008. https://nepis.epa.gov/Exe/ZyPURL.cgi?Dockey=P1005EUE.TXT
  • [33] Baldacci A and Haralabus G, "Signal Processing for an Active Sonar System Suitable for Advanced Sensor Technology Applications and Environmental Adaptation Schemes," 2014 14th European Signal Processing Conference, Sep. 2006. https://ieeexplore.ieee.org/abstract/document/7071237
  • [34] Knight WC, Pridham RG, and Kay SM, "Digital Signal Processing for Sonar," Proceedings of the IEEE, vol. 69, no. 11, pp. 1451–1506, Nov. 1981.
  • [35] Haykin S and Kosko B, Intelligent Signal Processing, USA: IEEE Press, 2001.
  • [36] Atkins PR, Collins T, and Foote KG, "Transmit-Signal Design and Processing Strategies for Sonar Target Phase Measurement," IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 1, pp. 91–104, Jun. 2007.
  • [37] Wei Z, Huang J, and Hui Y, "Adaptive-Beamforming-Based Multiple Targets Signal Separation," 2011 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Oct. 2011.
  • [38] Ni Z, Wang L, Meng J, Qiu F, and Huang J, "EEG Signal Processing in Anesthesia Feature Extraction of Time and Frequency Parameters," Procedia Environmental Sciences, vol. 8, pp. 215–220, 2011.
  • [39] Ahmadi AK, Moradi P, Malihi M, Karimi S, and Shamsollahi MB, "Heart Rate Monitoring During Physical Exercise Using Wrist-Type Photoplethysmographic (PPG) Signals," 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Aug. 2015. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7319800
  • [40] Jain RK, Moura JMF, and Kontokosta CE, "Big Data + Big Cities: Graph Signals of Urban Air Pollution [Exploratory SP]," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 130–136, Sep. 2014.
  • [41] Rood D, "Gas Chromatography Problem Solving and Troubleshooting," Journal of Chromatographic Science, vol. 35, no. 136, pp. 239–240, May 1997.
  • [42] Ligon WV and May RJ, "Target Compound Analysis by Two-Dimensional Gas Chromatography-Mass Spectrometry," Journal of Chromatography A, vol. 294, pp. 77–86, 1984.
  • [43] Sen Gupta A, Reddy CM, and Nelson R, "Systems and methods for topographic analysis," U.S. Patent 8,838,393, issued Sept. 2014.
  • [44] Damavandi HG, Sen Gupta A, Nelson RK, and Reddy CM, "Interpreting comprehensive two-dimensional gas chromatography using peak topography maps with application to petroleum forensics," Chemistry Central Journal, vol. 10, no. 75, pp. 1–14, Nov. 2016.
  • [45] Bruflodt R, Nelson RK, Arrington EC, Valentine D, Sen Gupta A, Lemkau K, Kivenson V, and Reddy CM, "Fingerprinting the Refugio Oil Spill Using Topographic Signal Processing of Two-Dimensional Gas Chromatographic Images," OCEANS 2017-Anchorage, pp. 1–4, Sept. 2017.
  • [46] Sen Gupta A, Meyer B, and Overton E, "Quantifying Weathering Profiles of Environmental Contaminants from Marine and Coastal Oil Spills Using Signal Processing Techniques," OCEANS 2018 MTS/IEEE Charleston, pp. 1–4, Oct. 2018.
  • [47] Damavandi HG, Sen Gupta A, Canahuate G, Reddy CM, and Nelson R, "Robust Oil-spill Forensics and Petroleum Source Differentiation using Quantized Peak Topography Maps," July 2018, arXiv preprint arXiv:1807.07484.
  • [48] Jolliffe IT, Principal Component Analysis, Hoboken, Wiley, 2002.
  • [49] Demiriz A, Bennett KP, Breneman CM, and Embrechts MJ, "Support Vector Machine Regression in Chemometrics," Computing Science and Statistics, Proceedings of the 33rd Symposium on the Interface, 2001.
  • [50] Howley T, Madden MG, O'Connell M-L, and Ryder AG, "The effect of principal component analysis on machine learning accuracy with high-dimensional spectral data," Knowledge Based Systems, vol. 19, no. 5, pp. 363–370, Sept. 2006.
  • [51] Kamstrup-Nielsen M, Johnsen L, and Bro R, "Core Consistency Diagnostic in PARAFAC2," Journal of Chemometrics, vol. 27, no. 5, pp. 149–171, May 2013.
  • [52] Rodenburg LA and Ralston DK, "Historical Sources of Polychlorinated biphenyls to the Sediment of the New York/New Jersey Harbor," Chemosphere, vol. 169, pp. 450–459, Nov. 2017.
  • [53] Du S, Belton TJ, and Rodenburg LA, "Source Apportionment of Polychlorinated Biphenyls in the Tidal Delaware River," Environmental Science and Technology, vol. 42, no. 11, pp. 4044–4051, Apr. 2008.
  • [54] Rodenburg LA, Du S, Fennell DE, and Cavallo GJ, "Evidence for Widespread Dechlorination of Polychlorinated Biphenyls in Groundwater, Landfills, and Wastewater Collection Systems," Environmental Science and Technology, vol. 44, no. 19, pp. 7534–7540, Sep. 2010.
  • [55] Rodenburg LA, Du S, Xiao B, and Fennell DE, "Source Apportionment of Polychlorinated Biphenyls in the New York/New Jersey Harbor," Chemosphere, vol. 83, no. 6, pp. 792–798, Apr. 2011.
  • [56] Nickel M, Murphy K, Tresp V, and Gabrilovich E, "A Review of Relational Machine Learning for Knowledge Graphs," Proceedings of the IEEE, vol. 104, no. 1, pp. 11–33, Jan. 2016.
