In silico machine learning–enabled detection of polycyclic aromatic hydrocarbons from contaminated soil

Yilong Ju; Oara Neumann; Sara B Denison; Peixuan Jin; Andres B Sanchez-Alvarado; Peter Nordlander; Thomas P Senftle; Pedro J J Alvarez; Ankit Patel; Naomi J Halas

doi:10.1073/pnas.2427069122

Proc Natl Acad Sci U S A. 2025 May 8;122(19):e2427069122. doi: 10.1073/pnas.2427069122

PMCID: PMC12088428 PMID: 40339117

Significance

Soil contamination by environmental pollutants, particularly polycyclic aromatic hydrocarbons (PAHs), can significantly affect human health due to their carcinogenic and mutagenic properties. In this work, we present an approach that integrates theoretical spectral calculations with machine learning to identify PAHs, an approach that can be extended straightforwardly to the thousands of lesser-known and virtually unstudied environmental pollutants that also pose public health risks. By extracting characteristic spectral features and training a detection model to differentiate between contaminated and as-collected/reference soil samples, this work offers a scalable solution to address widespread environmental health issues.

Keywords: polycyclic aromatic hydrocarbons, SERS, machine learning, polycyclic aromatic compounds

Abstract

The detection and identification of polycyclic aromatic hydrocarbons (PAHs) and their modified derivatives in contaminated soil is challenging due to the chemical and microbial complexity of soil organic matter. To address these challenges, we developed an innovative analytical approach that combines surface-enhanced Raman spectroscopy with a Raman spectral library constructed in silico using density functional theory (DFT)-calculated spectra. This method overcomes several limitations associated with traditional experimental libraries, including spectral background interference, solvent effects, and compounds that are commercially unavailable or difficult to synthesize. Our methodology employs a physics-informed machine learning pipeline that operates in two stages: the characteristic peak extraction (CaPE) algorithm, which isolates distinctive spectral features, and the characteristic peak similarity (CaPSim) algorithm, which identifies analytes with high robustness to spectral shifts and amplitude variations. Validation of this approach showed strong similarity values (>0.6) between DFT-calculated and experimental surface-enhanced Raman spectra for multiple PAHs, confirming its accuracy and discriminative capability. This study establishes the viability of DFT-calculated spectra as reliable references for identifying analytes that lack experimental reference spectra, including those formed through environmental modification of PAHs. This advancement addresses a critical gap in environmental monitoring, providing a valuable tool for assessing public health risks associated with these contaminants.

Soil is critical to life on Earth: the reduction in soil health due to environmental contaminants is of significant global concern. Soil organic matter is considered the most complex biomaterial on our planet, consisting of a complex mixture of plant and microbial matter in various stages of decay (1). Environmental contaminants such as polycyclic aromatic hydrocarbons (PAHs) can readily associate with the hydrophobic components of soil organic matter, presenting ongoing challenges for remediation and restoration of soil fertility. PAHs and their functionalized derivatives (polycyclic aromatic compounds, or PACs) are well known for their carcinogenic and mutagenic properties and for their resultant wide variety of deleterious health effects due to contact exposure, inhalation, and ingestion (2–6). Research has also shown that PAHs strongly adsorb to clays containing certain metals, such as Fe(III) and Cu(II), which can facilitate electron transfer, leading to PAH degradation and the formation of byproducts that may also be difficult to detect (7–9).

Surface-enhanced Raman spectroscopy (SERS) has the potential to be a valuable nondestructive technique in environmental analysis due to its high sensitivity to trace amounts of chemicals (10, 11). However, SERS has inherent challenges, such as substrate-specific variations in SERS spectra that complicate the identification of specific analytes and the presence of complex backgrounds from analyte host matrices (12). Over the past two decades, there has been increasing interest in the development of AI and machine learning (ML) algorithms for SERS (13). Recently, we have begun to develop physics-informed ML algorithms to address some of the critical liabilities of SERS, with the goal of realizing streamlined detection and identification of contaminants in biological and environmental samples. Our focus has been on PAHs and PACs, molecules frequently found in complex mixtures in biological and environmental host matrices (12, 14). By applying Independent Component Analysis to the SERS spectra of mixtures of PAHs, we have shown that individual analytes can be identified without physical separation, an approach we call “Computational Chromatography” (15). Applying an approach analogous to facial recognition, we showed that analytes can be identified using an experimentally compiled Raman library as ground truth, even in the presence of significant spectral shifts and peak amplitude variations induced by specific interactions between the analyte molecules and different SERS substrates (16). A physics-informed peak extraction algorithm, characteristic peak extraction (CaPE), was created and applied to isolate Raman modes in SERS spectra, providing tolerance to the spectral shifts and amplitude variations commonly encountered in SERS spectra relative to Raman spectra. Subsequently, a characteristic peak similarity metric (CaPSim) was developed and applied to quantitatively compare CaPE-processed experimental spectra to CaPE-processed theoretically calculated Raman spectra for accurate chemical identification.

Detection of PAHs from soil is particularly challenging due to the extraordinarily complex SERS spectral background created by the wide variety of molecules and microbes likely present in any given soil sample. Here, we address this challenge by using theoretical Raman spectra generated through density functional theory (DFT) for chemical identification, enabling detection despite spectral complexity. We compare theoretical Raman spectra with experimental Raman spectra of the same molecules and observe that CaPE-extracted theoretical and experimental Raman spectra have a far higher degree of similarity than the analogous non-CaPE-extracted spectra. This important finding allows us to utilize theoretical Raman spectra as an in silico ground truth Raman library, important for broader classes of molecules such as the thousands of closely related PACs that may be present in environmental samples (17). CaPE was then used to process the SERS spectra of the soil extract, isolating spectral features of the PAHs that were present. The CaPE-extracted SERS spectra were then quantitatively compared to the ground truth DFT-calculated Raman spectra using CaPSim. We then compared these CaPSim results with a similarity metric that compares the non-CaPE-extracted experimental SERS spectra with the ground truth theoretical Raman spectra (NormSim). Our results demonstrate that CaPE and CaPSim consistently improve detection accuracy in complex matrices. This study not only validates the potential of CaPE and CaPSim for environmental monitoring but also offers a robust framework for applying this approach to the detection of a broader range of contaminants likely to be present in complex environmental samples.

Results and Discussions

A schematic overview of our sample preparation, PAH extraction, and the substrates for detecting PAHs using SERS is shown in Fig. 1A. The soil used in this study was collected from Harris Gully, a restored watershed and natural area on the Rice University campus, located at a significant distance from buildings, roads, and dormitories. The as-collected soil samples (Houston, TX: 29.717957, -95.395866), consisting of 43% clay and 37% sand, were then contaminated with controlled concentrations of pyrene (PYR), anthracene (ANTH), or a PYR-ANTH mixture. After contaminating the as-collected soil, the PAH-soil mixture was sealed, shaken for approximately 2 min to enhance the absorption of PAHs into the soil, and left to dry at room temperature until the acetone evaporated entirely. Acetone was selected as the solvent after evaluating alternative solvents (18, 19), including toluene, 1:1 hexane:acetone, dichloromethane (DCM), chloroform, and methanol, because it offered a simpler Raman signal background, which minimized spectral interference and facilitated clearer detection of PAHs. Subsequently, the as-collected soil sample (used as reference) and the contaminated soil underwent acetone extraction using either a simple filtration method or accelerated solvent extraction (ASE) (see Materials and Methods for details). A GC–MS chromatogram comparing acetone extractions from as-collected soil and pyrene-contaminated soil, highlighting differences in extracted compounds, is included in SI Appendix, Fig. S1. The acetone-extracted solutions containing PAHs were then deposited onto the SERS substrate by drop-drying.

The SERS substrate consists of SiO₂ core-Au shell nanoparticles known as nanoshells (NS) (20), with an average diameter of 165 ± 17 nm. A scanning electron microscopy (SEM) image in SI Appendix, Fig. S2A shows the spherical geometry and monodispersity of the NSs. The aqueous extinction spectrum of the NSs (SI Appendix, Fig. S1B) shows a dipole plasmon resonance mode centered at 800 nm, appropriate for the 785 nm laser excitation wavelength used in the SERS measurements. The plasmon resonance wavelength and the broad plasmon linewidth ensure that the NS-based substrate provides sufficient near-field enhancement at the laser wavelength and the Stokes modes of the SERS spectra of the analytes to be detected (21).

The concentrations of extracted PAHs from the contaminated soil samples were quantified using gas chromatography–mass spectrometry (GC–MS) (Fig. 1B). Comparable concentrations of extracted PAHs were observed for both filtration (red) and ASE extraction (black) methods from the contaminated soil samples containing PYR, ANTH, and a 1:1 molar mixture of PYR: ANTH. The dashed lines in Fig. 1 B and C indicate a linear relationship between soil contamination levels and extracted PAH concentrations, which range from 1 to 600 μg/g. This underscores the effectiveness of both extraction methods for recovering PAHs, emphasizing that the filtration method, which functions efficiently at room temperature and pressure, delivers results comparable to the ASE method. This low-energy, room-temperature filtration process may present a more practical and accessible alternative for PAH extraction from soil, potentially eliminating the need for specialized high-temperature or high-pressure equipment.

A 20 µL sample of each PAH extract, obtained through filtration (Materials and Methods), was drop-dried onto a SERS substrate. Approximately 25 spectra were collected from different regions of the SERS substrate for each sample, and the average spectra are shown in Fig. 2A. The SERS spectrum of the PYR-ANTH mixture (Fig. 2 A, i) shows a linear superposition of the individual characteristic SERS peaks of PYR and ANTH. This suggests that the two PAHs do not interact chemically, but adsorb independently onto the SERS substrates, confirming the capability of this method to distinguish multiple PAHs simultaneously. The filtered solutions of 67 µM ANTH (green, Fig. 2 A, ii) and 141 µM PYR (red, Fig. 2 A, iii) in acetone exhibit strong, distinctive peaks that correspond to each compound. The characteristic SERS peaks for PYR are found at approximately 408 (C–C stretch), 590 (C–C skeletal stretch), 1,250 (C–C stretch and CH in-plane bending), and 1,408 cm⁻¹ (C–C stretch/ring stretch), while ANTH displays characteristic peaks at 392 (C–C skeletal deformation), 754 (C–C skeletal stretch), and 1,539 cm⁻¹ (C–C stretch). To systematically assess the reproducibility of SERS for detecting mixed PAHs, we varied the concentrations of the PYR+ANTH mixture (SI Appendix, Fig. S3). Each concentration variation was analyzed under the same experimental conditions to eliminate external variability and to confirm that any observed differences in the SERS spectra were directly related to changes in the concentration of the analytes. The results demonstrate that the NS-based SERS substrates capture the characteristic peaks of both PYR and ANTH across all tested concentrations, confirming the sensitivity of the substrate and the potential for quantitative analysis in complex environmental samples. In contrast, the SERS spectrum of the as-collected soil mixed with acetone (blue, Fig. 2 A, iv) reveals complex spectral features due to the many additional components present in the soil. For reference, the SERS spectrum of the substrate alone is shown in black (Fig. 2 A, v).

Fig. 2. SERS analysis of acetone-extracted solutions from contaminated soil and machine learning analysis. (A) SERS spectra of an acetone solution extracted directly from (i) a (53:49 μM) PYR:ANTH mixture, (ii) 67 μM ANTH, (iii) 141 μM PYR, (iv) as-collected soil, and (v) a reference spectrum of the SERS substrate, all measured on Au NS-based SERS substrates. (B) Comparative spectral analysis: (a) DFT-calculated Raman spectrum of PYR, (b) experimental Raman spectrum of PYR powder, and baseline-removed SERS spectra of an acetone solution extracted from (c) contaminated soil with PYR, and (d) as-collected soil.

To achieve accurate spectral analysis, we first removed the baseline to eliminate background interference from the soil matrix. This was accomplished using the adaptive, iteratively reweighted penalized least squares (airPLS) algorithm (22), which automatically fine-tunes weights during each iteration to create a smooth baseline curve. The airPLS method identified and removed the gradually changing background signals while preserving the distinct Raman peaks in the spectrum. This approach is particularly effective for SERS spectra of soil samples, where fluorescence and other matrix effects typically produce slowly varying background interference that can obscure meaningful spectral features. We refer to a spectrum after this process as the baseline-removed spectrum. Using PYR as a representative contaminant, we compared the theoretically calculated Raman spectrum using DFT (Fig. 2 B, a) and the experimental normal Raman spectrum of PYR powder (Fig. 2 B, b) against the baseline-removed spectra from PYR-contaminated soil (Fig. 2 B, c) and as-collected soil (Fig. 2 B, d). We first applied CaPE and CaPSim to investigate the use of theoretically calculated Raman spectra for PAH identification (Fig. 3). A high degree of similarity between the theoretical and the experimental Raman spectra for PAHs should permit the use of calculated Raman spectra as ground truth for PAH analyte identification. While visual inspection clearly suggests a promising agreement between the experimental and theoretical Raman spectra of PAHs, we performed a quantitative comparison.
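The reweighting idea behind this kind of baseline removal can be sketched in a few lines. The following is a simplified, dense-matrix illustration in the spirit of airPLS (22), not the implementation used in this work: the function name `airpls_baseline`, the smoothness parameter `lam`, the iteration cap, the tolerance, and the synthetic spectrum are all assumptions chosen for illustration.

```python
import numpy as np

def airpls_baseline(y, lam=1e6, max_iter=15, tol=1e-3):
    """Estimate a smooth baseline with an airPLS-style reweighting loop.

    Simplified sketch: dense matrices, second-order smoothness penalty.
    lam, max_iter, and tol are illustrative values, not tuned settings.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-order difference operator; lam * D.T @ D penalizes curvature.
    D = np.diff(np.eye(n), n=2, axis=0)
    H = lam * (D.T @ D)
    w = np.ones(n)
    z = y.copy()
    for t in range(1, max_iter + 1):
        # Weighted penalized least squares fit of the baseline z.
        z = np.linalg.solve(np.diag(w) + H, w * y)
        d = y - z
        neg = d[d < 0]
        neg_sum = np.abs(neg.sum())
        # Stop once the signal lying below the baseline is negligible.
        if neg_sum < tol * np.abs(y).sum():
            break
        # Points above the baseline (likely peaks) get zero weight;
        # points below it are up-weighted so the fit moves down to them.
        w = np.where(d < 0, np.exp(t * np.abs(d) / neg_sum), 0.0)
        w[0] = w[-1] = np.exp(t * neg.max() / neg_sum)
    return z

# Synthetic example (hypothetical data): sloping background plus one peak.
x = np.arange(300)
spectrum = 0.01 * x + np.exp(-0.5 * ((x - 150) / 3.0) ** 2)
corrected = spectrum - airpls_baseline(spectrum)
```

In this toy case the slowly varying ramp is absorbed by the baseline while the narrow peak is left in `corrected`, which mirrors the behavior described above for fluorescence-like backgrounds and Raman peaks.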

A comparative analysis of theoretical (DFT) Raman calculations and the experimental Raman spectra of PAHs is presented in Fig. 3. (Tabulated theoretical Raman peak assignments, along with their corresponding vibrational modes, are provided in SI Appendix, Table S1.) First, the Raman spectra of PYR and ANTH were analyzed to compare DFT-calculated and experimental results. For PYR (Fig. 3A), the DFT calculations accurately predicted key vibrational modes previously identified in Fig. 2, including the peaks at 408, 590, and 1,250 cm⁻¹. Similarly, for ANTH (Fig. 3B), the DFT calculations successfully captured signature peaks at 754 and 1,539 cm⁻¹. This pattern was consistent across eight additional PAHs analyzed (SI Appendix, Fig. S4).

To evaluate the potential of DFT-calculated Raman spectra as ground truth for SERS-based PAH detection, we analyzed soil contaminated with a 1:1 PYR:ANTH mixture (Fig. 3C). The experimental SERS spectrum (Fig. 3 C, i) shows multiple peaks from both contaminants along with background signals from the soil matrix. By applying the CaPE algorithm to this spectrum, we obtain the extracted peaks shown in Fig. 3 C, ii. The DFT-calculated Raman spectra for ANTH (Fig. 3 C, iii) and PYR (Fig. 3 C, iv) serve as reference spectra for calculating similarity values using CaPSim. The gray bars highlight the matching peak positions between the mixture and reference spectra, suggesting the potential of this method to identify PAHs extracted from soil matrices. The reliability of DFT calculations for contaminant detection was further validated by comparing similarity values between DFT-calculated and experimental Raman spectra across ten different PAH molecules (Fig. 3D). We employed Pearson’s correlation as our similarity metric, whose values range from −1 to 1, where values closer to 1 indicate stronger spectral matches. For all PAHs, CaPSim consistently yields substantially higher similarity values than NormSim, with CaPSim values typically above 0.6 compared to NormSim values that fall below 0.3. These results confirm the accuracy of DFT in predicting spectral features and highlight how CaPE and CaPSim effectively focus on these well-predicted features while remaining robust to peak shifts. It is important to note that the reported similarity values represent the comparison between experimental and DFT-calculated PAH spectra. For actual detection of PAHs in soil samples, similarity values were also calculated between soil SERS spectra and DFT-calculated references. These values serve as input features for our logistic regression model, which is used to determine PAH presence. Collectively, these analyses highlight the two main applications of similarity metrics in this study: validating DFT-calculated spectra as reliable reference standards and detecting PAHs in complex environmental samples.

The effectiveness of CaPSim over NormSim becomes evident when analyzing spectral similarities (SI Appendix, Fig. S5). Using a PYR-contaminated soil sample as an example, we compared both positive matching scenarios (PYR detection) and negative matching scenarios (ANTH detection). NormSim, which does not account for peak shifts, yields a much lower similarity contribution for positive matches (e.g., for the two peaks of PYR at around 1,100 cm⁻¹). Furthermore, NormSim fails to assign appropriate (low) similarity contribution values when peak overlaps exist in negative scenarios. In contrast, CaPSim provides appropriate similarity contributions at each wavenumber, generating high contributions for true peak alignments in positive matches while minimizing contributions in negative matches.
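The sensitivity of a full-spectrum Pearson correlation to peak shifts can be illustrated with a toy calculation. The sketch below is not CaPSim or the exact NormSim pipeline; it only computes Pearson's correlation between synthetic single-peak spectra, and the helper names, peak width, wavenumber axis, and 12 cm⁻¹ shift are hypothetical values chosen for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length spectra."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def gaussian_peak(axis, center, width=4.0):
    """Synthetic single-peak spectrum (hypothetical line shape)."""
    return [math.exp(-0.5 * ((v - center) / width) ** 2) for v in axis]

axis = list(range(1000, 1200))          # wavenumber axis, cm^-1
reference = gaussian_peak(axis, 1100)   # stand-in for a reference peak
aligned = gaussian_peak(axis, 1100)     # same peak, no shift
shifted = gaussian_peak(axis, 1112)     # same peak, shifted by 12 cm^-1

sim_aligned = pearson(reference, aligned)
sim_shifted = pearson(reference, shifted)
```

Here `sim_aligned` is essentially 1, while `sim_shifted` is far lower even though both spectra contain the same peak, which is the failure mode a shift-tolerant, peak-based comparison is meant to avoid.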

While similarity values could potentially correlate with concentration levels, our current implementation focuses on binary classification, detecting PAH presence or absence in soil samples. This design choice aligns with environmental monitoring priorities, where rapid screening for contamination typically precedes detailed quantitative analysis. Determining concentration levels would require additional calibration and validation to establish quantitative relationships between similarity scores and concentration levels for each compound.

CaPSim’s focus on characteristic peaks, rather than the entire spectrum, accounts for its superior ability to differentiate between contaminated and as-collected samples, as shown in Fig. 4. This quantitative evaluation, highlighted in Fig. 4A, directly compares the performance of CaPSim and NormSim in differentiating contaminated from as-collected/reference soil samples. The histograms present similarity values between each soil sample and the reference PAH spectrum, showing a distinct separation in CaPSim (top row) between contaminated samples (orange/pink) and as-collected samples (green/purple) for both PYR and ANTH. In contrast, NormSim (bottom row) shows significant overlap between these distributions, indicating a less effective detection capability.

Fig. 4. Similarity value distributions and detection performance comparing CaPSim and NormSim. (A) Histograms showing the distribution of similarity values for PYR (Left) and ANTH (Right) using CaPSim (Top) and NormSim (Bottom) algorithms. In each histogram, the contaminated sample (with PAH) is represented in orange/pink, while the as-collected sample (without PAH) is in green/purple. The Wasserstein distance (WD), which quantifies the dissimilarity between these distributions (details in Materials and Methods), is indicated in the upper left corner of each plot. (B) Detection performance comparison between NormSim and CaPSim using experimental spectra and DFT-calculated spectra as references. Results show area under the receiver operating characteristic curve (AUROC) values for PYR detection (Top left), ANTH detection (Top right), and their average (Bottom left) using Pearson’s correlation as the similarity metric. The bottom right panel compares performance across six different similarity metrics (A–F) using DFT-calculated spectra. Error bars represent SD from five different train-validation-test splits.

To quantify this separation, we measured the WD between distributions (Materials and Methods), with higher values indicating a better distinction between contaminated and as-collected samples. For PYR, CaPSim achieves a WD of 0.642, significantly outperforming NormSim, which reaches only 0.079. Similarly, for ANTH, CaPSim yields a value of 0.312, while NormSim achieves only 0.090. These results demonstrate the superior performance of CaPE and CaPSim in isolating characteristic spectral features and improving detection accuracy, even when using DFT-calculated spectra as reference. Additional technical details on the statistical analysis are provided in Materials and Methods.
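As a rough illustration of how such a separation value can be computed, the sketch below evaluates the one-dimensional 1-Wasserstein distance between two sets of similarity values by integrating the absolute difference of their empirical cumulative distributions. It is a generic textbook formulation (the same quantity `scipy.stats.wasserstein_distance` returns for unweighted samples) and may differ in detail from the procedure in Materials and Methods; the similarity values are hypothetical.

```python
import bisect

def wasserstein_1d(u, v):
    """1-Wasserstein distance between two empirical 1D distributions."""
    u, v = sorted(u), sorted(v)
    points = sorted(u + v)
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Empirical CDFs are constant on the interval [left, right).
        cdf_u = bisect.bisect_right(u, left) / len(u)
        cdf_v = bisect.bisect_right(v, left) / len(v)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total

# Hypothetical similarity values for contaminated and as-collected samples.
contaminated = [0.70, 0.80, 0.75]
as_collected = [0.10, 0.20, 0.15]
separation = wasserstein_1d(contaminated, as_collected)
```

Well-separated histograms give a large distance, while strongly overlapping ones give a value near zero, which is why a higher WD indicates a more useful similarity metric for detection.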

This capability is particularly significant for environmental monitoring. While experimental Raman libraries exist for common PAHs, these parent compounds can be modified in innumerable ways to form PACs, which are often more toxic than the original PAHs (14). Most PACs have never been isolated and purified, so experimental reference Raman spectra are largely unavailable. As shown in Fig. 4B, we evaluated detection performance using both experimental and DFT-calculated reference spectra. Logistic regression models were trained using each similarity metric to classify contaminants in soil spectra, with algorithm performance evaluated through the area under the receiver operating characteristic curve (AUROC; see Materials and Methods for details) (23). Detection performance with DFT-calculated reference spectra is comparable to that with experimental reference spectra, especially when combined with our CaPSim approach. Additionally, CaPSim consistently outperforms NormSim across different similarity metrics (Fig. 4 B, Bottom right), demonstrating low variability between tests. This robust performance suggests that our DFT-CaPSim approach could be valuable for environmental risk assessment applications.
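The classification and evaluation step can be sketched as follows. This is a minimal illustration, not the trained models or data splits used in this work: it fits a one-feature logistic regression by plain gradient descent and computes AUROC from pairwise rankings. The function names, learning rate, epoch count, and similarity scores are all hypothetical.

```python
import math

def fit_logistic_1d(scores, labels, lr=0.5, epochs=2000):
    """Fit P(y=1|s) = sigmoid(w*s + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(w * s + b)))
            grad_w += (p - y) * s
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def auroc(probs, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical similarity scores: label 1 = contaminated, 0 = as-collected.
scores = [0.05, 0.12, 0.20, 0.31, 0.55, 0.62, 0.70, 0.81]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic_1d(scores, labels)
probs = [1.0 / (1.0 + math.exp(-(w * s + b))) for s in scores]
score_auroc = auroc(probs, labels)
```

For brevity the sketch scores the training data itself; an actual evaluation would compute AUROC on held-out samples, as done here with train-validation-test splits.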

Conclusion

This study demonstrates the synergy of ML algorithms, SERS, and DFT theoretical calculations for detecting PAHs in soil samples. Our results show that CaPE and CaPSim significantly outperform traditional full-spectrum comparison approaches, in particular for samples obtained from complex environmental matrices where background noise can interfere with detection. We demonstrate that combining CaPE and CaPSim with DFT-calculated theoretical reference Raman spectra achieves a detection performance comparable to that obtained with experimental Raman reference spectra. This combination of theoretically calculated Raman spectra and physics-informed ML algorithms may ultimately eliminate the need for pure reference samples, particularly in cases where they are not available. These principles can be extended to detect other contaminants in diverse environmental samples, offering a cost-effective, computationally driven method for environmental monitoring. While demonstrated here for SERS, this approach should work equally well for IR spectroscopic detection, where surface-enhanced infrared absorption spectroscopy (SEIRA) is particularly useful for more polar, hydrophilic molecules with heteroatomic bonds (24). The approach could be particularly valuable for emerging contaminants where obtaining pure reference samples is challenging or when rapid screening of multiple potential contaminants is required. This work establishes a framework for building comprehensive in silico spectral libraries that could enable the detection and identification of environmental contaminants. Future research should systematically evaluate performance on increasingly complex mixtures containing multiple PAHs, as spectral complexity may increase with spectral overlap. Our CaPE and CaPSim algorithms are designed to handle multiple compounds through characteristic peak analysis, which could be applicable to a range of environmental samples.

Materials and Methods

Materials.

(3-Aminopropyl)triethoxysilane (ES, 99%), tetrachloroauric acid (HAuCl₄·3H₂O), tetrakis(hydroxymethyl)phosphonium chloride (THPC), poly-L-lysine (PLL) hydrobromide (MW 150,000 to 300,000), pyrene, and anthracene were purchased from Sigma-Aldrich. Formaldehyde (37%), sulfuric acid (H₂SO₄, 100%), hydrogen peroxide (H₂O₂, 30%), potassium dihydrogen phosphate (KH₂PO₄), and 200-proof ethanol were obtained from Fisher Scientific. All chemicals were used as received without further purification. Water was deionized and filtered by a Milli-Q water system (18.2 MΩ·cm at 25 °C, Millipore). Quartz slides for substrate fabrication were obtained from Fisher Scientific. The 120 nm diameter aminated SiO₂ cores in ethanol (10 mg/mL) were purchased from nanoComposix, Inc., San Diego, CA.

Synthesis of Au Nanoshells (Au NS).

The Au NSs were synthesized using a previously reported procedure (25). Small Au colloids (1 to 3 nm diameters) were synthesized by reducing chloroauric acid with THPC as the reducing agent. Then, 300 μL of aminated SiO₂ cores were incubated with 20 mL of the Au colloids for 24 h to enable the attachment of the Au colloids onto the aminated SiO₂ surface, followed by multiple cleaning cycles to remove the excess colloids. Following this step, an electroless plating process was performed to deposit a continuous Au shell onto the SiO₂–Au colloid surface. The electroless deposition process involved the reduction of Au from a 3 mL solution of 1.8 mM potassium carbonate and 0.4 μM chloroauric acid with 15 μL formaldehyde. The Au NSs were fabricated with the core and shell radii [r₁, r₂] = [200, 215] nm, corresponding to an Au NS dipole plasmon resonance of 1,067 nm and a quadrupole plasmon resonance of 769 nm in H₂O.

SERS Substrate Preparation.

Cleaned fused quartz microscope slides (Thermo Scientific, 76.2 × 25.4 mm) were immersed in a 0.01% w/v aqueous solution of PLL (MW 150,000 to 300,000) for 5 to facilitate subsequent attachment of nanoparticles on the quartz surface. The surfaces were rinsed with water and dried in a flow of N₂. Cut silicone isolators (Grace Bio-Labs Press-To-Seal silicone isolator, No PSA, 9 mm diameter) were placed on the quartz-PLL substrate, followed by drop-drying 100 μL of a 10¹⁰ particles/mL Au NS aqueous solution in the isolator well. Then, 20 μL of each sample was dried directly on the Au NS SERS substrates. Before acquiring the SERS spectra, the substrates were fully immersed in Milli-Q H₂O.

Analysis of Solvent-Extractable PAHs from Contaminated Soil.

Background soil was collected from Rice University in Houston, TX, and consisted of 43% clay and 37% sand. After collection, the soil was dried at 60 °C for 72 h and sieved to a particle size of less than 1 mm. To prepare the PAH-contaminated soil, 2.5 g of as-collected soil was mixed with pyrene, anthracene, or a mixture of both, at concentrations ranging from 10 to 500 mg/mL. While this spiking approach may not accurately reproduce the complex PAH weathering processes observed in soils that have been contaminated for a long time, it serves as a reductionist approach that minimizes confounding variables and facilitates testing our physics-informed machine learning algorithm to detect and identify specific PAHs in soil samples (26–28). After spiking, the soil mixture was capped, shaken for 2 min to promote absorption, and left to dry at room temperature until the acetone completely evaporated. Once contaminated, the soil underwent PAH extraction using two extraction methods, a) filtration and b) ASE, followed by analysis via GC–MS. To ensure reliability, our filtration method was benchmarked against ASE, a well-established standard extraction technique (28).

Filtration extraction.

The extraction was performed with 15 mL of HPLC-grade acetone, after which the solution was shaken for 2 min, left to settle for 1 min, and filtered through a 0.20 μm filter into 2 mL GC–MS vials.

ASE.

A Dionex ASE 350 was used to extract PAHs from contaminated soils following EPA Method 3545A. In brief, 2.5 g of contaminated soil was mixed with 0.5 g of diatomaceous earth, a drying agent, and placed in 5 mL extraction cells. The cells were then loaded into the ASE 350 with the following conditions: system pressure of 1,500 psi, oven temperature of 100 °C, a 5-min heat-up time, and a static time of 1 min. HPLC-grade acetone was used as the solvent, with a flush volume of 60% of the extraction cell volume and a nitrogen purge at 130 psi for 60 s.

PAHs were quantified using a gas chromatograph (Agilent 7820A) equipped with a mass spectrometer (Agilent 5977E) following EPA Method 8275A, as previously described (8). An HP-5 ms Ultra Inert capillary column (30 m × 0.25 mm × 0.25 μm film thickness) from Agilent Technologies was used. Helium was the carrier gas with a 1 mL/min flow rate, and splitless injection was performed at an inlet temperature of 275 °C. The initial column temperature was set at 80 °C and held for 1 min, then ramped at 25 °C/min to 200 °C, followed by an increase to 256 °C at 8 °C/min, where it was held for 5 min. Pyrene was quantified using selected ion monitoring (SIM) mode with target ions at m/z 202 and 101, while anthracene was quantified using m/z 178.

Measurements.

SERS measurements were acquired with a Renishaw inVia Raman microscope (Renishaw) with a 785-nm excitation wavelength and 55-μW laser power at the samples. Backscattered light was collected using a 63× water immersion objective lens (Leica, NA = 0.9) with a 20-s exposure time. Extinction measurements were performed on a Cary 5000 UV/Vis/NIR Varian spectrophotometer. SEM measurements were performed using an FEI Quanta 400 field emission SEM at an acceleration voltage of 20 kV. GC–MS with an Agilent 7820A gas chromatograph, coupled with an Agilent 5977E mass spectrometer, was used to quantify the PAH extraction. The separation was performed with an HP-5 ms Ultra Inert capillary column (30 m × 0.25 mm × 0.25 μm film thickness) from Agilent Technologies. For ASE, a Dionex ASE 350 was employed to extract PAHs from contaminated soils in accordance with EPA Method 3545A (28).

Simulated PAH Spectra.

The DFT calculations were performed using the Gaussian 16 program (29) and visualized with GaussView 6.0 software (30). We employed the B3LYP exchange-correlation functional (31) in combination with the Pople 6-31G(d,p) basis set (32). Benchmarking tests with the larger triple-zeta 6-311G(2d,p) basis set did not significantly improve the predicted PAH wavenumbers compared to available experimental data (SI Appendix, Fig. S1). Electronic structures were converged through iterative solution of the self-consistent field (SCF) equations until the energy difference between iterations reached a convergence threshold of 10⁻⁸ Hartree. Geometry optimization was achieved by minimizing forces to within 3.0 × 10⁻⁴ Hartree Bohr⁻¹. Structural minima were confirmed by ensuring the absence of imaginary frequencies during vibrational analyses. A scaling factor of 0.977 was applied to the calculated wavenumbers; this factor was determined by minimizing the root-mean-square error (RMSE) between the calculated wavenumbers of a set of 13 PAHs and their corresponding experimental wavenumbers (see further details in SI Appendix). (We also could have used the recommended value from the NIST database, 0.961 for the B3LYP functional, without significant changes in performance.) The list of PAHs and background molecules computed using DFT is provided in SI Appendix, Table S1. Further details on the computational settings and the algorithm used to generate the spectra in Gaussian are available in the methods section of Sánchez-Alvarado et al. (24).
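For a single multiplicative scaling factor, minimizing the RMSE between scaled calculated wavenumbers and experimental wavenumbers has a closed-form least-squares solution, λ = Σ ωᵢνᵢ / Σ ωᵢ², where ωᵢ are calculated and νᵢ experimental wavenumbers. The sketch below illustrates that calculation under this assumption; it is not the fitting procedure from the SI Appendix, and the "calculated" wavenumbers are hypothetical placeholders paired with the PYR peak positions quoted earlier.

```python
import math

def fit_scaling_factor(calculated, experimental):
    """Least-squares factor lam minimizing sum((lam*calc - exp)**2)."""
    numerator = sum(c * e for c, e in zip(calculated, experimental))
    denominator = sum(c * c for c in calculated)
    return numerator / denominator

def rmse(calculated, experimental, factor):
    """Root-mean-square error after scaling the calculated wavenumbers."""
    squared = [(factor * c - e) ** 2 for c, e in zip(calculated, experimental)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical unscaled DFT wavenumbers (cm^-1) paired with the PYR
# peak positions quoted in the text (408, 590, 1,250, 1,408 cm^-1).
calculated = [416.0, 604.0, 1280.0, 1441.0]
experimental = [408.0, 590.0, 1250.0, 1408.0]

factor = fit_scaling_factor(calculated, experimental)
error_scaled = rmse(calculated, experimental, factor)
error_unscaled = rmse(calculated, experimental, 1.0)
```

Because the factor is the exact minimizer of the squared error, `error_scaled` can never exceed `error_unscaled`; pooling peaks from many molecules, as done here with 13 PAHs, makes the fitted factor less sensitive to any single assignment.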

When applying CaPSim, we focused exclusively on the specific locations of the peaks in the spectra rather than the peak amplitudes, as wavenumbers directly correspond to the frequencies of absorbed radiation. Each chemical bond exhibits characteristic vibrational frequencies associated with its stretching and bending motions, reflected as peaks in the Raman spectrum.

Computational Methods.

Machine learning plays a vital role in two key stages of our approach. First, the CaPE algorithm is utilized to identify characteristic peak locations from both experimental and DFT-calculated spectra. Second, a logistic regression model is trained on the similarity values between soil and reference spectra to detect the presence of PAHs. In this and the following sections, we describe the proposed contaminant detection strategy, the similarity metrics used, and the evaluation procedures in detail. Let $q \in \mathbb{R}^{D}$ represent the query spectrum of a soil sample, where $D$ denotes the dimensionality of the spectral data. We aim to determine the presence of a target contaminant by computing a similarity score between $q$ and the reference spectrum of the contaminant $r \in \mathbb{R}^{D}$. The similarity metric $S : \mathbb{R}^{D} \times \mathbb{R}^{D} \to \mathbb{R}$ is defined as a function that measures the degree of alignment between the spectra $q$ and $r$. We evaluated six different metrics, which are discussed below. In general, higher similarity values indicate greater similarity. For cases where multiple reference spectra $R = [r_{1}, \dots, r_{n}] \in \mathbb{R}^{D \times n}$ of the contaminant are available, we compute the mean similarity $\underline{S} (q, R)$ as

\underline{S} (q, R) = \frac{1}{n} \sum_{i = 1}^{n} S (q, r_{i}),

where $r_{i}$ represents the $i$th reference spectrum of the target contaminant. This aggregated similarity score $\underline{S} (q, R)$ is then input into a trained logistic regression model, which converts the similarity score into a probability of contamination. The probability $P (q) = P (y = 1 | q)$ that the query soil sample $q$ contains the target contaminant is given by

P (q) = \frac{1}{1 + \exp [- (β_{0} + β_{1} \underline{S} (q, R))]},

where $β_{0}$ and $β_{1}$ are the logistic regression model parameters learned from the training data. This probability score is then used to make the final prediction, where typically $P (q) > 0.5$ indicates the presence of the contaminant in the soil sample, and $P (q) \leq 0.5$ indicates its absence.
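The two formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names, the toy spectra, the use of a plain dot product as the similarity, and the parameter values β₀ = −1.5 and β₁ = 2 are all assumptions chosen for the example.

```python
import math

def mean_similarity(query, references, similarity):
    """Average S(q, r_i) over all reference spectra of one contaminant."""
    return sum(similarity(query, r) for r in references) / len(references)

def contamination_probability(mean_sim, beta0, beta1):
    """Logistic model mapping the mean similarity to a probability."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * mean_sim)))

def dot(q, r):
    """Plain dot product, used here only as a placeholder similarity."""
    return sum(a * b for a, b in zip(q, r))

# Toy data (hypothetical): one query spectrum and two reference spectra.
query = [0.0, 1.0, 0.0, 1.0]
references = [[0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

s_bar = mean_similarity(query, references, dot)
p = contamination_probability(s_bar, beta0=-1.5, beta1=2.0)  # assumed parameters
predicted_contaminated = p > 0.5
print(s_bar, round(p, 4), predicted_contaminated)
```

In practice, the two parameters would be learned from the training split rather than set by hand.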

Preprocessing.

We follow a standard preprocessing routine widely used in related work, beginning with a resampling step to ensure all spectra are aligned to the same set of wavenumbers. Next, we remove the cosmic ray artifacts using the Whitaker and Hayes algorithm (33). Then, we apply the Savitzky–Golay filter with a window size of 11 and polynomial order of 3 to smooth the spectra (34); this filter preserves peak shapes while reducing noise, which is essential for accurate and robust peak matching. Next, we perform a baseline removal step that eliminates the slow-changing trends typically present in the SERS and Raman spectra using the airPLS method (ZhangFit with the default parameters) from the BaselineRemoval Python package (35). These baseline trends, often shared across different molecules, can interfere with spectral matching algorithms by introducing unwanted signals, making their removal essential for improving detection accuracy (22, 36). Finally, a scalar noise floor is estimated and subtracted, as finer-grained noise in the spectrum may remain after baseline removal. The noise floor is estimated using the median of medians approach, inspired by the median filtering techniques commonly used in image processing (37).
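As a rough illustration of the first and last steps of this routine, the Python sketch below resamples a spectrum onto a shared wavenumber grid by linear interpolation and subtracts a scalar noise floor. The noise floor here is taken as the median of per-window medians, which is only one possible reading of the median of medians approach; the window length, function names, and toy values are assumptions. Despiking, Savitzky–Golay smoothing, and airPLS baseline removal are omitted and would rely on established implementations.

```python
from statistics import median

def resample(wavenumbers, intensities, grid):
    """Linearly interpolate a spectrum onto `grid`.

    Assumes `wavenumbers` is strictly ascending and `grid` is ascending;
    values outside the measured range are clamped to the end points.
    """
    out, j = [], 0
    for w in grid:
        if w <= wavenumbers[0]:
            out.append(intensities[0])
        elif w >= wavenumbers[-1]:
            out.append(intensities[-1])
        else:
            while wavenumbers[j + 1] < w:
                j += 1
            x0, x1 = wavenumbers[j], wavenumbers[j + 1]
            y0, y1 = intensities[j], intensities[j + 1]
            out.append(y0 + (y1 - y0) * (w - x0) / (x1 - x0))
    return out

def subtract_noise_floor(intensities, window=5):
    """Subtract a scalar floor: the median of the medians of fixed windows.

    The window length is an assumption made for this sketch.
    """
    chunks = [intensities[i:i + window] for i in range(0, len(intensities), window)]
    floor = median([median(c) for c in chunks])
    return [y - floor for y in intensities], floor

resampled = resample([100.0, 102.0, 104.0], [1.0, 3.0, 5.0], [100.0, 101.0, 103.0, 104.0])
cleaned, floor = subtract_noise_floor([1, 1, 2, 9, 1, 2, 2, 1, 2, 8], window=5)
print(resampled, floor)
```

Using medians rather than means keeps the estimated floor insensitive to the few large values that correspond to real peaks.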

We also explored different normalization methods, including 0–1 normalization ($q^{'} = (q - \min_{k} q_{k}) / (\max_{k} q_{k} - \min_{k} q_{k})$), $l_{\infty}$-normalization ($q^{'} = q / {‖ q ‖}_{\infty}$), $l_{2}$-normalization ($q^{'} = q / {‖ q ‖}_{2}$), and no normalization ($q^{'} = q$). Normalization can be useful for adjusting spectra that are originally on different scales, but each method has its own advantages and disadvantages. For example, while 0–1 normalization ensures all data are transformed to a consistent range, it can amplify noise when the characteristic peak intensities are low. On the other hand, $l_{2}$-normalization effectively reduces the influence of noisy spectra by scaling down their intensity, but it may introduce inconsistent scales that are unsuitable for methods relying on fixed thresholds. The $l_{\infty}$-normalization balances scaling but remains sensitive to the relative scale between the highest peak and other peaks, which can vary unpredictably across spectra. Ultimately, the performance of these normalization methods, compared to no normalization, should be evaluated on a validation set in practice.
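The four options can be written compactly as follows. This sketch assumes the standard definitions: min–max scaling for 0–1 normalization, division by the largest absolute value for the $l_{\infty}$ case, and division by the Euclidean norm for the $l_{2}$ case. The function name and method labels are hypothetical.

```python
import math

def normalize(q, method="none"):
    """Return a normalized copy of spectrum q (a list of floats)."""
    if method == "0-1":
        lo, hi = min(q), max(q)
        return [(x - lo) / (hi - lo) for x in q]
    if method == "linf":
        peak = max(abs(x) for x in q)
        return [x / peak for x in q]
    if method == "l2":
        norm = math.sqrt(sum(x * x for x in q))
        return [x / norm for x in q]
    if method == "none":
        return list(q)
    raise ValueError(f"unknown normalization method: {method}")

spectrum = [1.0, 3.0, 5.0]  # toy intensities, for illustration only
for name in ("0-1", "linf", "l2", "none"):
    print(name, normalize(spectrum, name))
```

A constant spectrum would make the 0–1 and norm-based variants divide by zero, so a production routine would need to guard against that case.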

Similarity Metrics.

We considered five commonly used similarity metrics from the literature for evaluating spectral matching (38, 39): Pearson’s correlation, Euclidean distance, cosine similarity, first-difference cosine similarity, and soft intersection over union, as well as a new metric that we propose, the weighted dot product. The definitions of these metrics are provided below using the following helper variables: $\bar{q} = \sum_{k = 1}^{D} q_{k} / D$, $\bar{r} = \sum_{k = 1}^{D} r_{k} / D$, $\bar{q^{2}} = \sum_{k = 1}^{D} q_{k}^{2} / D$, $\bar{r^{2}} = \sum_{k = 1}^{D} r_{k}^{2} / D$, $\bar{qr} = \sum_{k = 1}^{D} q_{k} r_{k} / D$, $Δ q_{k} = q_{k + 1} - q_{k}$ and $Δ r_{k} = r_{k + 1} - r_{k}$, $\bar{{Δ q}^{2}} = \sum_{k = 1}^{D - 1} {Δ q}_{k}^{2} / (D - 1)$, $\bar{{Δ r}^{2}} = \sum_{k = 1}^{D - 1} {Δ r}_{k}^{2} / (D - 1)$, and $\bar{Δ q Δ r} = \sum_{k = 1}^{D - 1} {Δ q}_{k} {Δ r}_{k} / (D - 1)$.

Then, the similarity metrics are defined as

1. Pearson’s correlation:

S_{Pearson} (q, r) = \frac{\bar{qr} - \bar{q} \bar{r}}{\sqrt{(\bar{q^{2}} - {\bar{q}}^{2}) (\bar{r^{2}} - {\bar{r}}^{2})}},

2. Cosine similarity:

S_{cossim} (q, r) = \frac{{\bar{qr}}^{2}}{\bar{q^{2}} \cdot \bar{r^{2}}},

3. Dot product:

S_{dot} (q, r) = D \bar{qr},

4. First-difference cosine similarity:

S_{cossim1stdiff} (q, r) = \frac{{\bar{Δ q Δ r}}^{2}}{\bar{{Δ q}^{2}} \cdot \bar{{Δ r}^{2}}},

5. Soft intersection over union:

S_{softIOU} (q, r) = \frac{\bar{qr}}{\bar{q} + \bar{r} - \bar{qr}},

which measures the overlap between the characteristic peaks of the query and reference spectra, taking into account both peak positions and intensities.

6. Weighted dot product:

S_{weighteddot} (q, r) = {(D \bar{qr})}^{a} \cdot (1 - e^{- \frac{c_{t}}{C}}),

where $a \in [0, 1]$ controls how much weight is given to larger peaks. For $a = 1$, the first factor becomes a standard dot product, while for $a = 0$, it drops out and the score reduces to a gated count of matched peaks that gives more credit to low-intensity peaks. Also, $c_{t}$ represents the number of matched peaks between the input query $q$ and the reference $r$, given a counting threshold $t$. A match occurs wherever the element-wise product between $q$ and $r$ exceeds the threshold $t$. Formally, the count of matched peaks is defined as $c_{t} = \sum_{k = 1}^{D} I [q_{k} r_{k} > t]$, where $I [\cdot]$ is an indicator function that is 1 if the condition is met and 0 otherwise. The threshold $t$ is set to a fixed value, which can be determined through tuning with a validation set. Alternatively, a “knee locator” can be used to identify a value of $t$ that corresponds to a knee point in the plot of $c_{t}$ vs. $t$, which we hypothesize captures most of the important peak matches (40). Finally, $C$ is a scaling parameter that controls when the “gating” function $1 - e^{- c_{t} / C}$ activates. This function ensures that the dot product contributes meaningfully only when multiple peaks are matched, preventing accidental matches between background noise and contaminant peaks from skewing the results.

Each similarity metric has its own characteristic range: Pearson’s correlation ranges from −1 to 1, with 1 indicating perfect correlation. Cosine similarity, first-difference cosine similarity, and soft intersection over union are bounded between 0 and 1. In contrast, the dot product and weighted dot product do not have fixed upper bounds, as their values depend on the input spectra intensities. While we evaluated multiple similarity metrics, Pearson’s correlation was chosen for our primary analysis due to its well-defined range (−1 to 1) and robustness to intensity variations in spectral data.
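For readers who prefer code to notation, the six numbered definitions can be sketched in plain Python as below. Pearson’s correlation is written in its standard form (covariance divided by the product of the standard deviations), and the two cosine variants use the squared form as printed in the definitions. The function names and the toy spectra are assumptions, and the weighted dot product assumes non-negative spectra so that the fractional power is well defined.

```python
import math

def _mean(xs):
    return sum(xs) / len(xs)

def _mean_prod(xs, ys):
    return _mean([x * y for x, y in zip(xs, ys)])

def _diff(xs):
    """First differences: x[k+1] - x[k]."""
    return [b - a for a, b in zip(xs, xs[1:])]

def pearson(q, r):
    cov = _mean_prod(q, r) - _mean(q) * _mean(r)
    var_q = _mean_prod(q, q) - _mean(q) ** 2
    var_r = _mean_prod(r, r) - _mean(r) ** 2
    return cov / math.sqrt(var_q * var_r)

def cossim(q, r):
    # Squared form, following the printed definition.
    return _mean_prod(q, r) ** 2 / (_mean_prod(q, q) * _mean_prod(r, r))

def dot(q, r):
    return sum(x * y for x, y in zip(q, r))

def cossim_first_diff(q, r):
    return cossim(_diff(q), _diff(r))

def soft_iou(q, r):
    qr = _mean_prod(q, r)
    return qr / (_mean(q) + _mean(r) - qr)

def weighted_dot(q, r, a, t, C):
    # c_t counts positions whose element-wise product exceeds the threshold t.
    c_t = sum(1 for x, y in zip(q, r) if x * y > t)
    # Assumes dot(q, r) >= 0 (non-negative spectra).
    return dot(q, r) ** a * (1.0 - math.exp(-c_t / C))

# Toy spectra (hypothetical): r is a scaled copy of q.
q = [0.0, 1.0, 0.0, 0.5]
r = [0.0, 0.8, 0.0, 0.4]
print(pearson(q, r), cossim(q, r), dot(q, r), soft_iou(q, r))
print(weighted_dot(q, r, a=1.0, t=0.1, C=1.0))
```

Because the toy reference is a scaled copy of the query, the scale-invariant metrics (Pearson and both cosine variants) return 1 up to rounding, whereas the dot-product-based metrics depend on the absolute intensities.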

Normal Similarity (NormSim) and Characteristic Peak Similarity (CaPSim).

NormSim and CaPSim refer to two different approaches for calculating spectral similarity in the context of contaminant detection, and this notion can be applied to any similarity metric $S$ . In NormSim, the similarity between the query spectrum $q$ and the reference spectrum $r$ is computed using the full spectral data, meaning that all wavenumbers and their corresponding intensities are considered in the calculation. This approach captures the entire spectral profile but may include irrelevant regions with no characteristic peaks, potentially introducing noise into the similarity assessment. In contrast, CaPSim focuses on the key spectral features by calculating the similarity only between the extracted peaks of the query and reference spectra. By narrowing the comparison to specific, informative wavenumber regions, CaPSim reduces the influence of background noise and enhances the ability to match contaminants accurately. Similar to the original CaPSim paper, we applied the characteristic peak extraction (CaPE) algorithm for peak extraction (15, 16).
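The contrast between the two approaches can be illustrated with a deliberately simplified Python sketch. It does not reproduce CaPE: the peak locations are supplied by hand as hypothetical indices, and the comparison is restricted to small windows around them, whereas the actual algorithm learns the characteristic peak locations from the spectra. The function names, window half-width, and toy spectra are assumptions.

```python
def cossim(q, r):
    """Squared cosine similarity between two equally long spectra."""
    qr = sum(x * y for x, y in zip(q, r))
    return qr ** 2 / (sum(x * x for x in q) * sum(y * y for y in r))

def restrict_to_peaks(spectrum, peak_indices, half_width=1):
    """Keep only the points within `half_width` of any listed peak index."""
    keep = set()
    for p in peak_indices:
        lo, hi = max(0, p - half_width), min(len(spectrum), p + half_width + 1)
        keep.update(range(lo, hi))
    return [spectrum[k] for k in sorted(keep)]

def normsim(q, r, similarity):
    """Similarity over the full spectra."""
    return similarity(q, r)

def capsim_like(q, r, peak_indices, similarity, half_width=1):
    """Similarity over the neighborhoods of the given peak locations only."""
    return similarity(
        restrict_to_peaks(q, peak_indices, half_width),
        restrict_to_peaks(r, peak_indices, half_width),
    )

# Toy example: a single reference peak at index 2; the query carries the same
# peak on top of a flat background of 0.5.
reference = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
query = [0.5, 0.5, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5]

full = normsim(query, reference, cossim)
peaks_only = capsim_like(query, reference, [2], cossim, half_width=1)
print(round(full, 3), round(peaks_only, 3))
```

In this toy case the flat background lowers the full-spectrum score, while the peak-restricted score is less affected, which mirrors the motivation given above.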

The WD.

To quantify the dissimilarity between the (probability) distributions of similarity scores for contaminated and as-collected samples, we employ the WD, also known as the Earth Mover’s Distance. This metric provides a robust measure of the separation between probability distributions, with larger values indicating greater dissimilarity. Unlike simple summary statistics such as means or medians, the WD takes into account the overall shape and spread of the distributions, making it particularly suitable for assessing the discriminative power of our detection methods. Given two sets of samples $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ and $Y = \{y_{1}, y_{2}, \dots, y_{m}\}$ , we can compute the 1-WD as follows:

Step 1: Obtain empirical cumulative distribution functions (ECDFs). First, we construct the ECDFs for $X$ and $Y$:

{\hat{F}}_{n} (t) = \frac{1}{n} \sum_{i = 1}^{n} I [x_{i} \leq t], {\hat{G}}_{m} (t) = \frac{1}{m} \sum_{j = 1}^{m} I [y_{j} \leq t],

where $I [\cdot]$ is the indicator function.

Step 2: Obtain quantile functions. Next, we define the empirical quantile functions (inverse of ECDFs):

{\hat{F}}_{n}^{- 1} (p) = \inf \{t \in R : {\hat{F}}_{n} (t) \geq p\}, {\hat{G}}_{m}^{- 1} (p) = \inf \{t \in R : {\hat{G}}_{m} (t) \geq p\} .

Step 3: Compute the WD. For one-dimensional data, the 1-WD has a closed-form solution:

W_{1} ({\hat{F}}_{n}, {\hat{G}}_{m}) = \int_{0}^{1} |{\hat{F}}_{n}^{- 1} (p) - {\hat{G}}_{m}^{- 1} (p)| d p .

Step 4: Discrete approximation. In practice, we compute this integral using a discrete approximation:

W_{1} (X, Y) \approx \frac{1}{N} \sum_{i = 1}^{N} |{\hat{F}}_{n}^{- 1} (\frac{i}{N}) - {\hat{G}}_{m}^{- 1} (\frac{i}{N})|,

where $N$ is a large number (e.g., $\max (n, m)$ or $n + m - 1$).
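Steps 1 to 4 can be condensed into a short Python function. The sketch below evaluates the empirical quantile functions at $p = i / N$ using integer arithmetic, so that no floating-point rounding affects the quantile index. The function name, the default $N = \max (n, m)$, and the toy similarity scores are assumptions for illustration.

```python
def wasserstein_1(xs, ys, N=None):
    """Discrete approximation of the 1-Wasserstein distance between two samples."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    if N is None:
        N = max(n, m)  # assumed default; any sufficiently large N can be used
    total = 0.0
    for i in range(1, N + 1):
        # Empirical quantile at p = i/N is the ceil(p*n)-th order statistic;
        # -(-a // b) is an exact integer ceiling of a / b.
        kx = -(-i * n // N) - 1
        ky = -(-i * m // N) - 1
        total += abs(xs[kx] - ys[ky])
    return total / N

# Toy similarity scores (hypothetical) for as-collected and contaminated samples.
as_collected = [0.1, 0.2, 0.3]
contaminated = [0.6, 0.7, 0.8]
print(wasserstein_1(as_collected, contaminated))
```

When the two samples have equal size and $N = n$, the sum reduces to the mean absolute difference between the sorted samples.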

Logistic Regression Model Setup.

To evaluate the effectiveness of the NormSim and CaPSim algorithms across various similarity metrics, we trained a logistic regression model for each one of the six distinct similarity metrics (A–F). The similarity values are calculated between each soil sample and each contaminant spectrum. Logistic regression was selected due to its simplicity and interpretability, allowing for a straightforward comparison of the performance between the two algorithms on different metrics.

AUROC Metric.

The performance of each logistic regression model was quantified using the AUROC, which measures the ability to distinguish between positive (contaminated) and negative (as-collected/reference) samples. An AUROC value of 1.0 indicates perfect classification, while a value of 0.5 indicates performance no better than random guessing. This metric was chosen because it evaluates classification performance across all possible decision thresholds, making it well suited for assessing detection sensitivity in this study. Although our logistic regression model outputs probabilities of contamination, which could be used for further analyses, we focus on AUROC as our primary evaluation metric: it is more robust than relying on raw probability values, which can be overconfident in logistic regression models, and it allows for flexible threshold selection based on specific application requirements.
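The AUROC can be computed without explicitly tracing the ROC curve, because it equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half (23). The Python sketch below uses this pairwise formulation; the function name and the toy scores are assumptions, and the quadratic loop is written for clarity rather than speed.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Toy predicted probabilities (hypothetical); label 1 = contaminated.
scores = [0.9, 0.4, 0.8, 0.3]
labels = [1, 1, 0, 0]
print(auroc(scores, labels))
```

Because only the ordering of the scores matters, the result is unchanged by any monotonic rescaling of the model outputs, which is why the metric is insensitive to overconfident probabilities.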

Train/Validation/Test Splits.

The dataset was randomly split into training, validation, and test sets using a 64:16:20 ratio. The training set was used to fit the logistic regression models, while the validation set was employed for hyperparameter tuning. The test set, which was unseen during the training and validation phases, was used to assess the final performance of each model.

Cross-Validation Strategy.

To ensure robustness and prevent overfitting, we used multiple train/validation/test splits with five random seeds. The reported AUROC values represent the average performance across these splits, with error bars indicating the SD, reflecting the model’s stability across different random initializations.
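A minimal Python sketch of the repeated 64:16:20 splitting and the mean/SD summary is given below. The seed values, function names, and per-seed AUROC values are hypothetical, and the use of the sample SD (rather than the population SD) is an assumption.

```python
import random
from statistics import mean, stdev

def split_indices(n_samples, seed, fractions=(0.64, 0.16, 0.20)):
    """Shuffle sample indices with a seeded generator and cut into three parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = round(fractions[0] * n_samples)
    n_val = round(fractions[1] * n_samples)
    # The test set takes whatever remains after training and validation.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def summarize(values):
    """Mean and sample SD of per-seed results (sample SD is an assumption)."""
    return mean(values), stdev(values)

for seed in range(5):  # five random seeds; the seed values are hypothetical
    train, val, test = split_indices(100, seed)
    print(seed, len(train), len(val), len(test))

# Hypothetical per-seed AUROC values, used only to show the summary step.
print(summarize([0.9, 0.8, 1.0]))
```

Fixing the seed makes each split reproducible, so every similarity metric can be evaluated on identical partitions.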

Hyperparameter Tuning.

We performed extensive hyperparameter tuning to maximize the performance of each similarity metric for both NormSim and CaPSim. For hyperparameters specific to the weighted dot product, we use $a \in$ {0, 0.25, 0.5, 0.75, 1, 1.25}, $t \in$ {−1, 0.02, 0.05, 0.1, 0.2, 0.5}, and $C \in$ {0.5, 1, 2, 4}, where $t = - 1$ refers to the case when we use the knee locator method. For hyperparameters specific to CaPE, we tune the distance threshold from {12, 18, 24} (corresponding to roughly 10, 15, and 20 cm⁻¹ peak shift), the number of peaks extracted from {5, 10, 15, 20}, the CaPE normalization method from {0–1 normalization, $l_{\infty}$-normalization}, the CaPE feature used from {peak height, peak prominence}, which corresponds to the peak feature extracted from learned peak locations, and whether to filter out peaks that are too thin (< ~0.5 cm⁻¹) or too wide (> ~1.25 cm⁻¹). We fixed the kernel type to uniform and kernel size to 5 (~0.4 cm⁻¹) as we found these values worked well in most cases. The detection criterion is set to intensity since we only have one recording for each contaminant. For hyperparameters specific to CaPSim, we explored whether to also consider the peak locations detected from the soil samples for peak extraction. For hyperparameters that apply to both NormSim and CaPSim, we tune the spectra normalization from {0–1 normalization, $l_{\infty}$-normalization, $l_{2}$-normalization, no normalization}. We also explored whether to remove all negative values from the spectra since negative values could still affect the analysis after baseline removal.
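As an example of how such a search space can be enumerated, the Python sketch below builds the grid for the three weighted dot product hyperparameters listed above and selects the configuration with the highest validation score. An exhaustive grid search is an assumption, as the text does not state the search strategy; the function names and the toy scoring function are likewise hypothetical stand-ins for a real validation AUROC.

```python
from itertools import product

# Values taken from the text; t = -1 denotes the knee locator method.
WEIGHTED_DOT_GRID = {
    "a": [0, 0.25, 0.5, 0.75, 1, 1.25],
    "t": [-1, 0.02, 0.05, 0.1, 0.2, 0.5],
    "C": [0.5, 1, 2, 4],
}

def iter_grid(grid):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def select_best(grid, validation_score):
    """Return the configuration that maximizes the validation score."""
    return max(iter_grid(grid), key=validation_score)

def toy_score(cfg):
    """Hypothetical score that peaks at a = 0.5, t = 0.1, C = 2."""
    return -abs(cfg["a"] - 0.5) - abs(cfg["t"] - 0.1) - abs(cfg["C"] - 2)

n_configs = sum(1 for _ in iter_grid(WEIGHTED_DOT_GRID))
best = select_best(WEIGHTED_DOT_GRID, toy_score)
print(n_configs, best)
```

The three lists alone yield 6 × 6 × 4 = 144 combinations, before the CaPE and normalization options are included.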

Further Insights into the Discussion of Fig. 4.

The separation between contaminated and as-collected sample distributions was quantified using the WD, also known as the Earth Mover’s Distance, which measures the dissimilarity between probability distributions. This distance represents the minimum “cost” required to transform one distribution into another, with larger values indicating greater separation between the distributions.

While larger WDs generally indicate enhanced detection capability, it is important to note that no universal threshold guarantees reliable detection, as the required separation varies based on the specific confidence requirements of each application. Furthermore, the WD serves as a complementary metric rather than a substitute for conventional classification metrics such as AUROC, as distributions can have overlapping regions even when they exhibit large WDs. The AUROC is a standard metric for assessing a classification model’s discriminative ability across varying thresholds. Therefore, employing a combined approach using multiple evaluation metrics provides a more comprehensive understanding of detection performance. The significant difference in WDs between CaPSim and NormSim (8.1 times greater for pyrene and 3.5 times greater for anthracene) quantitatively demonstrates how focusing on characteristic peaks rather than full spectral analysis improves the distinction between PAH-contaminated and uncontaminated soil samples.

The consistent performance across different similarity metrics, including Pearson’s correlation, cosine similarity, and Euclidean distance, further demonstrates the robustness of the CaPSim approach. This consistency, combined with the small error bars, indicates that the DFT-CaPSim methodology is not dependent on specific mathematical formulations of similarity but rather captures fundamental spectral relationships between reference compounds and environmental samples.

Supplementary Material

Appendix 01 (PDF)

pnas.2427069122.sapp.pdf (1.1 MB, PDF)

Acknowledgments

This work was financially supported by the National Institute of Environmental Health Sciences of the NIH (P42ES027725-01), the Welch Foundation Grants C-1220 (N.J.H.) and C-1222 (P.N.), and the Carl and Lillian Illig Fellowship (Smalley-Curl Institute, H20398-239440).

Author contributions

Y.J., O.N., P.N., T.P.S., P.J.J.A., A.P., and N.J.H. designed research; Y.J., O.N., S.B.D., P.J., and A.B.S.-A. performed research; S.B.D. contributed new reagents/analytic tools; Y.J., O.N., P.J., and A.B.S.-A. analyzed data; and Y.J., O.N., P.N., T.P.S., P.J.J.A., A.P., and N.J.H. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

Reviewers: C.M., Arizona State University; T.W.O., Northwestern University; and S.S., Oregon State University.

Contributor Information

Ankit Patel, Email: abp4@rice.edu.

Naomi J. Halas, Email: halas@rice.edu.

Data, Materials, and Software Availability

All study data are included in the article and/or SI Appendix.

References

1. Masoom H., et al., Soil organic matter in its native state: Unravelling the most complex biomaterial on earth. Environ. Sci. Technol. 50, 1670–1680 (2016).
2. Koshlaf E., Ball A. S., Soil bioremediation approaches for petroleum hydrocarbon polluted environments. AIMS Microbiol. 3, 25–49 (2017).
3. Langenbach T., Persistence and Bioaccumulation of Persistent Organic Pollutants (POPs) (InTech, 2013).
4. Manzano C., Hoh E., Simonich S. L. M., Quantification of complex polycyclic aromatic hydrocarbon mixtures in standard reference materials using comprehensive two-dimensional gas chromatography with time-of-flight mass spectrometry. J. Chromatogr. A 1307, 172–179 (2013).
5. Durant J. L., Busby W. F., Lafleur A. L., Penman B. W., Crespi C. L., Human cell mutagenicity of oxygenated, nitrated and unsubstituted polycyclic aromatic hydrocarbons associated with urban aerosols. Mutat. Res. Genet. Toxicol. Environ. Mutagen 371, 123–157 (1996).
6. Dipple A., Polycyclic aromatic hydrocarbon carcinogenesis—An introduction. ACS Symp. Ser. 283, 1–17 (1985).
7. Denison S. B., Da Silva P. D., Koester C. P., Alvarez P. J. J., Zygourakis K., Clays play a catalytic role in pyrolytic treatment of crude-oil contaminated soils that is enhanced by ion-exchanged transition metals. J. Hazard. Mater. 437, 129295 (2022).
8. Denison S. B., et al., Pyro-catalytic degradation of pyrene by bentonite-supported transition metals: Mechanistic insights and trade-offs with low pyrolysis temperature. Environ. Sci. Technol. 57, 14373–14383 (2023).
9. Denison S. B., Jin P. X., Zygourakis K., Senftle T. P., Alvarez P. J. J., Mechanistic implications of the varying susceptibility of PAHs to pyro-catalytic treatment as a function of their ionization potential and hydrophobicity. Environ. Sci. Technol. 58, 13521–13528 (2024).
10. Kneipp K., et al., Single molecule detection using surface-enhanced Raman scattering (SERS). Phys. Rev. Lett. 78, 1667–1670 (1997).
11. Langer J., et al., Present and future of surface-enhanced Raman scattering. ACS Nano 14, 28–117 (2020).
12. Escher B. I., Stapleton H. M., Schymanski E. L., Tracking complex mixtures of chemicals in our changing environment. Science 367, 388 (2020).
13. Lussier F., Thibault V., Charron B., Wallace G. Q., Masson J. F., Deep learning and artificial intelligence methods for Raman and surface-enhanced Raman scattering. Trends Anal. Chem. 124, 115796 (2020).
14. Andersson J. T., Achten C., Time to say goodbye to the 16 EPA PAHs? Toward an up-to-date use of PACs for environmental purposes. Polycyclic Aromat. Compd. 35, 330–354 (2015).
15. Bajomo M. M., et al., Computational chromatography: A machine learning strategy for demixing individual chemical components in complex mixtures. Proc. Natl. Acad. Sci. U.S.A. 119, e2211406119 (2022).
16. Ju Y. L., et al., Identifying surface-enhanced Raman spectra with a Raman library using machine learning. ACS Nano 17, 21251–21261 (2023).
17. Achten C., Andersson J. T., Overview of polycyclic aromatic compounds (PAC). Polycyclic Aromat. Compd. 35, 177–186 (2015).
18. Haleyur N., et al., Comparison of rapid solvent extraction systems for the GC–MS/MS characterization of polycyclic aromatic hydrocarbons in aged, contaminated soil. MethodsX 3, 364–370 (2016).
19. Primbs T., Genualdi S., Simonich S. M., Solvent selection for pressurized liquid extraction of polymeric sorbents used in air sampling. Environ. Toxicol. Chem. 27, 1267–1272 (2008).
20. Le F., et al., Metallic nanoparticle arrays: A common substrate for both surface-enhanced Raman scattering and surface-enhanced infrared absorption. ACS Nano 2, 707–718 (2008).
21. Neumann O., et al., Surface-enhanced Raman spectroscopy: From the few-analyte limit to hot-spot saturation. J. Phys. Chem. C 128, 8649–8659 (2024).
22. Zhang Z. M., Chen S., Liang Y. Z., Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst 135, 1138–1146 (2010).
23. Hanley J. A., McNeil B. J., The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).
24. Sánchez-Alvarado A. B., et al., Combined surface-enhanced Raman and infrared absorption spectroscopies for streamlined chemical detection of polycyclic aromatic hydrocarbon-derived compounds. ACS Nano 17, 25697–25706 (2023).
25. Oldenburg S. J., Averitt R. D., Westcott S. L., Halas N. J., Nanoengineering of optical resonances. Chem. Phys. Lett. 288, 243–247 (1998).
26. USEPA, Method 8270D. Semivolatile Organic Compounds by Gas Chromatography/Mass Spectrometry (2014). https://archive.epa.gov/epa/sites/production/files/2015-12/documents/8270d.pdf. Accessed 19 January 2025.
27. USEPA, Method 8275A. Semivolatile organic compounds (PAHs and PCBs) in soils, sludges, and solid wastes using thermal extraction/gas chromatography/mass spectrometry (TE/GC/MS) (1996). https://www.epa.gov/sites/default/files/2015-12/documents/8275a.pdf. Accessed 19 January 2025.
28. Thermo Fisher Scientific, Application Note 313. Extraction of PAHs from environmental samples by accelerated solvent extraction (ASE). Meets the requirements of U.S. EPA Method 3545. Dionex. https://tools.thermofisher.com/content/sfs/brochures/AN-313-Extraction-PAH-Environmental-Sampes-LPN0632.pdf. Accessed 27 January 2025.
29. Frisch M. E., et al., Gaussian 16 (Gaussian Inc., Wallingford, CT, 2016).
30. Dennington R., Keith T. A., Millam J. M., GaussView 6.0.16 (Semichem Inc., Shawnee Mission, KS, 2016), pp. 143–150.
31. Stephens P. J., Devlin F. J., Chabalowski C. F., Frisch M. J., Ab initio calculation of vibrational absorption and circular dichroism spectra using density functional force fields. J. Phys. Chem. C 98, 11623–11627 (1994).
32. Hehre W. J., Ditchfield R., Pople J. A., Self-consistent molecular orbital methods. 12. Further extensions of Gaussian-type basis sets for use in molecular orbital studies of organic molecules. J. Chem. Phys. 56, 2257 (1972).
33. Whitaker D. A., Hayes K., A simple algorithm for despiking Raman spectra. Chemometr. Intell. Lab. Syst. 179, 82–84 (2018).
34. Savitzky A., Golay M. J. E., Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36, 1627 (1964).
35. Haque A., Feature engineering & selection for explainable models: A second course for data scientists (2022). https://www.lulu.com. Accessed 15 November 2024.
36. Abell J. L., Lee J., Zhao Q., Szu H., Zhao Y. P., Differentiating intrinsic SERS spectra from a mixture by sampling induced composition gradient and independent component analysis. Analyst 137, 73–76 (2012).
37. Hou Y., et al., The state-of-the-art review on applications of intrusive sensing, image processing techniques, and machine learning methods in pavement monitoring and analysis. Engineering 7, 845–856 (2021).
38. Samuel A. Z., et al., On selecting a suitable spectral matching method for automated analytical applications of Raman spectroscopy. ACS Omega 6, 2060–2065 (2021).
39. Huang Y. F., Tang Z. R., Chen D., Su K. X., Chen C. B., Batching soft IoU for training semantic segmentation networks. IEEE Signal Process. Lett. 27, 66–70 (2020).
40. Satopaa V., Albrecht J., Irwin D., Raghavan B., “Finding a “Kneedle” in a haystack: Detecting knee points in system behavior” in 31st International Conference on Distributed Computing Systems Workshops (Minneapolis, MN, 2011), pp. 166–171.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Appendix 01 (PDF)

pnas.2427069122.sapp.pdf^{(1.1MB, pdf)}

Data Availability Statement

All study data are included in the article and/or SI Appendix.

[r1] 1.Masoom H., et al. , Soil organic matter in its native state: Unravelling the most complex biomaterial on earth. Environ. Sci. Technol. 50, 1670–1680 (2016). [DOI] [PubMed] [Google Scholar]

[r2] 2.Koshlaf E., Ball A. S., Soil bioremediation approaches for petroleum hydrocarbon polluted environments. AIMS Microbiol. 3, 25–49 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r3] 3.Langenbach T., Persistence and Bioaccumulation of Persistent Organic Pollutants (POPs) (InTech, 2013). [Google Scholar]

[r4] 4.Manzano C., Hoh E., Simonich S. L. M., Quantification of complex polycyclic aromatic hydrocarbon mixtures in standard reference materials using comprehensive two-dimensional gas chromatography with time-of-flight mass spectrometry. J. Chromatogr. A 1307, 172–179 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r5] 5.Durant J. L., Busby W. F., Lafleur A. L., Penman B. W., Crespi C. L., Human cell mutagenicity of oxygenated, nitrated and unsubstituted polycyclic aromatic hydrocarbons associated with urban aerosols. Mutat. Res. Genet. Toxicol. Environ. Mutagen 371, 123–157 (1996). [DOI] [PubMed] [Google Scholar]

[r6] 6.Dipple A., Polycyclic aromatic hydrocarbon carcinogenesis—An introduction. ACS Symp. Ser. 283, 1–17 (1985). [Google Scholar]

[r7] 7.Denison S. B., Da Silva P. D., Koester C. P., Alvarez P. J. J., Zygourakis K., Clays play a catalytic role in pyrolytic treatment of crude-oil contaminated soils that is enhanced by ion-exchanged transition metals. J. Hazard. Mater. 437, 129295 (2022). [DOI] [PubMed] [Google Scholar]

[r8] 8.Denison S. B., et al. , Pyro-catalytic degradation of pyrene by bentonite-supported transition metals: Mechanistic insights and trade-offs with low pyrolysis temperature. Environ. Sci. Technol. 57, 14373–14383 (2023). [DOI] [PubMed] [Google Scholar]

[r9] 9.Denison S. B., Jin P. X., Zygourakis K., Senftle T. P., Alvarez P. J. J., Mechanistic implications of the varying susceptibility of PAHs to pyro-catalytic treatment as a function of their ionization potential and hydrophobicity. Environ. Sci. Technol. 58, 13521–13528 (2024). [DOI] [PubMed] [Google Scholar]

[r10] 10.Kneipp K., et al. , Single molecule detection using surface-enhanced Raman scattering (SERS). Phys. Rev. Lett. 78, 1667–1670 (1997). [Google Scholar]

[r11] 11.Langer J., et al. , Present and future of surface-enhanced Raman scattering. ACS Nano 14, 28–117 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r12] 12.Escher B. I., Stapleton H. M., Schymanski E. L., Tracking complex mixtures of chemicals in our changing environment. Science 367, 388 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r13] 13.Lussier F., Thibault V., Charron B., Wallace G. Q., Masson J. F., Deep learning and artificial intelligence methods for Raman and surface-enhanced Raman scattering. Trends Anal. Chem. 124, 115796 (2020). [Google Scholar]

[r14] 14.Andersson J. T., Achten C., Time to say goodbye to the 16 EPA PAHs? Toward an up-to-date use of PACs for environmental purposes. Polycyclic Aromat. Compd. 35, 330–354 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r15] 15.Bajomo M. M., et al. , Computational chromatography: A machine learning strategy for demixing individual chemical components in complex mixtures. Proc. Natl. Acad. Sci. U.S.A. 119, e2211406119 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r16] 16.Ju Y. L., et al. , Identifying surface-enhanced Raman spectra with a Raman library using machine learning. ACS Nano 17, 21251–21261 (2023). [DOI] [PubMed] [Google Scholar]

[r17] 17.Achten C., Andersson J. T., Overview of polycyclic aromatic compounds (PAC). Polycyclic Aromat. Compd. 35, 177–186 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r18] 18.Haleyur N., et al. , Comparison of rapid solvent extraction systems for the GC–MS/MS characterization of polycyclic aromatic hydrocarbons in aged, contaminated soil. MethodsX 3, 364–370 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r19] 19.Primbs T., Genualdi S., Simonich S. M., Solvent selection for pressurized liquid extraction of polymeric sorbents used in air sampling. Environ. Toxicol. Chem. 27, 1267–1272 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r20] 20.Le F., et al. , Metallic nanoparticle arrays: A common substrate for both surface-enhanced Raman scattering and surface-enhanced infrared absorption. ACS Nano 2, 707–718 (2008). [DOI] [PubMed] [Google Scholar]

[r21] 21.Neumann O., et al. , Surface-enhanced Raman spectroscopy: From the few-analyte limit to hot-spot saturation. J. Phys. Chem. C 128, 8649–8659 (2024). [Google Scholar]

[r22] 22.Zhang Z. M., Chen S., Liang Y. Z., Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst 135, 1138–1146 (2010). [DOI] [PubMed] [Google Scholar]

[r23] 23. Hanley J. A., McNeil B. J., The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).

[r24] 24. Sánchez-Alvarado A. B., et al., Combined surface-enhanced Raman and infrared absorption spectroscopies for streamlined chemical detection of polycyclic aromatic hydrocarbon-derived compounds. ACS Nano 17, 25697–25706 (2023).

[r25] 25. Oldenburg S. J., Averitt R. D., Westcott S. L., Halas N. J., Nanoengineering of optical resonances. Chem. Phys. Lett. 288, 243–247 (1998).

[r26] 26. USEPA, Method 8270D. Semivolatile organic compounds by gas chromatography/mass spectrometry (2014). https://archive.epa.gov/epa/sites/production/files/2015-12/documents/8270d.pdf. Accessed 19 January 2025.

[r27] 27. USEPA, Method 8275A. Semivolatile organic compounds (PAHs and PCBs) in soils, sludges, and solid wastes using thermal extraction/gas chromatography/mass spectrometry (TE/GC/MS) (1996). https://www.epa.gov/sites/default/files/2015-12/documents/8275a.pdf. Accessed 19 January 2025.

[r28] 28. Thermo Fisher Scientific, Application Note 313. Extraction of PAHs from environmental samples by accelerated solvent extraction (ASE): Meets the requirements of U.S. EPA Method 3545 (Dionex). https://tools.thermofisher.com/content/sfs/brochures/AN-313-Extraction-PAH-Environmental-Sampes-LPN0632.pdf. Accessed 27 January 2025.

[r29] 29. Frisch M. J., et al., Gaussian 16 (Gaussian Inc., Wallingford, CT, 2016).

[r30] 30. Dennington R., Keith T. A., Millam J. M., GaussView 6.0.16 (Semichem Inc., Shawnee Mission, KS, 2016), pp. 143–150.

[r31] 31. Stephens P. J., Devlin F. J., Chabalowski C. F., Frisch M. J., Ab initio calculation of vibrational absorption and circular dichroism spectra using density functional force fields. J. Phys. Chem. 98, 11623–11627 (1994).

[r32] 32. Hehre W. J., Ditchfield R., Pople J. A., Self-consistent molecular orbital methods. 12. Further extensions of Gaussian-type basis sets for use in molecular orbital studies of organic molecules. J. Chem. Phys. 56, 2257 (1972).

[r33] 33. Whitaker D. A., Hayes K., A simple algorithm for despiking Raman spectra. Chemometr. Intell. Lab. Syst. 179, 82–84 (2018).

[r34] 34. Savitzky A., Golay M. J. E., Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36, 1627 (1964).

[r35] 35. Haque A., Feature engineering & selection for explainable models: A second course for data scientists (2022). https://www.lulu.com. Accessed 15 November 2024.

[r36] 36. Abell J. L., Lee J., Zhao Q., Szu H., Zhao Y. P., Differentiating intrinsic SERS spectra from a mixture by sampling induced composition gradient and independent component analysis. Analyst 137, 73–76 (2012).

[r37] 37. Hou Y., et al., The state-of-the-art review on applications of intrusive sensing, image processing techniques, and machine learning methods in pavement monitoring and analysis. Engineering 7, 845–856 (2021).

[r38] 38. Samuel A. Z., et al., On selecting a suitable spectral matching method for automated analytical applications of Raman spectroscopy. ACS Omega 6, 2060–2065 (2021).

[r39] 39. Huang Y. F., Tang Z. R., Chen D., Su K. X., Chen C. B., Batching soft IoU for training semantic segmentation networks. IEEE Signal Process. Lett. 27, 66–70 (2020).

[r40] 40. Satopaa V., Albrecht J., Irwin D., Raghavan B., “Finding a ‘Kneedle’ in a haystack: Detecting knee points in system behavior” in 31st International Conference on Distributed Computing Systems Workshops (Minneapolis, MN, 2011), pp. 166–171.
