Abstract
Wildfire smoke exposures are increasingly common, consisting of complex mixtures of gases and particulates known to cause diverse pulmonary health effects. While health outcomes are regularly studied, quantitative links between smoke chemical composition and toxicological outcomes remain poorly defined, limiting interpretation of wildfire smoke health risks. This study explores symbolic regression (SR) as an interpretable artificial intelligence/machine learning method to generate closed-form mathematical models linking chemical exposure to biological responses relevant to wildfire smoke. Prior to application on wildfire-relevant data sets, we benchmarked three Python-based SR packages on simulated data, assessing performance across varying noise levels and operator complexities. Insights from these simulation tests, such as the importance of including necessary operators, were incorporated when applying SR to lab-generated wildland fire exposure-toxicity data. This data set included chemical characterizations of biomass smoke exposures and corresponding pulmonary responses in female CD-1 mice (n = 60). Specifically, we evaluated the ability to predict a lung injury marker using (1) targeted measures of over 80 chemicals measured in smoke (RMSE = 17.57 mg/mL) and (2) lung tissue measures of hundreds of transcripts (RMSE = 15.12 mg/mL). Resulting error metrics were comparable to Random Forest and XGBoost models. To aid model interpretation, we developed directional ensemble contribution scores (DECS), a novel feature importance scoring method that quantifies the direction and magnitude of predictor contributions across top-performing models. Expert toxicologists also contributed to model prioritization, integrating a “biologists-in-the-loop” approach. Results highlighted polycyclic aromatic hydrocarbons as drivers of lung injury and methoxyphenols as suppressors. 
Transcriptomic analyses highlighted a small set of genes, which have roles in metabolism, cell proliferation, immune regulation, and oncogenic processes, with MYC proto-oncogene (Myc) showing the strongest association. Overall, this study demonstrates SR and associated DECS as practical, interpretable tools for modeling environmental mixtures, such as wildfire smoke, and their toxicological effects.


Introduction
Wildfire events are increasing in prevalence, size, and intensity worldwide. Wildfires emit complex mixtures of toxic gases and particulate matter, with highly variable compositions dependent upon fuel sources, combustion conditions, and atmospheric aging. Known chemicals present in wildfire smoke include alkanes, alkenes, methoxyphenols, polycyclic aromatic hydrocarbons (PAHs), metals, and ionic constituents. Exposures to wildfire smoke have been associated with many negative pulmonary outcomes, such as increased risk of asthma, chronic obstructive pulmonary disease, infection (e.g., SARS-CoV-2), and interstitial lung disease. Due to the high variation in exposure conditions and health effects from such complex exposures, wildfire research is presently in need of improved data infrastructure and analysis methods to parse and quantify exposure chemistries and biological disruptions.
Previous studies investigated complex smoke emissions released through lab-based simulations of wildfire events and successfully evaluated their resulting chemical compositions, toxicological outcomes, and omic response profiles. A large study was designed by our team to specifically compare in vivo pulmonary responses across 10 different biomass smoke exposure conditions in mice. This study, initially published in 2018, presented chemical composition data in conjunction with lung toxicity marker data across each exposure condition. Since this initial publication, we contributed additional data streams including mouse lung tissue transcriptomic signatures, as well as a statistical-based analysis of chemical mixtures related to lung toxicity outcomes. This wildfire database, as it currently stands, can be leveraged in combination with artificial intelligence/machine learning (AI/ML) models to further understand relationships among complex chemistries in smoke and their resulting impacts on biological responses.
Interpretable AI/ML models are increasingly important in toxicology, particularly for building confidence in predictions and supporting translation to human health protection applications. While popular AI/ML approaches such as neural networks, random forests, and support vector machines have been shown to accurately capture patterns in high-dimensional and complex data in nearly every scientific field, these models operate as black boxes where domain knowledge is not included, and consequently often lack transparency and accountability. Conversely, statistical methods offer transparency but are often limited by predictive accuracy when applied to complex, nonlinear biological data. Although mechanistic modeling remains the gold standard for providing deep biological understanding and causal insights, its complexity and data requirements often make it difficult to implement at scale, necessitating the exploration of complementary modeling approaches. To this end, interpretable and explainable AI/ML are active areas of research, with emphasis on both transparent model structures and the use of biologically meaningful, interpretable input features to support scientific and regulatory decision-making. By fostering a synergistic relationship between advanced AI/ML and expert-driven insights, such approaches can enhance model trustworthiness and facilitate more informed, data-driven decision-making in toxicological studies.
Symbolic regression (SR) is an interpretable AI/ML method that is currently underutilized in toxicological studies. Leveraging an evolutionary algorithm approach, SR constructs expression trees that compete against each other within a specified functional space to identify closed-form equations that describe the input data. The output of SR is a suite of mathematical models with accompanying error metrics and a measure of each model’s complexity to help promote parsimony and interpretability. With this approach, a hierarchy (or “hall-of-fame” [HOF]) of all considered models is produced (Figure 1). A key advantage of SR is that the user can incorporate domain knowledge and other criteria such as reasonability into the final model selection process alongside performance, an approach we are calling “biologist-in-the-loop” selection. Although SR and conceptually related approaches have been explored for decades, these methods have been implemented to only a limited extent in the health research field, for instance, in the analysis of gene expression in relation to insulin response in obese and never-obese women and treatment of migraine attacks. SR has also been used in systems biology to model drug absorption and glucose–insulin interactions. To our knowledge, this method has not been used to model exposure to environmental chemical mixtures.
Figure 1. Overview of symbolic regression (SR). As an input, SR takes a 2D matrix of independent variables (e.g., chemical molarities) and their corresponding dependent variable (e.g., total protein, TP). From this data set, SR generates mathematical expressions represented as tree structures, where nodes denote arithmetic operations and leaves correspond to variables. These expressions compete and evolve to find optimal equations that describe the data, ultimately forming a hall-of-fame (HOF) of top equations ranked by error and complexity (i.e., a measure of how many operators, variables, and constants were used in the resulting expression tree).
Recognizing the increasing impacts of wildfire smoke exposures on human health, this study evaluates SR as an approach for parsing relationships between chemicals, gene expression, and toxicological outcomes. Specifically, this paper has three goals: (1) establish acceptable practices for implementing SR by evaluating three Python-based SR algorithms on simulated data with a well-defined structure; (2) use data sets previously published from female CD-1 mice exposed to variable biomass burn scenarios to identify and quantify chemical drivers of toxicological responses; and (3) investigate these data sets to identify and quantify genes contributing to toxicity through altered transcriptional expression. To aid in model interpretation, we introduce a novel feature importance scoring framework for SR. With this approach, individual features are assigned a directional ensemble contribution score (DECS), calculated using an ensemble of top-performing models to identify the direction and magnitude of each predictor’s contribution to the outcome. This work provides a reproducible framework for applying SR to toxicological data while offering insights into the chemical and transcriptomic drivers of toxicity resulting from wildfire-relevant smoke exposure.
Methods
Overview of Data and Study Organization
Data used in this study were generated by lab-based simulations of biomass sources and combustion conditions encountered during wildfire events. To summarize, 10 distinct biomass burn conditions were produced through the burning of common biomass materials (eucalyptus, peat, pine, pine needles, and red oak) separately under flaming and smoldering combustion conditions in a quartz tube furnace system. Smoke emitted from these conditions was collected and stored as biomass smoke condensates and evaluated for chemical composition using gas chromatography–mass spectrometry (GC-MS), with 86 chemicals quantified for downstream analysis. The condensates were administered via oropharyngeal aspiration to female CD-1 mice for toxicity testing, with six mice per exposure group (n = 60 total). Biological samples were then collected, including bronchoalveolar lavage fluid (BALF) for pulmonary toxicity marker analysis (e.g., total protein) and lung tissue for transcriptomic profiling. A more thorough description of the experimental methods and data set is provided in Kim et al.
The current study builds upon this research by applying SR to build mathematical models using these data to further describe relationships between (1) chemicals and lung toxicity and (2) genes and lung toxicity, using toxicological end points measured 4 hours post-exposure. Prior to building these SR models, simulation testing was carried out to compare different SR packages currently available and optimize model test parameters (referred to as “Part One” in the methods/results). Then, the best-performing package and parameters were applied to the chemical signature and lung toxicity data (referred to as “Part Two”), and finally, the transcriptomic signature and lung toxicity data (referred to as “Part Three”).
Overview of SR Methodologies
SR generates analytic equations of the general form
$$y = f(X, \beta) \tag{1}$$
where $y$ is the dependent variable, $X = \{X_1, X_2, \ldots, X_n\}$ are the independent variables, and $\beta = \{\beta_1, \beta_2, \ldots, \beta_n\}$ are coefficients. Utilizing an evolutionary algorithm approach, SR constructs expression trees that represent equations that compete against each other based on an error and complexity metric to explore the allowable function space. This process yields a ranked hierarchy, or “hall-of-fame” (HOF), of top-performing models. The functional space is restricted by the specification of allowable operators (e.g., addition, subtraction, and exponentiation) and whether constants are allowed. In this analysis, we compare three Python-based SR packages: gplearn, PySR, and feyn, which follow similar evolutionary principles but differ in their optimization strategies, described as follows:
1. gplearn (v0.4.2) employs a classical genetic algorithm, where models are represented as tree structures. Evolution occurs through crossover (swapping subtrees) and mutation (random modifications), refining models over successive generations.
2. PySR (v0.19.4) implements a multipopulation evolutionary approach, evolving multiple populations asynchronously. It uses tournament selection to choose and mutate models, replacing the oldest models in a cycle.
3. feyn (v3.4.0) utilizes the QLattice algorithm, blending evolutionary strategies with neural networks. Instead of pure mutation, it refines a probability distribution of inputs and operators using gradient descent, guiding the search toward optimal models.
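To make the expression-tree representation concrete, the following minimal sketch shows how a candidate equation could be stored as a tree, evaluated on one sample, and scored for complexity by counting operators, variables, and constants. The tuple encoding, the `OPS` table, and the variable names are illustrative assumptions and do not reflect the internal data structures of gplearn, PySR, or feyn.

```python
import operator

# Hypothetical operator table; each SR package defines its own internally.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(node, row):
    """Evaluate a tree whose nodes are constants, variable names, or (op, left, right)."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left, row), evaluate(right, row))
    if isinstance(node, str):
        return row[node]          # leaf: variable lookup
    return node                   # leaf: numeric constant

def complexity(node):
    """Count every operator, variable, and constant in the tree."""
    if isinstance(node, tuple):
        return 1 + complexity(node[1]) + complexity(node[2])
    return 1

# Tree for the expression 2.0 * x1 + x2 / x3
tree = ("+", ("*", 2.0, "x1"), ("/", "x2", "x3"))
row = {"x1": 1.5, "x2": 4.0, "x3": 2.0}

value = evaluate(tree, row)
size = complexity(tree)
```

In an evolutionary search, crossover would swap subtrees between two such trees and mutation would replace a node, with error and `complexity` jointly deciding which trees survive.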
Part One: Simulation Testing to Verify Model Validity
To assess the accuracy of SR models under controlled conditions, we first tested the three packages on simulated data sets with varying levels of background noise. This allowed us to evaluate model performance and identify key parameters influencing accuracy, guiding application toward our wildfire-relevant data set and building confidence in these methods (Figure 2A).
Figure 2. Overview of experimental design. (A) To assess the performance of various Python-based symbolic regression (SR) packages, data were simulated for 15 random variables derived from biologically relevant statistical distributions. These variables, combined with normally distributed noise, were used to generate 10 SR input data sets. Each data set was analyzed under three levels of allowed mathematical operators, resulting in a total of 30 tests conducted using the SR packages gplearn, PySR, and feyn. (B) To predict TP, a marker of lung injury, SR was applied to 86 chemical concentrations, an elastic net-reduced data set, and a principal component analysis (PCA)-reduced data set. Individual chemicals were ranked using a novel variable importance scoring metric, and the final model was selected with expert biologist input. (C) SR was applied to predict TP using 371 differentially expressed genes identified by Koval et al., as well as Lasso- and PCA-reduced data sets. Novel variable importance scoring was applied, followed by biologist-in-the-loop selection.
Simulation of Correlated Data
Simulated measurements for 15 variables were generated across 50 samples, paralleling data dimensions implemented in Parts Two and Three below, using R (v4.3.2). Relationships between variables were enforced via a correlation matrix (r = |0.2–0.5|), converted into a positive-definite covariance matrix using nearPD (Matrix v1.6–5), and sampled from a multivariate normal distribution using mvrnorm (MASS v7.3–6). To mimic real-world variability, individual variable distributions were transformed to uniform, normal, beta, or log-normal distributions with varied location, scale, and shape parameters.
Following simulation of variable measurements, a response variable was added to the matrix using the following equation:
| 2 |
Equation 2 was developed to reflect the types of relationships anticipated in our downstream chemical and transcriptomic analyses. Specifically, it incorporates variables drawn from distinct distributions and combines them using multiple mathematical operations introducing moderate nonlinearity and complexity. To assess robustness, additional response variables were generated by adding normally distributed noise (mean = 0, standard deviation = 0–2, increasing by 0.25 for each subsequent response).
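The simulation steps above were carried out in R; the sketch below is a rough NumPy analog provided only for illustration. The eigenvalue-clipping function is a crude stand-in for `nearPD`, only one marginal is transformed, and the response formula is a hypothetical placeholder rather than eq 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars = 50, 15

# Target correlation matrix with off-diagonal magnitudes between 0.2 and 0.5
signs = rng.choice([-1.0, 1.0], size=(n_vars, n_vars))
mags = rng.uniform(0.2, 0.5, size=(n_vars, n_vars))
upper = np.triu(signs * mags, k=1)
target = upper + upper.T + np.eye(n_vars)

def nearest_pd(mat, eps=1e-6):
    """Clip negative eigenvalues and rescale to unit diagonal (simplified nearPD analog)."""
    vals, vecs = np.linalg.eigh(mat)
    fixed = (vecs * np.clip(vals, eps, None)) @ vecs.T
    d = np.sqrt(np.diag(fixed))
    return fixed / np.outer(d, d)

corr = nearest_pd(target)
Z = rng.multivariate_normal(np.zeros(n_vars), corr, size=n_samples)

# Transform one marginal as an example (normal -> log-normal)
X = Z.copy()
X[:, 1] = np.exp(Z[:, 1])

# Hypothetical response, NOT the study's eq 2
y_clean = 3.0 * X[:, 0] * X[:, 1] + X[:, 2]

# Noise standard deviations 0, 0.25, ..., 2.0
noise_sds = np.arange(0.0, 2.25, 0.25)
responses = {float(sd): y_clean + rng.normal(0.0, sd, size=n_samples)
             for sd in noise_sds}
```

Each entry of `responses` pairs a noise level with one response vector, so the same predictor matrix can be reused across all noise conditions.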
Implementation of Python-Based SR Packages on Simulated Data
Each SR package was evaluated based on its ability to identify the true relationship (eq 2) from the simulated data. For these tests, both the input data and the operator space complexity were varied. For the input space, this included inputting a matrix with either just the variables used in the response equation, all simulated variables excluding noise, or all simulated variables with various levels of noise. For the operator complexity, three levels were used:
1. Low: an overly simple operator space, which did not include all necessary operators (+, −)
2. Medium: a “just right” operator space, which included just the necessary operators (+, −, *, /)
3. High: an overly complex operator space, which included unnecessary operators (+, −, *, /, log, sqrt)
The specifics of this design are outlined in Figure 2A. Each package was implemented in Python (v3.12.4), and a total of 500 iterations were used for each test. Package-specific parameters are included in Table 1.
Table 1. Package-Specific Parameters Used during SR
| package | complexity control | loss function |
|---|---|---|
| gplearn | parsimony_coefficient = 0.25; p_hoist_mutation = 0.05; max_samples = 0.3 | root mean squared error (RMSE) |
| PySR | default settings | RMSE |
| feyn | max_complexity = 10 | absolute error during optimization, RMSE post hoc |
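The test design described above can be enumerated programmatically. In this sketch, the breakdown of the 10 input data sets (one with only the equation variables, one with all variables and no noise, and eight with noise standard deviations from 0.25 to 2.0) is our reading of the description, and the labels and dictionary names are hypothetical.

```python
from itertools import product

# Operator spaces as described for the low, medium, and high levels
OPERATOR_SETS = {
    "low": ["+", "-"],
    "medium": ["+", "-", "*", "/"],
    "high": ["+", "-", "*", "/", "log", "sqrt"],
}

# Assumed breakdown of the 10 input data sets (labels are illustrative)
noise_sds = [0.25 * k for k in range(1, 9)]          # 0.25, 0.5, ..., 2.0
INPUTS = (["equation_variables_only", "all_variables_no_noise"]
          + [f"all_variables_noise_sd_{sd:g}" for sd in noise_sds])

# One test per (input data set, operator level) pair
tests = list(product(INPUTS, OPERATOR_SETS))
```

Each of the three packages would then be run once per entry in `tests`, using the operator list looked up from `OPERATOR_SETS`.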
Metrics for Assessing Performance on Simulated Data
Results were evaluated based on RMSE and the accuracy of the top equation for each package-input combination. While the ideal outcome was to recover eq 2 exactly, we wanted to also highlight cases where the output equation was partially correct. To achieve this, each output equation was assigned one of four accuracy levels:
1. Level 1 (correct): matches eq 2 exactly
2. Level 2 (partially correct): includes only variables in eq 2 and includes at least one correct variable relationship (e.g., X2 * X6 or X10/X13)
3. Level 3 (incorrect without extraneous variables): contains only variables in eq 2 but has incorrect variable relationships
4. Level 4 (incorrect with extraneous variables): includes variables not in eq 2
In cases where the RMSE was nonzero only due to an incorrect leading coefficient, the equation was still counted as correct if the coefficient was within 1/100th of the actual value (e.g., 2.99 instead of 3). These analyses were conducted on a Windows 11 Pro system (Version 24H2, Build 26100.3194), equipped with an AMD Ryzen 9 7950X 16-core processor (4.50 GHz) and 64 GB of RAM (two 32 GB modules at 6000 MHz), and the “time” function was used to keep track of the length of individual runs.
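The four-level rubric can be expressed as a small decision function. This is a sketch only: the function names and arguments are hypothetical, the count of correct relationships is assumed to be supplied by a human reviewer, and the coefficient check reads "within 1/100th of the actual value" as a 1% relative tolerance, which is one possible interpretation.

```python
def accuracy_level(candidate_vars, true_vars, n_correct_relationships, is_exact):
    """Assign accuracy level 1-4 to a candidate equation."""
    candidate_vars, true_vars = set(candidate_vars), set(true_vars)
    # Level 4: extraneous variables, or no variables at all (just a constant)
    if not candidate_vars or candidate_vars - true_vars:
        return 4
    if is_exact:
        return 1
    if n_correct_relationships >= 1:
        return 2
    return 3

def coefficient_matches(estimate, truth, rel_tol=0.01):
    """Treat a leading coefficient as correct if within rel_tol of the true value (assumed reading)."""
    return abs(estimate - truth) <= rel_tol * abs(truth)

true_vars = {"X2", "X6", "X10", "X13"}
level_partial = accuracy_level({"X2", "X6"}, true_vars, 1, False)
level_extra = accuracy_level({"X2", "X6", "X7"}, true_vars, 1, False)
```

Treating a constant-only equation as level 4 follows the scale given in the Figure 3 caption.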
Part Two: Prediction of Lung Injury Using Chemical Exposure Data
The next goal of this work was to apply SR to chemical measurements from wildfire event simulations to predict total protein (TP) (Figure 2B). TP reflects the total amount of protein contained in lung fluid samples (here, bronchoalveolar lavage fluid), as measured using the Coomassie Plus Protein Assay (Pierce Chemical). Increased concentrations of TP in lung fluid indicate an increase in airway cell injury (e.g., increasing cell membrane permeability) and/or an increase in protein signals emitted from cells in stress (e.g., protein cytokine release to recruit immune cells in response to injury). Thus, TP represents a general marker of toxicity capturing various mechanisms of action elicited by chemical exposures and is commonly used in inhalation toxicology studies. As further detailed in the results, PySR’s SR algorithm performed the best across all model simulations and therefore was used for analyses in Parts Two and Three.
Preprocessing of Chemical Exposure Data
The chemical exposure data set consisted of 86 chemical concentrations across the different biomass smoke samples, including alkanes, alkenes, inorganic atoms and molecules, ionic constituents, methoxyphenols, levoglucosan, and polycyclic aromatic hydrocarbons (PAHs). Prior to analysis, chemical concentrations below the limits of detection (LOD) were imputed by dividing the LOD by the square root of 2. All chemical concentrations were then converted to log(mol/L). TP remained in mg/mL because the diverse protein mixture prevented determination of a precise molar mass needed for molar conversion. Forty percent of samples were randomly assigned to the testing set, which was withheld during model construction.
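The preprocessing steps above can be sketched as follows. The concentrations and limits of detection are randomly generated placeholders assumed to already be in mol/L, base-10 logarithms are assumed because the text does not state the base, and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_chem = 60, 5

# Placeholder data: hypothetical concentrations (mol/L) and per-chemical LODs
conc = rng.uniform(0.5, 10.0, size=(n_samples, n_chem))
lod = np.full(n_chem, 1.0)

# Impute values below the LOD with LOD / sqrt(2)
below = conc < lod
imputed = np.where(below, lod / np.sqrt(2), conc)

# Log-transform (base 10 is an assumption of this sketch)
log_conc = np.log10(imputed)

# Randomly withhold 40% of samples as the testing set
idx = rng.permutation(n_samples)
n_test = int(round(0.4 * n_samples))
test_idx, train_idx = idx[:n_test], idx[n_test:]
```

The testing indices are set aside before any model fitting so that reported test error reflects samples unseen during model construction.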
Generation of Reduced Chemical Exposure Data Inputs
Given that the number of chemicals is greater than the number of samples in this data set (commonly referred to as the p > n structure in large data sets) and that evolutionary algorithms may struggle to find the optimal solutions when the parameter space is large, we explored dimensionality reduction strategies. Specifically, we evaluated principal component analysis (PCA) and regularization-based feature selection via elastic net regression. PCA was implemented using the PCA function from sklearn.decomposition. Elastic net regression was applied using ElasticNetCV from sklearn.linear_model, with 3-fold cross-validation to determine the optimal regularization parameter. This method shrank the coefficients of less significant variables, and we retained only those with nonzero coefficients.
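A minimal sketch of the two reduction strategies is shown below on placeholder data. Standardizing the inputs, retaining components up to 95% of the variance, and the `max_iter` and `random_state` values are assumptions of this example and are not stated in the text; only the use of `PCA`, `ElasticNetCV`, and 3-fold cross-validation comes from the description above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Placeholder training data with more features (86) than samples (36)
X_train = rng.normal(size=(36, 86))
y_train = 5.0 * X_train[:, 0] - 4.0 * X_train[:, 1] + rng.normal(scale=0.1, size=36)

# Standardization is an assumption of this sketch
X_std = StandardScaler().fit_transform(X_train)

# Elastic net with 3-fold CV; keep features with nonzero coefficients
enet = ElasticNetCV(cv=3, random_state=0, max_iter=10000).fit(X_std, y_train)
selected = np.flatnonzero(enet.coef_)
X_enet = X_train[:, selected]

# PCA; the 95% variance threshold is an assumption
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X_std)
```

Either `X_enet` or `X_pca` would then replace the full matrix as the SR input, trading some information for a smaller search space.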
Application of Symbolic Regression to Full and Reduced Chemical Exposure Data Sets
Each of the three inputs (the full data set, PCA-reduced, and elastic net-reduced) was run separately for a total of 500 iterations, a number that provided robust model convergence. Available mathematical operations included addition, subtraction, multiplication, division, and exponentiation. Resulting HOF equations were assessed by calculating the corresponding RMSE on both the training and testing data. A candidate model equation was selected using error metrics, variable importance scoring, and expert-driven domain knowledge.
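The RMSE assessment of HOF equations on training and testing data can be sketched as below. The hall-of-fame entries here are hand-written stand-ins for equations an SR package might return, and the data and split are synthetic; none of these values come from the study.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rng = np.random.default_rng(3)
X = rng.uniform(1.0, 2.0, size=(60, 3))
y = 2.0 * X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=60)
train, test = np.arange(36), np.arange(36, 60)

# Hypothetical hall-of-fame: (complexity, label, callable)
hall_of_fame = [
    (1, "c", lambda X: np.full(len(X), 4.5)),
    (3, "c * x0", lambda X: 3.0 * X[:, 0]),
    (5, "c * x0 * x1", lambda X: 2.0 * X[:, 0] * X[:, 1]),
]

# Score every candidate on both splits
report = [(c, label, rmse(y[train], f(X[train])), rmse(y[test], f(X[test])))
          for c, label, f in hall_of_fame]
best = min(report, key=lambda r: r[3])
```

Choosing by test RMSE alone is only a starting point; in the workflow described here, variable importance and expert review also inform the final selection.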
Part Three: Prediction of Lung Injury Using Transcriptomic Data
In a parallel analysis, we applied SR using PySR to transcriptomic data from mouse lung tissues exposed to wildfire smoke to predict total protein (TP) (Figure 2C).
Preprocessing of Transcriptomic Data
To streamline this analysis, we utilized genes that were differentially expressed in mouse lung tissue, as identified by Koval et al. To be included in this study, genes were required to have an adjusted p-value (p adj) < 0.1 and an absolute fold change >2.5 in at least one exposure condition, yielding a total of 371 genes. Samples were split into training and testing sets as was done in the chemical exposure-based analysis.
Generation of Reduced Transcriptomic Data Inputs
To further reduce input transcriptomic data, PCA and least absolute shrinkage and selection operator (Lasso) regression were applied before SR modeling, using PCA from sklearn.decomposition and LassoCV from sklearn.linear_model. While the application of PCA directly parallels the approach used on the chemical data, the strong regularization provided by Lasso was preferred over elastic net due to the larger number of starting variables and highly correlated data structure.
Application of Symbolic Regression to Full and Reduced Transcriptomic Data Sets
Each of the three transcriptomic data scenarios (full data set, PCA-reduced, and Lasso-reduced) was analyzed for 3000 SR iterations. This is an increase relative to the chemical data because the larger parameter space requires more iterations to reach convergence (see “Model convergence and computational efficiency across simulated tests” below). Available mathematical operations included addition, subtraction, multiplication, division, and exponentiation. The final model equation was again selected using error metrics, variable importance scoring, and expert-driven domain knowledge.
Novel Variable Importance Scoring
While SR is lauded for its interpretability, a noticeable deficiency of SR compared to other methods is the lack of variable prioritization (e.g., Gini-based importance and beta coefficients). To address this gap, we developed an approach for assigning variable importance by considering how consistently a given variable appeared in the HOF and the direction/magnitude of its influence on resulting predictions for these occurrences. We began by creating an ensemble of top-performing models by collecting all HOF equations from each iteration and then removing those with high error on the training data (RMSE ≥19 mg/mL for chemical data and RMSE ≥17 mg/mL for simulated and transcriptomic data). For each variable appearing in the resulting ensemble, we calculated its directional importance (V_i) as
| 3 |
where i is a variable identifier and Ω is the domain of integration and is chosen to be the bounds of the independent variable,
$$\Omega = [\min(X_i), \max(X_i)] \tag{4}$$
To assess the importance across all models in which the variable appeared, the individual contributions were summed for each variable and normalized by the variable’s observed data range. Final contribution scores were scaled between −1 and 1 by dividing by the maximum absolute contribution.
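The sketch below illustrates one plausible reading of this scoring procedure and should not be taken as the exact DECS formula. It assumes that the directional importance of a variable in a given model is the integral of the model's partial derivative with respect to that variable over its observed range, with all other variables held at their medians; both the integrand and the choice of medians are assumptions of this example. The ensemble, function names, and data are hypothetical.

```python
import numpy as np

def directional_importance(func, X, i, n_grid=200):
    """Integrate d(func)/d(x_i) over the observed range of x_i.
    Other variables are fixed at their medians (an assumption of this sketch)."""
    grid = np.linspace(X[:, i].min(), X[:, i].max(), n_grid)
    base = np.tile(np.median(X, axis=0), (n_grid, 1))
    base[:, i] = grid
    dfdx = np.gradient(func(base), grid)              # numerical partial derivative
    return float(np.sum((dfdx[1:] + dfdx[:-1]) / 2.0 * np.diff(grid)))  # trapezoid rule

def contribution_scores(ensemble, X):
    """Sum contributions per variable, normalize by data range, scale to [-1, 1]."""
    totals = np.zeros(X.shape[1])
    for func, used_vars in ensemble:
        for i in used_vars:
            totals[i] += directional_importance(func, X, i)
    totals /= X.max(axis=0) - X.min(axis=0)
    peak = np.max(np.abs(totals))
    return totals / peak if peak > 0 else totals

rng = np.random.default_rng(5)
X = rng.uniform(1.0, 2.0, size=(40, 3))

# Hypothetical ensemble: (model callable, indices of variables it uses)
ensemble = [
    (lambda X: 2.0 * X[:, 0], [0]),
    (lambda X: -1.0 * X[:, 1], [1]),
]
scores = contribution_scores(ensemble, X)
```

In this toy ensemble, the first variable receives the maximum positive score, the second a smaller negative score, and the third, which appears in no model, scores zero.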
Comparison to Random Forest and XGBoost
To determine how the resulting SR models compared to more traditional machine learning approaches used in toxicological modeling, random forest (RF) and extreme gradient boosting (XGBoost) models were trained for the chemical and transcriptomic data sets. The RF models were implemented using RandomForestRegressor from scikit-learn with 10,000 trees. Features were ranked by importance using the mean decrease in impurity. In parallel, XGBoost models were implemented using the XGBRegressor from the xgboost Python library with 2000 boosting rounds, a learning rate of 0.05, maximum tree depth of 6, subsampling and column sampling rates of 0.8, and squared error as the objective function. For XGBoost, variable importance was calculated using gain-based importance scores. For both approaches, model performance was evaluated using root mean squared error (RMSE) on the same held-out test data used for SR.
Data Visualization and Code Availability
Resulting heatmaps, bar charts, and line charts were created in R (v4.3.2) using the packages ggplot2 (v3.4.4) and pheatmap (v1.0.13). Scree plots resulting from PCA were created in Python (v3.12.4) using matplotlib (v3.9.2). Schematic visualizations and surrounding figure layouts were produced in Biorender.com. All analysis scripts and data used in this manuscript are publicly available on the Rager lab GitHub site (https://github.com/Ragerlab), with the complete chemistry data set from its parent publication available on the Rager lab Dataverse (https://dataverse.unc.edu/dataverse/ragerlab) and the complete transcriptomic data from its parent publication available on Gene Expression Omnibus under series GSE164542.
Results and Discussion
Study Overview
This study evaluated practical approaches for applying symbolic regression (SR) in toxicology and used these approaches to identify chemical and transcriptomic drivers of wildfire smoke lung injury. As shown in Figure 2, we (1) benchmarked three SR packages on simulated data across operator sets and noise levels, (2) modeled total protein (TP) using smoke chemical profiles with alternative input-reduction strategies (PCA and elastic net), and (3) modeled TP using transcriptomic predictors with analogous reduction strategies (PCA and Lasso). Across Parts Two and Three, model interpretation and selection leveraged DECS and expert review, and SR performance was benchmarked against random forest and XGBoost.
Part One: Simulated Tests
Overview of Data Simulation
The goal of SR is to identify analytic model equations that best represent a given data set in terms of both model fit and system feasibility. Assessing whether SR has identified this best fit can be challenging, especially in a toxicological context where data from model systems tend to be noisy and variables may have complex relationships tied to tightly regulated biological processes. For this reason, we opted to apply SR to a simulated data set prior to testing it on the wildfire simulation data to better understand how parameter selection impacts final results. To achieve this, measurements were simulated based on various distributions for 15 variables across 50 samples and a response outcome was calculated according to eq 2. When simulating these data, controlling the correlation structure across variables was imperative to ensure that variables not included in the response equation appear to be related rather than just noise, allowing for a more realistic scenario for SR to process. Additionally, we aimed to prevent variables in the response equation from being too highly correlated with the response itself, as this would oversimplify the problem. The heatmap displayed in Figure S1 demonstrates the correlation structure of the simulated data, showing varying degrees of correlation among the variables and the response. As expected, X2 and X6 both have moderate, positive correlations with the response variable, while X13 has a milder positive correlation and X10 has a negative correlation. Following simulation of the baseline variables, a variety of test cases were defined to evaluate variation in the amount of noise present in the data, as well as changes in the available operators for the algorithm to choose from. In total, this design resulted in 30 test cases for each package, which are outlined in Figure 2A.
Model Convergence and Computational Efficiency across Simulated Tests
As all three SR packages rely on an iterative optimization algorithm, the maximum number of iterations is an important parameter, with too few risking missed solutions and too many incurring unnecessary computational cost. To determine the optimal number of iterations for our workflow, we examined how error rates changed across iterations. Representative plots are shown in Figure S2, which show that for most model complexities, the error rate no longer decreased after ∼100 iterations, and very few changed at all after ∼300 iterations. These stability findings suggest that additional iterations are unlikely to yield performance gains and therefore are not necessary for the current analysis. The computational time for all test cases is outlined in Table S1 as both a total run time and average per test (total/30). This run time analysis revealed that PySR and gplearn are computationally more efficient than feyn, performing approximately 7.5× and 5.6× faster, respectively. While this difference seems large, all tests were still completed in under 7 h, making them practical for most timelines. However, for higher-dimensional inputs (e.g., gene expression with thousands of variables), runtimes may become limiting, and packages may not scale equally as data set size increases.
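A simple way to check for the kind of plateau described above is to locate the last iteration at which the best-so-far loss improved by more than a tolerance. The function name, the tolerance, and the example loss history below are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def last_improvement(loss_history, tol=1e-3):
    """Return the 1-based iteration of the last decrease in best-so-far loss larger than tol.
    Returns 1 if the loss never improves after the first iteration."""
    best = np.minimum.accumulate(np.asarray(loss_history, dtype=float))
    drops = best[:-1] - best[1:]              # improvement made at each later iteration
    improved = np.flatnonzero(drops > tol)
    return int(improved[-1]) + 2 if improved.size else 1

# Hypothetical loss history: improves at iterations 2 and 4, then plateaus
history = [10.0, 8.0, 8.0, 5.0, 5.0, 5.0]
plateau_start = last_improvement(history)
```

If the returned iteration is far below the iteration budget, as in the ∼100 of 500 pattern reported here, the budget is likely sufficient for that test.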
Variation in RMSE, Accuracy Level, and Variable Importance Based on Input Complexity, Operator Complexity, and Python Package
When evaluating how well the resulting models captured trends in the input data, both the RMSE of the model and how closely the model equation compared to eq 2 (i.e., accuracy levels outlined in the Methods) were considered (Figure 3). The ability to concurrently explore both of these metrics represents an advantage of using a simulated approach. While RMSE can be calculated on real data, it is agnostic to the “truth” and only measures how well the model fits the data. By combining the RMSE with comparison to the true underlying equation, we can identify cases where the RMSE is low, but the equation fails to capture the correct variable relationships. These results are visualized in heatmaps with the complexity of the input data (i.e., number of variables and amount of noise in the response) varied along the x-axis and the complexity of the operator space (i.e., the number and type of allowed mathematical operators) varied along the y-axis (Figure 3). Broadly, both the RMSE and equation accuracy level tend to worsen as the input complexity increases since the addition of unnecessary variables increases the parameter space and additional noise confounds signal. For the operator complexity, there is a considerable improvement in both RMSE (average decrease of 8.56) and accuracy level when moving from low to medium, which demonstrates that the correct variable relationships cannot be formed without the inclusion of the multiplication and division operators. Comparing the medium and high operator complexity, the RMSE results were comparable (RMSE = 3.69 and 2.39, respectively), with higher accuracy levels achieved for the high operator complexity.
Figure 3. Quality of SR results varies based on the Python package when applied to simulated biological data. (A) Root mean squared error (RMSE) for the 30 tests outlined in Figure 2 split by input complexity, operator complexity, and Python package. (B) Accuracy of corresponding model equations for each of the 30 tests outlined in Figure 2 ranked using the following scale: (1) Correct, (2) Partially correct, (3) Incorrect without extraneous variables, (4) Incorrect with extraneous variables or just a constant.
Performance of the three SR packages varied across the different input and operator complexity combinations, with PySR capturing the correct variable relationships most consistently. At the low operator complexity, feyn tended to have much lower RMSE than gplearn and PySR. However, this difference can be attributed to the fact that excluding multiplication in gplearn and PySR means that multiplication cannot occur at all, while the feyn package only disallowed multiplication between variables. Consequently, feyn was able to improve performance by including leading constants. While all packages demonstrated RMSE improvements when moving from the low to medium operator complexity, they are distinguished when comparing equation accuracy. In the case of gplearn and feyn, the correct equation is never identified, but a relationship between X2 and X6 is frequently captured. PySR improves upon this result, returning the correct equation for all input complexities. This trend persists in the high operator complexity space, where gplearn and feyn often recapitulate the multiplicative nature of X2 and X6 but fail to incorporate X10 and X13 accurately. What further distinguished packages in the high operator complexity space were the types of operators included. While gplearn and PySR did not include unnecessary mathematical operators, feyn frequently incorporated logarithms. In summary, we identified three general considerations from this simulation testing that are important when applying SR to real data: (1) it is imperative to run enough iterations to reach convergence, (2) correct variable relationships cannot be identified without inclusion of the necessary operators, and (3) a low error metric does not necessarily indicate accuracy, highlighting an advantage of biologist-in-the-loop selection where feasibility can be considered.
In addition to validating that this approach is suitable for biological data and identifying the best-performing package, the simulation study was also used to validate the proposed DECS framework. The goal of DECS is to assign an importance score to a variable based on how consistently it showed up in the HOF and the direction/magnitude of its influence on resulting predictions for these occurrences. DECS were generated based on the PySR results for each of the 15 simulated variables across the 30 test cases (Figure S3). In test cases where PySR was able to recover eq (e.g., the medium and high complexity spaces), X2 followed by X6 were consistently identified as the strongest positive contributors, while X10 was consistently assigned a negative association with the outcome. This aligns with eq , which has X2 and X6 multiplicatively associated with the outcome, while X10 appears in the denominator and acts to attenuate the outcome. In addition, X13 is one of the few additional variables that is repeatedly assigned a nonzero DECS, but its magnitude is notably smaller than those of X2, X6, and X10. This is expected because X13 contributes as an additive component, whereas X2 and X6 are multiplicative. The division by X10 further moderates the contribution of X13, making its apparent effect more variable across cases. Altogether, these results support that DECS are able to identify variables in a way that is reflective of the underlying response equation, therefore increasing our confidence in their utility for downstream analyses.
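To make the idea of an ensemble-level directional score concrete, the following Python sketch averages a signed finite-difference effect of each variable across a toy hall of fame. This is a hypothetical stand-in that only mirrors the qualitative description above (presence across the HOF plus direction and magnitude of influence); it is not the DECS formula defined in the Methods, and the function name `directional_scores`, the `step` size, the toy equations, and the sample values are all assumptions made for illustration.

```python
def directional_scores(equations, samples, variables, step=0.1):
    """Average signed finite-difference effect of each variable across an
    ensemble of equations (a hypothetical stand-in, not the published DECS)."""
    scores = {}
    for var in variables:
        total = 0.0
        for eq in equations:
            effect = 0.0
            for row in samples:
                bumped = dict(row)
                bumped[var] = row[var] + step  # nudge one variable upward
                effect += eq(bumped) - eq(row)
            total += effect / len(samples)
        # Equations that never use the variable contribute exactly zero,
        # so rarely selected variables are pulled toward zero.
        scores[var] = total / len(equations)
    return scores

# Toy hall of fame (hypothetical equations, not results from this study).
hof = [
    lambda r: r["X2"] * r["X6"],
    lambda r: r["X2"] * r["X6"] / r["X10"],
    lambda r: (r["X2"] * r["X6"] + r["X13"]) / r["X10"],
]

# Made-up sample rows; X4 stands in for an extraneous variable.
samples = [
    {"X2": 1.0, "X6": 2.0, "X10": 2.0, "X13": 0.5, "X4": 3.0},
    {"X2": 2.0, "X6": 1.5, "X10": 4.0, "X13": 1.0, "X4": 1.0},
    {"X2": 3.0, "X6": 1.0, "X10": 2.5, "X13": 0.2, "X4": 2.0},
]

scores = directional_scores(hof, samples, ["X2", "X6", "X10", "X13", "X4"])
for name, value in sorted(scores.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.3f}")
```

In this toy ensemble, the extraneous variable scores exactly zero because it never appears, X10 scores negative because it occurs only in denominators, and X13 receives a small positive score because it appears in a single equation as an additive term.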
Part 2: Chemical Exposures
Performance Variation Based on Input Data
During wildfire simulations, chemical concentrations were measured for a total of 86 chemicals, spanning alkanes, alkenes, inorganic atoms and molecules, ionic constituents, levoglucosan, methoxyphenols, and PAHs. To directly link exposures to downstream molecular changes, we aimed to use these chemical concentrations to predict lung toxicity using SR, with toxicity captured through measures of total protein (TP) in BALF samples from individual female CD-1 mice. Increased TP reflects epithelial barrier disruption and airway cell injury, including increased membrane permeability and stress-related protein release, and is therefore widely used as a general marker of pulmonary toxicity in inhalation studies. In addition to using the full set of chemicals as input for SR, we also assessed a PCA-reduced set consisting of four PCs (Figure S4) and an elastic-net-reduced set consisting of 31 chemicals (Table S2). For each of these inputs, the maximum number of iterations was set to 500. Convergence plots for each of these approaches are presented in Figure S5. No notable difference was observed in the computational efficiency (i.e., time) across these three approaches. Balancing interpretability and performance, we highlight the results of the elastic-net-reduced input below (Figure ).
Figure 4. Resulting SR variable rankings and TP predictions using the elastic-net-reduced data set. (A) DECS for the top 15 chemicals. Positive values indicate a positive association with TP, while negative values indicate an inverse association with TP. (B) RMSE for training and testing predictions using the final HOF equations ordered by complexity. The chosen equation is circled. (C) Individual TP measurements (bars) and predictions (red lines) on the testing data using the selected HOF equation.
Feature Importance Ranking Based on DECS
Examining DECS for the chemical data, two main trends emerge (Figure A). First, PAHs, such as benzo(k)fluoranthene, acenaphthylene, and benzo(b)fluoranthene, stood out as potent contributors to increased lung injury risk. The association of PAHs with elevated TP aligns with well-established links between PAH exposure and pulmonary injury, which have been attributed in part to oxidative stress, inflammatory cytokine signaling, and activation of the aryl hydrocarbon receptor (AhR) pathway. Interestingly, dibenzofuran and, to a smaller extent, retene deviated from this trend and were negatively associated with TP in our models. While this finding is consistent with the correlations between these two chemicals and TP in this data set (Figure S6), it is not supported by existing literature, which suggests that dibenzofuran and retene adversely affect lung health. For this reason, we approach models incorporating these chemicals with caution, as further discussed below. The second significant trend is that methoxyphenols, including eugenol, guaiacol, propylguaiacol, and ethylguaiacol, were found to be negatively associated with TP. This finding aligns with our previous mixtures modeling study, which suggested a protective effect of methoxyphenols in these same mice, as well as several other studies demonstrating anti-inflammatory and protective effects in the lungs. Comparing DECS to complementary variable importance scores from RF and XGBoost (Figures S7 and S8), we note several similarities across the three measures. Specifically, all three measures rank eugenol within the top two features, and all contain guaiacol or one of its derivatives in the top five, further underscoring an association with methoxyphenols. Additionally, we note that all three measures contain dibenzofuran in their top 15, and both RF and SR contain retene in their top 15. 
However, due to RF and XGBoost importance scores being unidirectional, we cannot determine if these chemicals contributed positively or negatively to the predicted TP values on average, highlighting an advantage of the new variable importance ranking approach using DECS.
Biologist-in-the-Loop Model Selection
To select a candidate model equation from the HOF (Table S3), a biologist-in-the-loop approach was used, which incorporated DECS (Figure A), associated error (Figure B), and expert domain knowledge. Combining all pieces of information, the following model was selected as an example among the top-performing models:
| 5 |
Because the chemicals are expressed as log-transformed mol/L values, which yield negative values over the given concentration range, increasing BkF (i.e., denominator becomes less negative) increases predicted TP, whereas increasing eugenol (i.e., numerator becomes less negative) results in a decrease in predicted TP. The inverse relationship between eugenol exposure and inflammation in the lung has been shown across several studies. For example, rodent studies found that eugenol treatment following exposure to lipopolysaccharide (LPS) decreased pulmonary inflammation and remodeling, as demonstrated by reduced neutrophil and total inflammatory cell counts in BALF, lower BALF protein levels (reflecting reduced alveolar-capillary barrier disruption), and attenuated lung histopathology. These improvements were accompanied by better lung mechanical function and reduced expression of pro-inflammatory cytokines and remodeling-associated markers, indicating broad suppression of LPS-induced inflammatory injury in the lung. Beyond LPS, eugenol treatment following exposure to diesel exhaust particles was shown to reduce inflammatory lung damage and dysfunction, including less alveolar collapse and lower apoptotic signaling. Studies that evaluate benzo[k]fluoranthene (BkF) as a standalone inhaled toxicant are relatively limited, in part because PAHs are commonly measured and tested as complex mixtures (e.g., combustion-derived PM and smoke) rather than as single compounds. Even so, there is some evidence linking BkF to adverse pulmonary outcomes. In human lung cancer samples, BkF has been reported as one of the major PAH components detected, with levels significantly higher than in healthy tissue. An in vivo mouse study also demonstrated that exposure to BkF caused systemic immunosuppressive effects, which could plausibly worsen susceptibility to downstream lung injury or infection in inhalation-exposure contexts. 
Mechanistically, PAHs like BkF can drive lung toxicity by binding AhR, resulting in the upregulation of CYP1 enzymes. This induction can promote metabolic bioactivation and increase reactive oxygen species (ROS) generation, driving oxidative stress and redox imbalance and, in some contexts, formation of DNA adducts that contribute to mutagenesis and carcinogenic processes. Together, AhR-CYP induction and ROS production provide a plausible route by which BkF and other PAHs could amplify epithelial injury and barrier dysfunction, consistent with increased TP.
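The sign behavior described for the selected chemical model, in which log-transformed eugenol sits in the numerator and log-transformed BkF in the denominator, can be checked numerically. The sketch below uses a bare ratio of logarithms as a hypothetical stand-in; the fitted constants of the selected equation are not reproduced here, and the base-10 logarithm, the `scale` parameter, and the concentration values are assumptions chosen only so that both logarithms are negative.

```python
import math

def ratio_of_logs(eugenol_molar, bkf_molar, scale=1.0):
    """Hypothetical stand-in: scale * log10(eugenol) / log10(BkF)."""
    return scale * math.log10(eugenol_molar) / math.log10(bkf_molar)

# Made-up concentrations below 1 mol/L, so both logs are negative.
base = ratio_of_logs(1e-6, 1e-8)

# Raising BkF makes the denominator less negative (smaller magnitude),
# so the ratio of two negative numbers grows.
more_bkf = ratio_of_logs(1e-6, 1e-7)

# Raising eugenol makes the numerator less negative (smaller magnitude),
# so the ratio shrinks.
more_eugenol = ratio_of_logs(1e-5, 1e-8)

print(base, more_bkf, more_eugenol)
```

The same ordering holds for any logarithm base greater than one, since changing the base rescales numerator and denominator by the same factor.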
The chemicals highlighted in eq capture plausible mechanistic contributors to lung toxicity, with error comparable to parallel modeling approaches (Table S4). However, it is worth noting that this equation does not yield the lowest error on the withheld testing data (Figure B), which is given by
| 6 |
While this model contains some elements contained in eq , it also expresses a substantially more complex functional form (i.e., multiple fitted constants and a higher-order interaction term), which increases the risk of overfitting and reduces interpretability. In addition, there is minimal evidence supporting dibenzofuran exposure being linked to decreased pulmonary inflammation/stress. As the goal of predictive modeling is to create models that can be extended to unseen data sets, we chose not to proceed with this equation due to limited evidence that this pattern would be reproducible. This decision reflects the biologist-in-the-loop objective of balancing predictive performance with mechanistic coherence and plausibility, and other high-performing equations were eliminated using similar reasoning. However, we note that selecting a singular representative equation can underrepresent predictors that contribute consistently across the broader set of well-performing solutions. Accordingly, DECS, which are computed over the full HOF ensemble, should serve as a complementary interpretation piece alongside the selected model to capture broader trends in the data. In cases where multiple equations are both well-supported and mechanistically plausible, it is also feasible to generate ensemble predictions by averaging across a subset of these equations, providing a way to align prediction with the ensemble-level interpretation captured by DECS. For example, model 8 in the final HOF incorporates acenaphthylene and Na (Table S3), which also had high-ranking DECS, and could be explored as a complement to eq . Importantly, we view under-supported predictors, such as dibenzofuran, as hypothesis-generating features that can guide targeted laboratory experiments to further investigate and mechanistically characterize their roles.
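The ensemble prediction option mentioned above amounts to averaging the outputs of several retained equations. A minimal Python sketch follows, assuming an unweighted mean; the two candidate equations, their coefficients, and the log-transformed exposure values are invented for illustration and are not the fitted HOF equations in Table S3.

```python
def ensemble_predict(equations, row):
    """Unweighted mean of predictions from a subset of retained equations."""
    if not equations:
        raise ValueError("at least one equation is required")
    return sum(eq(row) for eq in equations) / len(equations)

# Hypothetical equations over log-transformed concentrations
# (illustrative coefficients only, not fitted values from this study).
candidate_equations = [
    lambda r: 40.0 * r["eugenol"] / r["BkF"],
    lambda r: 25.0 + 5.0 * (r["acenaphthylene"] - r["eugenol"]),
]

# Made-up log-transformed exposure profile for one exposure group.
exposure = {"eugenol": -6.0, "BkF": -8.0, "acenaphthylene": -7.0}

prediction = ensemble_predict(candidate_equations, exposure)
print(prediction)
```

A weighted mean (e.g., weights inversely proportional to test error) would be a natural extension, but the choice of weights is a separate modeling decision that is not addressed here.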
Resulting TP Predictions Using Chemical Data
Applying eq , we obtain a final RMSE of 17.57 mg/mL on the test data. This compares to an RMSE of 17.26 mg/mL obtained by the RF model and 16.88 mg/mL obtained by the XGBoost model (Table S4). The magnitude of error on an individual level can be seen in Figure C, where measured TP levels for individual mice are represented as bars, while predicted TP levels based on eq are indicated by the corresponding red dash. A key trend that emerges from these data is that the magnitude and direction of error for these individuals vary considerably based on exposure group. This suggests that predictions may improve by implementing exposure-specific models, which can be achieved as sample databases are expanded. All mice within the same exposure group shared identical predicted values, reflecting the uniformity of their chemical exposures. Despite having the same exposures, there is variability in the TP levels within groups, indicating individual variability in response to exposure, possibly due to biological factors such as genetic background, metabolic differences, or immune response variability. For example, the prediction for the flaming peat exposure group matches the measured protein concentration almost perfectly for the first two mice but is a considerable overestimate for the third mouse. The presence of these variations underscores the importance of considering both group-level trends and individual deviations.
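Group-dependent error of this kind can be summarized by the mean signed error (predicted minus observed) within each exposure group. The Python sketch below shows one way to compute it; the group labels and TP values are made up for illustration and are not measurements from this study.

```python
from collections import defaultdict

def mean_signed_error_by_group(records):
    """records: iterable of (group, observed, predicted) tuples.
    Returns the mean of (predicted - observed) for each group."""
    errors = defaultdict(list)
    for group, observed, predicted in records:
        errors[group].append(predicted - observed)
    return {group: sum(vals) / len(vals) for group, vals in errors.items()}

# Made-up values: every animal in a group shares one prediction because
# the chemical exposure, and therefore the model input, is identical.
records = [
    ("group_1", 40.0, 45.0),
    ("group_1", 50.0, 45.0),
    ("group_2", 30.0, 50.0),
    ("group_2", 40.0, 50.0),
]

bias = mean_signed_error_by_group(records)
for group, value in bias.items():
    print(f"{group}: {value:+.1f}")
```

A mean near zero with nonzero individual errors points to within-group biological variability, whereas a consistently positive or negative mean points to a group-level bias that an exposure-specific model might remove.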
Part 3: Transcriptomics
Performance Variation Based on Data Input
To identify gene predictors of toxicological responses while incorporating individual biological variability, we next evaluated the feasibility of using transcriptomic data as input to SR for predicting lung toxicity as measured through TP. To test this approach, we leveraged the work of Koval et al., which identified DEGs in the 10 biomass burn scenarios. For the purposes of this study, a relatively strict statistical threshold was applied (|FC| ≥ 2.5 and padj < 0.10) to prioritize genes that displayed some of the greatest changes in expression. This filter yielded 371 genes that served as input to SR. While one could bypass this DEG-based filtering and use the full set of detected genes, optimization algorithms, such as those used in SR, tend to degrade in performance as the parameter space expands. Furthermore, the high correlation among some DEGs due to shared pathways can compound this issue. Consequently, we also employed Lasso regression and principal component analysis (PCA) as data reduction strategies (Figure S9). Based on our sensitivity analysis, the transcriptomic data required more iterations to reach model convergence and therefore each input was run for a total of 3000 iterations (Figure S10). The Lasso-reduced input, containing 31 genes, consistently minimized the root-mean-square error (RMSE), indicating that reducing the parameter space improved predictive accuracy (Tables S4 and S5). Therefore, the SR results detailed below and shown in Figure focus on the Lasso-reduced gene set.
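The DEG threshold described above can be expressed as a simple filter. In the Python sketch below, the record layout, the field names `fold_change` and `padj`, and the toy genes are assumptions for illustration; fold change is treated as a signed value whose absolute magnitude is compared against the cutoff.

```python
def filter_degs(results, fc_cutoff=2.5, padj_cutoff=0.10):
    """Keep genes with |fold change| >= fc_cutoff and adjusted p < padj_cutoff."""
    kept = []
    for record in results:
        passes_fc = abs(record["fold_change"]) >= fc_cutoff
        passes_p = record["padj"] < padj_cutoff
        if passes_fc and passes_p:
            kept.append(record["gene"])
    return kept

# Toy differential expression results (hypothetical genes and values).
toy_results = [
    {"gene": "gene_a", "fold_change": 3.1, "padj": 0.01},
    {"gene": "gene_b", "fold_change": -2.7, "padj": 0.08},
    {"gene": "gene_c", "fold_change": 2.5, "padj": 0.10},   # padj is not below the cutoff
    {"gene": "gene_d", "fold_change": 1.9, "padj": 0.001},  # fold change too small
]

selected = filter_degs(toy_results)
print(selected)
```

Because the fold change criterion is inclusive and the adjusted p-value criterion is strict, a gene sitting exactly on the p-value boundary is excluded.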
Figure 5. Resulting SR variable rankings and TP predictions using Lasso-reduced DEGs. (A) DECS for the top 15 genes. Positive values indicate a positive association with TP, while negative values indicate an inverse association with TP. (B) RMSE for training and testing predictions using the final HOF equations ordered by complexity. The chosen equation is circled. (C) Individual TP measurements (bars) and predictions (red lines) on the testing data using the selected HOF equation.
Feature Importance Ranking Based on DECS
In contrast with the DECS for the chemical data, the transcriptomic data narrowed in on a few critical genes (Figure A). Specifically, MYC proto-oncogene (Myc) displayed the strongest positive association with TP, likely reflecting its well-established involvement in cell proliferation, metabolic regulation, and stress response pathways. Ankyrin repeat and suppressor of cytokine signaling (SOCS) box containing 14 (Asb14) showed the strongest negative association, indicating a potential protective or regulatory function. The heavy reliance on specific genes is in contrast with the variable ranking yielded by RF and XGBoost analysis, where the importance scores were more evenly distributed across multiple genes (Figures S11 and S12). In particular, Asb14 and Gm21190 ranked highest in the RF and XGBoost feature importance analysis, whereas Myc, despite its strong DECS value, exhibited a lower relative importance. This discrepancy between DECS and RF/XGBoost importance scores likely arises from fundamental differences in how these methods assess relationships within the data. In SR, candidate equations evolve across generations, with well-performing models persisting and propagating through the search process. As a result, predictors that exert strong, stable effects on the response tend to recur in high-performing equations and remain represented in the HOF, leading to elevated DECS values. In contrast, ensemble tree methods distribute predictive signal across many correlated or interacting predictors: RF importance is typically based on impurity reduction or permutation effects across trees, while XGBoost importance reflects how often and how beneficially a feature is used for splits (e.g., gain/cover/weight), which can spread importance across multiple variables even when a subset has strong direct effects. Thus, SR with DECS provides a complementary perspective to ensemble methods by emphasizing parsimonious, directionally consistent predictors.
Biologist-in-the-Loop Model Selection
In addition to variable importance insights, the training and test RMSE trends for the HOF equations were considered. Examining Figure B, it can be observed that starting at HOF equation 8, there is a large discrepancy between the training and testing performance. This indicates that the more complex models are likely being overfit to the training data and consequently do not generalize well to the test data. Therefore, simpler equations were preferred when selecting a final model. Incorporating expert opinion, the following model was selected:
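The train/test divergence used here as a sign of overfitting can also be flagged programmatically as a screening aid before expert review. The Python sketch below returns the first equation, ordered by complexity, whose test RMSE exceeds its training RMSE by more than a chosen margin. The `max_gap` threshold and the RMSE values are hypothetical; in this study the cutoff was judged from the plotted trends rather than from a fixed numeric rule.

```python
def first_large_gap(train_rmse, test_rmse, max_gap):
    """Return the 1-based position of the first equation whose
    (test RMSE - train RMSE) exceeds max_gap, or None if none does."""
    pairs = zip(train_rmse, test_rmse)
    for index, (train, test) in enumerate(pairs, start=1):
        if test - train > max_gap:
            return index
    return None

# Made-up RMSE values for five equations ordered by increasing complexity.
train = [20.0, 18.0, 16.0, 12.0, 8.0]
test = [21.0, 19.0, 18.0, 22.0, 27.0]

cutoff = first_large_gap(train, test, max_gap=5.0)
print(cutoff)
```

Equations before the returned position would be the candidates carried forward for biologist-in-the-loop review.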
| 7 |
This model incorporates genes that ranked highly based on the derived DECS and was further prioritized based on biological knowledge of their roles in pulmonary toxicity. The Myc gene (or MYC in human models) is a key regulator of cell growth and proliferation, and dysregulated MYC expression, reflecting uncontrolled cell growth that is a hallmark of cancer, is among the most common oncogenic events in human malignancies. Notably, MYC expression has been shown to increase in a dose-dependent manner following exposure to benzo[a]pyrene in normal human bronchial epithelial cells. Cyp1a1 also showed a positive association with TP in mice, providing a direct mechanistic connection to the chemical results, as CYP1A1 is a canonical AhR-responsive xenobiotic-metabolizing enzyme that is commonly induced by PAH exposure. Asb14 encodes an ankyrin repeat and SOCS box-containing protein implicated in ubiquitin-mediated regulation of cytokine signaling, suggesting a potential role in dampening inflammatory pathways and cell proliferation. Therefore, the associations observed between TP and these genes are consistent with existing mechanistic knowledge.
Resulting TP Predictions Using Transcriptomic Data
Using eq to predict TP values, we obtain an RMSE of 15.12 mg/mL. In addition to being more interpretable, this represents a slight performance advantage over RF and XGBoost, which yielded RMSEs of 15.77 and 17.58 mg/mL, respectively (Table S4). Furthermore, this model demonstrates an improvement over the model based on chemical exposure data (RMSE = 17.57 mg/mL). We hypothesize that this improvement is attributable to the finer granularity of individual-level data, as well as to gene expression being a measure closer to the protein level. Examining specific predictions, the resulting model trends toward overestimation for samples that have relatively low TP values, as seen in the lowest TP sample in the following groups: flaming pine, flaming peat, flaming pine needles, smoldering pine needles, smoldering red oak, smoldering pine, and smoldering peat. This suggests that the model may struggle to accurately capture the lower end of the present TP range. The magnitude and direction of error are also somewhat dependent on the exposure group. For example, the predictions for smoldering eucalyptus samples underestimate TP. The previous work by Koval et al. demonstrated that each exposure group has a unique transcriptomic signature, and therefore it is unsurprising that a singular model representing all groups is unable to capture all nuances. This, again, underscores the potential for exposure-specific models based on expanded data sets.
Limitations and Future Directions
This study demonstrates the utility and promise of SR in the field of chemical toxicology, uncovering interpretable, closed-form relationships connecting chemicals, omics, and toxicological end points, though we recognize several limitations that warrant discussion. First, the relatively modest sample size used here prevents the development of exposure-specific models that could improve generalizability. Moreover, the biological response data in this study focused on a single acute postexposure time point. This snapshot may overlook important temporal dynamics in lung injury responses, as inflammatory and molecular processes can evolve markedly over different phases of recovery or disease progression. Additionally, this study focused on protein concentrations in BALF as the outcome of interest, which is only one of several potential inflammatory response end points in the lungs. As our group is actively producing additional in vivo and in vitro wildfire exposure data streams, future work will focus on harmonizing all data sources to explore the variability between exposure groups and across relevant time windows. Furthermore, we aim to expand these models by incorporating additional omics technologies (e.g., metabolomics, epigenomics, and expanded exposomics), as well as biological variables such as age and sex, which we anticipate will not only improve model performance but also enhance mechanistic insights. Finally, as we deepen our understanding of the chemical composition of wildfire smoke, we plan to constrain future SR models by integrating chemical reaction-based insights, thereby enhancing both the feasibility and practical applicability of the resulting models.
Conclusions
This study explored the use of SR as an interpretable AI/ML approach for modeling toxicological outcomes associated with wildfire smoke exposures. Initial simulation studies showed that while complexity of the operator set and data can hinder convergence, carefully chosen function spaces and parameter settings enable SR to reliably identify correct variable relationships, even in the presence of noise. Building on these insights, we applied SR to experimental data from mice exposed to chemically distinct biomass smoke scenarios, identifying key chemical and transcriptomic drivers of lung injury. For chemical data, SR models built using a panel of 31 chemicals were able to accurately predict TP, performing comparably to RF and XGBoost with an RMSE of 17.57 mg/mL, while consistently highlighting PAHs (e.g., BkF) and specific methoxyphenols (e.g., eugenol) as top contributors to pulmonary toxicity. These models provided insight into both the direction and magnitude of chemical effects. Likewise, SR modeling of transcriptomic data, using a set of 31 genes, achieved strong predictive performance (RMSE = 15.12 mg/mL) and revealed a prioritized gene set including Myc, Cyp1a1, and Asb14 with strong predictive utility, suggesting that molecular end points closer to protein-level changes may offer improved predictive power over chemical measurements alone. Newly introduced DECS enhanced interpretation of these results by quantifying the contribution and directionality of each predictor across an ensemble of best-fit SR models. These findings highlight the potential of SR for advancing predictive toxicology by providing transparent, mathematically defined models that enhance interpretability and facilitate translation into regulatory and public health applications. Future work should focus on expanding data sets, refining exposure-specific models, and integrating additional omics data to further improve predictive power. 
By bridging the gap between AI-driven modeling and human expertise, SR offers a powerful tool for uncovering critical exposure-response relationships and informing evidence-based interventions to mitigate wildfire smoke-related health risks.
Supplementary Material
Acknowledgments
The research described in this article has been reviewed by the Center for Public Health and Environmental Assessment, U.S. Environmental Protection Agency and approved for publication. Approval does not signify that the contents necessarily reflect the views or the policies of the Agency nor does mention of trade names or commercial products constitute endorsement or recommendation for use.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.chemrestox.5c00440.
Package performance, elastic net coefficients, model error comparison, chemical exposure data hall-of-fame, lasso coefficients, and transcriptomic data hall-of-fame (XLSX)
Correlation of simulated variables, DECS for simulated variables, simulated data model error sensitivity, chemical exposure data scree plot, chemical exposure data model error sensitivity, correlation between dibenzofuran/retene and TP, chemical exposure data RF importance, chemical exposure data XGBoost importance, transcriptomic data model error sensitivity, transcriptomic data scree plot, transcriptomic data RF importance, and transcriptomic data XGBoost importance (PDF)
CRediT: Jessie R Chappel conceptualization, formal analysis, investigation, methodology, software, visualization, writing - original draft; Yong Ho Kim data curation, validation, writing - review & editing; M. Ian Gilmour data curation, validation, writing - review & editing; Erin S. Baker validation, writing - review & editing; Ilona Jaspers validation, writing - review & editing; Timothy M Weigand conceptualization, methodology, software, supervision, writing - review & editing; Julia E. Rager conceptualization, funding acquisition, investigation, methodology, project administration, supervision, writing - original draft, writing - review & editing.
This study was supported by a Cooperative Agreement with the US Environmental Protection Agency (CR84033801), research grants from the National Institutes of Health (NIH) (1R01ES035878, P42ES031007, R01ES035878, and R01ES035878-02S1) and the National Institute of General Medical Sciences (R01 GM141277), and training grants from the NIH (T32ES007126). Additional support was provided by the Institute for Environmental Health Solutions at the Gillings School of Global Public Health and the Leon and Bertha Golberg Postdoctoral Fellowship.
The authors declare no competing financial interest.
References
- Westerling A. L., Hidalgo H. G., Cayan D. R., Swetnam T. W.. Warming and earlier spring increase western U.S. forest wildfire activity. Science. 2006;313(5789):940–943. doi: 10.1126/science.1128834. [DOI] [PubMed] [Google Scholar]
- Hurteau M. D., Westerling A. L., Wiedinmyer C., Bryant B. P.. Projected effects of climate and development on California wildfire emissions through 2100. Environ. Sci. Technol. 2014;48(4):2298–2304. doi: 10.1021/es4050133. [DOI] [PubMed] [Google Scholar]
- UNISDR Economic Losses, Poverty, and Disasters. 2017. https://www.preventionweb.net/files/61119_credeconomiclosses.pdf (accessed 2022 Nov 1).
- Reisen F., Duran S. M., Flannigan M., Elliot C., Rideout K.. Wildfire smoke and public health risk. Int. J. Wildland Fire. 2015;24(8):1029–1044. doi: 10.1071/WF15034. [DOI] [Google Scholar]
- Kim Y. H., King C., Krantz T., Hargrove M. M., George I. J., McGee J., Copeland L., Hays M. D., Landis M. S., Higuchi M.. et al. The role of fuel type and combustion phase on the toxicity of biomass smoke following inhalation exposure in mice. Arch. Toxicol. 2019;93(6):1501–1513. doi: 10.1007/s00204-019-02450-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim Y. H., Sinha A., George I. J., DeMarini D. M., Grieshop A. P., Gilmour M. I.. Toxicity of fresh and aged anthropogenic smoke particles emitted from different burning conditions. Sci. Total Environ. 2023;892:164778. doi: 10.1016/j.scitotenv.2023.164778. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim Y. H., Warren S. H., Krantz Q. T., King C., Jaskot R., Preston W. T., George B. J., Hays M. D., Landis M. S., Higuchi M.. et al. Mutagenicity and Lung Toxicity of Smoldering vs. Flaming Emissions from Various Biomass Fuels: Implications for Health Effects from Wildland Fires. Environ. Health Perspect. 2018;126(1):017011. doi: 10.1289/EHP2200. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sánchez-García C., Santín C., Neris J., Sigmund G., Otero X. L., Manley J., González-Rodríguez G., Belcher C. M., Cerdà A., Marcotte A. L.. et al. Chemical characteristics of wildfire ash across the globe and their environmental and socio-economic implications. Environ. Int. 2023;178:108065. doi: 10.1016/j.envint.2023.108065. [DOI] [PubMed] [Google Scholar]
- Campos I., Abrantes N. Forest fires as drivers of contamination of polycyclic aromatic hydrocarbons to the terrestrial and aquatic ecosystems. Current Opinion in Environmental Science & Health. 2021;24:100293. doi: 10.1016/j.coesh.2021.100293.
- Johnston F. H., Webby R. J., Pilotto L. S., Bailie R. S., Parry D. L., Halpin S. J. Vegetation fires, particulate air pollution and asthma: a panel study in the Australian monsoon tropics. Int. J. Environ. Health Res. 2006;16(6):391–404. doi: 10.1080/09603120601093642.
- Dohrenwend P. B., Le M. V., Bush J. A., Thomas C. F. The impact on emergency department visits for respiratory illness during the Southern California wildfires. West J. Emerg Med. 2013;14(2):79–84. doi: 10.5811/westjem.2012.10.6917.
- Delfino R. J., Brummel S., Wu J., Stern H., Ostro B., Lipsett M., Winer A., Street D. H., Zhang L., Tjoa T., et al. The relationship of respiratory and cardiovascular hospital admissions to the southern California wildfires of 2003. Occup Environ. Med. 2009;66(3):189–197. doi: 10.1136/oem.2008.041376.
- Hanigan I. C., Johnston F. H., Morgan G. G. Vegetation fire smoke, indigenous status and cardio-respiratory hospital admissions in Darwin, Australia, 1996–2005: a time-series study. Environ. Health. 2008;7:42. doi: 10.1186/1476-069X-7-42.
- Castro H. A., Goncalves Kdos S., Hacon Sde S. Trend of mortality from respiratory disease in elderly and the forest fires in the state of Rondonia/Brazil - period between 1998 and 2005. Ciênc Saúde Coletiva. 2009;14(6):2083–2090. doi: 10.1590/S1413-81232009000600015.
- Kollanus V., Prank M., Gens A., Soares J., Vira J., Kukkonen J., Sofiev M., Salonen R. O., Lanki T. Mortality due to Vegetation Fire-Originated PM2.5 Exposure in Europe-Assessment for the Years 2005 and 2008. Environ. Health Perspect. 2017;125(1):30–37. doi: 10.1289/EHP194.
- Curtis L. PM(2.5), NO(2), wildfires, and other environmental exposures are linked to higher Covid 19 incidence, severity, and death rates. Environ. Sci. Pollut Res. Int. 2021;28(39):54429–54447. doi: 10.1007/s11356-021-15556-0.
- Ghosh A., Payton A., Gallant S. C., Rogers K. L. Jr., Mascenik T., Hickman E., Love C. A., Schichlein K. D., Smyth T. R., Kim Y. H., et al. Burn Pit Smoke Condensate-Mediated Toxicity in Human Airway Epithelial Cells. Chem. Res. Toxicol. 2024;37:791–803. doi: 10.1021/acs.chemrestox.4c00064.
- Vance S. A., Kim Y. H., George I. J., Dye J. A., Williams W. C., Schladweiler M. J., Gilmour M. I., Jaspers I., Gavett S. H. Contributions of particulate and gas phases of simulated burn pit smoke exposures to impairment of respiratory function. Inhal Toxicol. 2023;35(5–6):129–138. doi: 10.1080/08958378.2023.2169416.
- Koval L. E., Carberry C. K., Kim Y. H., McDermott E., Hartwell H., Jaspers I., Gilmour M. I., Rager J. E. Wildfire Variable Toxicity: Identifying Biomass Smoke Exposure Groupings through Transcriptomic Similarity Scoring. Environ. Sci. Technol. 2022;56(23):17131–17142. doi: 10.1021/acs.est.2c06043.
- Kim Y. H., Warren S. H., Kooter I., Williams W. C., George I. J., Vance S. A., Hays M. D., Higuchi M. A., Gavett S. H., DeMarini D. M., et al. Chemistry, lung toxicity and mutagenicity of burn pit smoke-related particulate matter. Part Fibre Toxicol. 2021;18(1):45. doi: 10.1186/s12989-021-00435-w.
- Rager J. E., Clark J., Eaves L. A., Avula V., Niehoff N. M., Kim Y. H., Jaspers I., Gilmour M. I. Mixtures modeling identifies chemical inducers versus repressors of toxicity associated with wildfire smoke. Sci. Total Environ. 2021;775:145759. doi: 10.1016/j.scitotenv.2021.145759.
- Rogers K., WaMaina E., Barber A., Masood S., Love C., Kim Y. H., Gilmour M. I., Jaspers I. Emissions from plastic incineration induce inflammation, oxidative stress, and impaired bioenergetics in primary human respiratory epithelial cells. Toxicol. Sci. 2024;199:301. doi: 10.1093/toxsci/kfae038.
- Kim Y. H., Rager J. E., Jaspers I., Gilmour M. I. Computational Approach to Link Chemicals in Anthropogenic Smoke Particulate Matter with Toxicity. Chem. Res. Toxicol. 2022;35(12):2210–2213. doi: 10.1021/acs.chemrestox.2c00270.
- Schmidt C. W. Into the Black Box: What Can Machine Learning Offer Environmental Health Research? Environ. Health Perspect. 2020;128(2):22001. doi: 10.1289/EHP5878.
- Payton A., Roell K. R., Rebuli M. E., Valdar W., Jaspers I., Rager J. E. Navigating the bridge between wet and dry lab toxicology research to address current challenges with high-dimensional data. Front Toxicol. 2023;5:1171175. doi: 10.3389/ftox.2023.1171175.
- Chappel J. R., Kirkwood-Donelson K. I., Reif D. M., Baker E. S. From big data to big insights: statistical and bioinformatic approaches for exploring the lipidome. Anal Bioanal Chem. 2024;416(9):2189–2202. doi: 10.1007/s00216-023-04991-2.
- Phair R. D. Mechanistic modeling confronts the complexity of molecular cell biology. Mol. Biol. Cell. 2014;25(22):3494–3496. doi: 10.1091/mbc.e14-08-1333.
- Vishwarupe V., Joshi P. M., Mathias N., Maheshwari S., Mhaisalkar S., Pawar V. Explainable AI and Interpretable Machine Learning: A Case Study in Perspective. Procedia Computer Science. 2022;204:869–876. doi: 10.1016/j.procs.2022.08.105.
- Jia X., Wang T., Zhu H. Advancing Computational Toxicology by Interpretable Machine Learning. Environ. Sci. Technol. 2023;57(46):17690–17706. doi: 10.1021/acs.est.3c00653.
- Eytcheson S. A., Tetko I. V. Which Modern AI Methods Provide Accurate Predictions of Toxicological End Points? Analysis of Tox24 Challenge Results. Chem. Res. Toxicol. 2025;38(9):1443–1451. doi: 10.1021/acs.chemrestox.5c00273.
- Udrescu S.-M., Tegmark M. AI Feynman: A physics-inspired method for symbolic regression. Science Advances. 2020;6(16):eaay2631. doi: 10.1126/sciadv.aay2631.
- Angelis D., Sofos F., Karakasidis T. E. Artificial Intelligence in Physical Sciences: Symbolic Regression Trends and Perspectives. Archives of Computational Methods in Engineering. 2023;30(6):3845–3865. doi: 10.1007/s11831-023-09922-z.
- Makke N., Chawla S. Interpretable scientific discovery with symbolic regression: a review. Artificial Intelligence Review. 2024;57(1):2. doi: 10.1007/s10462-023-10622-0.
- Christensen N. J., Demharter S., Machado M., Pedersen L., Salvatore M., Stentoft-Hansen V., Iglesias M. T. Identifying interactions in omics data for clinical biomarker discovery using symbolic regression. Bioinformatics. 2022;38(15):3749–3758. doi: 10.1093/bioinformatics/btac405.
- Kogelman L. J. A., Falkenberg K., Ottosson F., Ernst M., Russo F., Stentoft-Hansen V., Demharter S., Tfelt-Hansen P., Cohen A. S., Olesen J., et al. Multi-omic analyses of triptan-treated migraine attacks gives insight into molecular mechanisms. Sci. Rep. 2023;13(1):12395. doi: 10.1038/s41598-023-38904-1.
- Daryakenari N. A., De Florio M., Shukla K., Karniadakis G. E. AI-Aristotle: A physics-informed framework for systems biology gray-box identification. PLOS Computational Biology. 2024;20(3):e1011916. doi: 10.1371/journal.pcbi.1011916.
- Keren L. S., Liberzon A., Lazebnik T. A computational framework for physics-informed symbolic regression with straightforward integration of domain knowledge. Sci. Rep. 2023;13(1):1249. doi: 10.1038/s41598-023-28328-2.
- Stephens T. Genetic Programming in Python, with a scikit-learn inspired API. 2015. https://github.com/trevorstephens/gplearn (accessed 2025).
- Cranmer M. Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl. arXiv. 2023. doi: 10.48550/arXiv.2305.01582.
- Matrix: Sparse and Dense Matrix Classes and Methods. 2024. https://CRAN.R-project.org/package=Matrix.
- Venables W. N., Ripley B. D. Modern Applied Statistics with S; Springer Science & Business Media, 2002.
- Henderson R. F. Use of bronchoalveolar lavage to detect respiratory tract toxicity of inhaled material. Experimental and Toxicologic Pathology. 2005;57:155–159. doi: 10.1016/j.etp.2005.05.004.
- Świercz R., Hałatek T., Stetkiewicz J., Wąsowicz W., Kur B., Grzelińska Z., Majcherek W. Toxic effect in the lungs of rats after inhalation exposure to benzalkonium chloride. International Journal of Occupational Medicine and Environmental Health. 2013;26(4):647–656. doi: 10.2478/s13382-013-0137-8.
- Kwon J.-T., Yang Y.-S., Kang M.-S., Seo G.-B., Lee D. H., Yang M.-J., Shim I., Kim H.-M., Kim P., Choi K., et al. Pulmonary toxicity screening of triclosan in rats after intratracheal instillation. Journal of Toxicological Sciences. 2013;38(3):471–475. doi: 10.2131/jts.38.471.
- Virgolin M., Pissis S. P. Symbolic Regression is NP-hard. arXiv preprint arXiv:2207.01018. 2022. doi: 10.48550/arXiv.2207.01018.
- Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830.
- Tibshirani R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996;58(1):267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x.
- Wickham H. ggplot2: Elegant Graphics for Data Analysis; Springer-Verlag: New York, 2016.
- pheatmap: Pretty Heatmaps. R package version 1.0.13. 2025. https://CRAN.R-project.org/package=pheatmap.
- Hunter J. D. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering. 2007;9(3):90–95. doi: 10.1109/MCSE.2007.55.
- NCBI Gene Expression Omnibus, 2025.
- Canzler S., Schor J., Busch W., Schubert K., Rolle-Kampczyk U. E., Seitz H., Kamp H., von Bergen M., Buesen R., Hackermüller J. Prospects and challenges of multi-omics data integration in toxicology. Arch. Toxicol. 2020;94(2):371–388. doi: 10.1007/s00204-020-02656-y.
- Moorthy B., Chu C., Carlin D. J. Polycyclic Aromatic Hydrocarbons: From Metabolism to Lung Cancer. Toxicol. Sci. 2015;145(1):5–15. doi: 10.1093/toxsci/kfv040.
- Holme J. A., Vondráček J., Machala M., Lagadic-Gossmann D., Vogel C. F. A., Le Ferrec E., Sparfel L., Øvrevik J. Lung cancer associated with combustion particles and fine particulate matter (PM2.5) - The roles of polycyclic aromatic hydrocarbons (PAHs) and the aryl hydrocarbon receptor (AhR). Biochem. Pharmacol. 2023;216:115801. doi: 10.1016/j.bcp.2023.115801.
- Stading R., Gastelum G., Chu C., Jiang W., Moorthy B. Molecular mechanisms of pulmonary carcinogenesis by polycyclic aromatic hydrocarbons (PAHs): Implications for human lung cancer. Seminars in Cancer Biology. 2021;76:3–16. doi: 10.1016/j.semcancer.2021.07.001.
- Peixoto M. S., da Silva Junior F. C., de Oliveira Galvão M. F., Roubicek D. A., de Oliveira Alves N., Batistuzzo de Medeiros S. R. Oxidative stress, mutagenic effects, and cell death induced by retene. Chemosphere. 2019;231:518–527. doi: 10.1016/j.chemosphere.2019.05.123.
- da Silva Junior F. C., Lin J., Hecker M., Batistuzzo de Medeiros S. R. Unveiling the molecular responses of human lung cells to retene: Transcriptomics insights and implications for toxicity. Environmental Toxicology and Pharmacology. 2025;119:104828. doi: 10.1016/j.etap.2025.104828.
- Duarte F. V., Teodoro J. S., Rolo A. P., Palmeira C. M. Exposure to dibenzofuran triggers autophagy in lung cells. Toxicol. Lett. 2012;209(1):35–42. doi: 10.1016/j.toxlet.2011.11.029.
- Zhang Z., Zhou M., He J., Shi T., Zhang S., Tang N., Chen W. Polychlorinated dibenzo-dioxins and polychlorinated dibenzo-furans exposure and altered lung function: The mediating role of oxidative stress. Environ. Int. 2020;137:105521. doi: 10.1016/j.envint.2020.105521.
- Houser K. R., Johnson D. K., Ishmael F. T. Anti-inflammatory effects of methoxyphenolic compounds on human airway cells. J. Inflamm (Lond). 2012;9:6. doi: 10.1186/1476-9255-9-6.
- Huang X., Liu Y., Lu Y., Ma C. Anti-inflammatory effects of eugenol on lipopolysaccharide-induced inflammatory reaction in acute lung injury via regulating inflammation and redox status. International Immunopharmacology. 2015;26(1):265–271. doi: 10.1016/j.intimp.2015.03.026.
- Magalhães C. B., Riva D. R., DePaula L. J., Brando-Lima A., Koatz V. L. G., Leal-Cardoso J. H., Zin W. A., Faffe D. S. In vivo anti-inflammatory action of eugenol on lipopolysaccharide-induced lung injury. J. Appl. Physiol. 2010;108(4):845–851. doi: 10.1152/japplphysiol.00560.2009.
- Murakami Y., Hirata A., Ito S., Shoji M., Tanaka S., Yasui T., Machino M., Fujisawa S. Re-evaluation of cyclooxygenase-2-inhibiting activity of vanillin and guaiacol in macrophages stimulated with lipopolysaccharide. Anticancer Res. 2007;27(2):801–807.
- Magalhães C. B., Casquilho N. V., Machado M. N., Riva D. R., Travassos L. H., Leal-Cardoso J. H., Fortunato R. S., Faffe D. S., Zin W. A. The anti-inflammatory and anti-oxidative actions of eugenol improve lipopolysaccharide-induced lung injury. Respiratory Physiology & Neurobiology. 2019;259:30–36. doi: 10.1016/j.resp.2018.07.001.
- Chniguir A., Saguem M. H., Dang P. M., El-Benna J., Bachoual R. Eugenol Inhibits Neutrophils Myeloperoxidase In Vitro and Attenuates LPS-Induced Lung Inflammation in Mice. Pharmaceuticals (Basel). 2024;17(4):504. doi: 10.3390/ph17040504.
- Zin W. A., Silva A. G. L. S., Magalhães C. B., Carvalho G. M. C., Riva D. R., Lima C. C., Leal-Cardoso J. H., Takiya C. M., Valença S. S., Saldiva P. H. N., et al. Eugenol attenuates pulmonary damage induced by diesel exhaust particles. J. Appl. Physiol. 2012;112(5):911–917. doi: 10.1152/japplphysiol.00764.2011.
- Cioroiu B. I., Tarcau D., Cucu-Man S., Chisalita I., Cioroiu M. Polycyclic aromatic hydrocarbons in lung tissue of patients with pulmonary cancer from Romania. Influence according as demographic status and ABO phenotypes. Chemosphere. 2013;92(5):504–511. doi: 10.1016/j.chemosphere.2013.02.014.
- Jeon T. W., Jin C. H., Lee S. K., Lee D. W., Hyun S. H., Kim G. H., Jun I. H., Lee B. M., Yum Y. N., Kim J. K., et al. In vivo and in vitro immunosuppressive effects of benzo[k]fluoranthene in female Balb/c mice. J. Toxicol Environ. Health A. 2005;68(23–24):2033–2050. doi: 10.1080/15287390491009147.
- Gastelum G., Jiang W., Wang L., Zhou G., Borkar R., Putluri N., Moorthy B. Polycyclic Aromatic Hydrocarbon-induced Pulmonary Carcinogenesis in Cytochrome P450 (CYP)1A1- and 1A2-Null Mice: Roles of CYP1A1 and CYP1A2. Toxicol. Sci. 2020;177(2):347–361. doi: 10.1093/toxsci/kfaa107.
- Kwack S. J., Lee B. M. Correlation between DNA or protein adducts and benzo[a]pyrene diol epoxide I-triglyceride adduct detected in vitro and in vivo. Carcinogenesis. 2000;21(4):629–632. doi: 10.1093/carcin/21.4.629.
- Moorthy B., Miller K. P., Jiang W., Ramos K. S. The atherogen 3-methylcholanthrene induces multiple DNA adducts in mouse aortic smooth muscle cells: role of cytochrome P4501B1. Cardiovasc. Res. 2002;53(4):1002–1009. doi: 10.1016/S0008-6363(01)00536-3.
- Tsay J. J., Tchou-Wong K. M., Greenberg A. K., Pass H., Rom W. N. Aryl hydrocarbon receptor and lung cancer. Anticancer Res. 2013;33(4):1247–1256.
- Quah Y., Jung S., Chan J. Y., Ham O., Jeong J. S., Kim S., Kim W., Park S. C., Lee S. J., Yu W. J. Predictive biomarkers for embryotoxicity: a machine learning approach to mitigating multicollinearity in RNA-Seq. Arch. Toxicol. 2024;98(12):4093–4105. doi: 10.1007/s00204-024-03852-w.
- Ahmadi S. E., Rahimi S., Zarandi B., Chegeni R., Safa M. MYC: a multipurpose oncogene with prognostic and therapeutic implications in blood malignancies. Journal of Hematology & Oncology. 2021;14(1):121. doi: 10.1186/s13045-021-01111-4.
- Nembrini S., König I. R., Wright M. N. The revival of the Gini importance? Bioinformatics. 2018;34(21):3711–3718. doi: 10.1093/bioinformatics/bty373.
- Hanahan D. Hallmarks of Cancer: New Dimensions. Cancer Discov. 2022;12(1):31–46. doi: 10.1158/2159-8290.CD-21-1059.
- Li J., Dong T., Wu Z., Zhu D., Gu H. The effects of MYC on tumor immunity and immunotherapy. Cell Death Discov. 2023;9(1):103. doi: 10.1038/s41420-023-01403-3.
- Wallbillich N. J., Lu H. Role of c-Myc in lung cancer: Progress, challenges, and prospects. Chin Med. J. Pulm Crit Care Med. 2023;1(3):129–138. doi: 10.1016/j.pccm.2023.07.001.
- Fields W. R., Desiderio J. G., Leonard R. M., Burger E. E., Brown B. G., Doolittle D. J. Differential c-myc expression profiles in normal human bronchial epithelial cells following treatment with benzo[a]pyrene, benzo[a]pyrene-4,5 epoxide, and benzo[a]pyrene-7,8–9,10 diol epoxide. Mol. Carcinog. 2004;40(2):79–89. doi: 10.1002/mc.20023.
- Yang Y., Ma D., Liu B., Sun X., Fu W., Lv F., Qiu C. E3 Ubiquitin Ligase ASB14 Inhibits Cardiomyocyte Proliferation by Regulating MAPRE2 Ubiquitination. Cell Biochem Biophys. 2024;82(2):715–727. doi: 10.1007/s12013-024-01223-x.