Exploring genetic influences on adverse outcome pathways using heuristic simulation and graph data science

Joseph D Romano; Liang Mei; Jonathan Senn; Jason H Moore; Holly M Mortensen

doi:10.1016/j.comtox.2023.100261

Author manuscript; available in PMC: 2023 Oct 12.

Published in final edited form as: Comput Toxicol. 2023 Jan 25;25:100261. doi: 10.1016/j.comtox.2023.100261

PMCID: PMC10569310 NIHMSID: NIHMS1933008 PMID: 37829618

Abstract

Adverse outcome pathways (AOPs) provide a powerful tool for understanding the biological signaling cascades that lead to disease outcomes following toxicity. The framework outlines downstream responses known as key events, culminating in a clinically significant adverse outcome as a final result of the toxic exposure. Here we use the AOP framework combined with artificial intelligence methods to gain novel insights into genetic mechanisms that underlie toxicity-mediated adverse health outcomes. Specifically, we focus on liver cancer as a case study with diverse underlying mechanisms that are clinically significant. Our approach uses two complementary AI techniques: generative modeling via automated machine learning and genetic algorithms, and graph machine learning. We used data from the US Environmental Protection Agency’s Adverse Outcome Pathway Database (AOP-DB; aopdb.epa.gov) and the UK Biobank’s genetic data repository. We use the AOP-DB to extract disease-specific AOPs and build graph neural networks used in our final analyses. We use the UK Biobank to retrieve real-world genotype and phenotype data, where genotypes are based on single nucleotide polymorphism data extracted from the AOP-DB, and phenotypes are case/control cohorts for the disease of interest (liver cancer) corresponding to those adverse outcome pathways. We also use propensity score matching to appropriately sample based on important covariates (demographics, comorbidities, and social deprivation indices) and to balance the case and control populations in our machine learning training/testing datasets. Finally, we describe a novel putative risk factor for liver cancer that depends on genetic variation in both the aryl-hydrocarbon receptor (AHR) and ATP binding cassette subfamily B member 11 (ABCB11) genes.

Keywords: Adverse outcome pathway, Liver cancer, Genetic programming, Graph data science

1. Introduction

Informatics and computational methods have revolutionized biomedical research and enabled scientists to explore questions that are either infeasible or impossible through traditional experimentation alone [32]. In environmental health and toxicology, common computational tasks include building and training models that predict various chemical properties, conducting statistical analysis of observational and epidemiological data to better understand exposure-related health outcomes, and performing network analyses to discover key processes in biochemical pathways, among others [17,38]. Despite the successes made using these methods, some key deficiencies have become apparent in toxicological research, such as a lack of richly structured, multimodal biomedical data describing chemicals and the biological systems that respond to chemical exposure [31] and a paucity of novel methods for discovering new knowledge from these complex data resources [36]. In this paper, we draw on both richly structured data resources and novel analytical methods to gain new insights into a phenomenon of growing interest: the influence of genetics on susceptibility to an adverse outcome following specific chemical exposures.

Adverse Outcome Pathways (AOPs) are pathway-like descriptions that outline the mechanistic associations between molecular exposure events and higher-order clinical and population-level outcomes that may arise from the exposure [2,26]. AOPs consist of molecular initiating events (MIEs), key events (KEs), and adverse outcomes (AOs). By definition, a KE is any internal step within an AOP at some level of biological organization, and an MIE is a particular kind of KE that both initiates an AOP and consists of a molecular interaction between a toxicant and a body component. AOPs are classified according to their respective health outcomes, and AOPs associated with similar outcomes often overlap to create an ‘AOP network.’ An AOP’s set of KEs can include genetic polymorphisms that are associated with higher risk of the adverse outcome. For example, colon cancer AOPs include 53 unique SNP associations originally derived from GWAS [37]. This study examines the influence that genetic phenomena have on susceptibility to adverse outcomes after specific chemical exposures, using AOPs as a reference framework.

Methodologically, one area in particular that has experienced rapid growth, and holds great promise in all areas of biomedicine, is artificial intelligence (AI). AI broadly aims to construct computational systems that make intelligent decisions based on available data, knowledge, and/or human input. The scope of AI is broad and usually nebulously defined. In this paper, we explore two areas within AI: evolutionary algorithms and graph data science. Evolutionary algorithms are a family of algorithms that imitate processes found in biological evolution to optimize a system (e.g., a predictive model, a symbolic mathematical equation, or even another algorithm). Unsurprisingly, evolutionary computation is often used in computational biology, for example, in the context of simulating natural systems or processes [10,11,22] and building machine learning classifiers that perform well on a specific task [16,23,28]. Graph data science refers to the quantitative analysis of graphs, sometimes known as networks (e.g., biological networks), which consist of a set of nodes connected by a set of edges that define relationships between those nodes [7,27]. Some tasks within graph data science involve community detection [9], identification of the shortest paths linking two nodes in a graph [12], determining ‘hub nodes’ that play critical roles in the global structure of a graph [7,41], and using computational algorithms that yield quantitative understandings of the behavior and characteristics of a given graph [1,15]. Since AOPs can be represented as graphs, graph data science provides a powerful set of tools for discovering properties of AOPs that are not obvious through manual inspection.

Here, we propose a novel approach that leverages these two areas of AI to gain understanding of the mechanisms underlying genetic influences on toxic adverse outcomes, without the inclusion of associated case-control information, and we subsequently evaluate the approach in the context of toxicity-mediated adverse outcome pathways involved in liver cancer (LC). Briefly, we train interpretable generative models to construct synthetic datasets resembling real-world LC AOP genotype data via the HIBACHI (Heuristic Identification of Biological Architectures for simulating Complex Hierarchical genetic Interactions) software, and introspect the best models produced by HIBACHI for the most prominent AOP SNPs that influence LC outcomes. HIBACHI is a command line utility based on genetic programming (GP) that generates (synthetic) datasets with interactions between input features [24,25]. It uses the (μ + λ) evolutionary algorithm [6] to construct trees of primitive mathematical operations that can represent interactions between independent variables. For example, when applied to genetic data, these feature interactions may represent epistasis or mechanisms underlying polygenic traits. HIBACHI can take an existing dataset – referred to in the context of GP as a model – as input, which is then used to evaluate the fitness of candidate output datasets. Our hypothesis is that HIBACHI can create synthetic datasets of SNPs involved in AOPs that behave the same as real data for the same AOPs. This will allow us to explore the interpretable generative models used to create the synthetic data, which gives insights into interactions between specific features in the real data used to train HIBACHI. Conceptually, this process can be likened to a brute-force version of symbolic regression [18] that avoids pitfalls arising from statistical analyses on genetic data with complex interactions between features [40].
Importantly, this approach utilizes genomic and phenotypic data from real-world populations, combined with information and knowledge sourced from publicly available, open access databases describing mechanisms of toxicity. Our methods are generalizable to other diseases of interest and provide a new framework for toxicologists to explore genetic mechanisms that underlie toxic adverse outcome susceptibility.

2. Methods

2.1. Data sources

Our analysis uses data from the US Environmental Protection Agency’s Adverse Outcome Pathway Database (AOP-DB) and the UK Biobank (UKBB). The AOP-DB provides a formal structure for AOPs and their contained key events, as well as the relationships and associations between key events, genes (and their variants), metabolic pathways, diseases, and other relationships of toxicological interest. Data in the AOP-DB are aggregated from third-party public databases, including automated data pulls from the AOP-Wiki [26], as part of the OECD-supported EAGMST AOP-KB sub-group effort.

The UKBB is a large collection of longitudinal genetic, clinical, and demographic data on more than 500,000 adult volunteers in the UK, and is available to the international research community via application (https://biobank.ndph.ox.ac.uk/showcase/index.cgi) [29,35]. These data are suitable for observational analysis of a vast array of clinical phenomena. Here, we utilized data on single nucleotide polymorphisms (SNPs), disease diagnoses, and relevant demographics data collected through extensive patient questionnaires. We use SNPs to establish genotypes that are implicated in AOPs relevant to the toxic outcome of interest, diagnoses to construct case and control cohorts, and demographic data to balance cohorts with respect to a number of demographic and clinical covariates of interest. All UKBB data used in this study are from the current data release as of November 2020.

2.2. Obtaining genotypes for cohort patients

In this study, we focus on LC as a clinical endpoint of interest, but our methods are generalizable to other diseases. Although there are several major subtypes of LC, we treat it as a single disease phenotype, both because established LC AOPs lack granularity and because doing so provides a larger training dataset for the HIBACHI program. To find genetic variants that play a role in the etiology of LC, we retrieve AOPs related to LC and extract SNPs annotated to key events within those AOPs. Using the AOP-DB, we query AOP titles, organ specificity annotations, and event components (KEs and MIEs) for the presence of the terms “liver” and “hepatocellular” to fetch AOPs related to LC. These AOPs, MIEs, and KEs are listed in detail in Table S1. We then manually remove any AOPs describing hepatic steatosis – a disease that, while a known risk factor for LC, has a distinctly different underlying etiology (Schulz et al. 2015). Using these identified AOPs, we retrieve annotations to the EntrezGene database via associations present in the AOP-DB’s “AOP_gene” table [26]. In creating the AOP-DB, SNPs associated with KEs were originally obtained from the GTEx v7 Single Tissue eQTL dataset [13] and from the GWAS v1.0.2 All Associations dataset (https://www.ebi.ac.uk/gwas/docs/methods/criteria).

Finally, we assess overlap between the AOP SNPs and SNPs included in the UKBB genetic data. For every SNP we identify in the AOP data, we obtain genotypes at that locus for all patients in the cohorts defined below and encode them using an additive model (i.e., homozygous major allele is “0”, heterozygous is “1”, and homozygous minor allele is “2”), since this format is easily consumed by downstream analysis tools (e.g., HIBACHI). All AOP SNPs not included in the UKBB data were omitted from consideration in downstream analyses. It should be noted that all LC/SNP associations are determined using expert-curated biomedical knowledge originally mined from the AOP-Wiki, and are therefore independent from any observational biases that may be present in the UK Biobank genotype data.
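The additive encoding described above can be illustrated with a minimal sketch. The function name, the two-letter genotype string format, and the example calls are assumptions made for illustration; they are not taken from the UKBB data format or from HIBACHI.

```python
from typing import Optional


def encode_additive(genotype: str, minor_allele: str) -> Optional[int]:
    """Count copies of the minor allele in a two-letter genotype such as 'AG'.

    Returns None for missing or malformed calls so they can be handled downstream.
    """
    if len(genotype) != 2 or not genotype.isalpha():
        return None
    return sum(1 for allele in genotype.upper() if allele == minor_allele.upper())


# Hypothetical genotype calls at one locus where 'G' is the minor allele.
calls = ["AA", "AG", "GA", "GG", "--"]
encoded = [encode_additive(g, minor_allele="G") for g in calls]
print(encoded)  # [0, 1, 1, 2, None]
```

Returning None for missing calls, rather than silently coding them as 0, keeps missingness distinguishable from the homozygous major genotype.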

2.3. Phenotyping and assembling patient cohorts

To assemble cohorts for statistical modeling of our toxic outcome of interest, we retrieved pertinent data from the UKBB [35]. We first filter all patients in the UKBB based on availability of SNPs included in our AOP network. Using the set of SNPs identified above (SNPs found in both AOPs and the UKBB), we retrieve unique identifiers for patients with that set of SNPs available. To construct raw (unbalanced) case and control cohorts, we then separated this set of patients into those with a diagnosis of LC (based on presence of the ICD-10 code prefix “C22”) and without LC (all others).
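As a rough illustration of the prefix-based phenotyping step, the following sketch splits patients into raw case and control sets. The patient identifiers, the dictionary layout, and the undotted code strings are hypothetical and do not reflect the actual UKBB schema.

```python
def split_cohorts(diagnoses, prefix="C22"):
    """Split patients into cases (any diagnosis code starting with prefix) and controls."""
    cases, controls = set(), set()
    for patient_id, codes in diagnoses.items():
        if any(code.startswith(prefix) for code in codes):
            cases.add(patient_id)
        else:
            controls.add(patient_id)
    return cases, controls


# Hypothetical diagnosis records keyed by patient identifier.
diagnoses = {
    "p1": ["C220", "E119"],
    "p2": ["I10"],
    "p3": [],
    "p4": ["K760", "C229"],
}
cases, controls = split_cohorts(diagnoses)
print(sorted(cases), sorted(controls))
```

Patients with no recorded diagnoses fall into the control set here, which mirrors the "all others" definition in the text.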

Because many environmental factors can act as confounding variables in observational analyses of complex diseases, these confounders need to be balanced in the case and control cohorts to minimize the risk of predictive models learning to distinguish patients based on the confounding variables rather than the presence or absence of the disease of interest. In the case of this study, the predictive model is the output of HIBACHI’s genetic programming algorithm, and the disease of interest is LC. Although there are several strategies for producing balanced cohorts, we used the propensity score matching (PSM) method. Briefly, PSM involves training a logistic regression model where input features are the confounders and the output is the propensity score, or probability of being a member of the treatment (case) group [5]. This logistic regression model is then used to match each sample in the case cohort to one or more samples in the control cohort based on having similar propensity scores. The resulting cohorts have an (approximately) balanced distribution of propensity scores within each possible value across all confounders. For confounders with continuous rather than categorical values (e.g., age), values are binned into equally sized groups across the range of values prior to matching. In doing so, PSM minimizes the estimation bias contributed by each confounding variable to the overall predictions of a model trained on the balanced dataset.

In this study, we performed PSM on the raw case and control cohorts using the following confounding features: age at recruitment, sex, ethnicity, and Townsend deprivation index (a composite measure of material deprivation within a population, incorporating employment status, car ownership, home ownership, and household overcrowding) [39]. Each of these is a known demographic confounder for LC risk, and the data come from UKBB questionnaires available for all patients. Additionally, inclusion of the Townsend index helps to ensure generalizability of study results across socioeconomic groups, particularly those with historically poor access to quality healthcare. We also included diabetes status (presence of the ICD-10 code prefix “E1”) as a confounder, as diabetes is a significant risk factor for LC [19]. We used the Pymatch library (https://github.com/benmiroglio/pymatch) to construct the propensity score model, perform the matching procedure, and visualize confounder imbalance before and after matching. Since LC is a relatively rare diagnosis in the UKBB data, we increased the size of our dataset for training HIBACHI by matching 4 control patients to each case patient. We specified a propensity score similarity threshold of 1 × 10⁻⁴ – the smallest value that retains 100% of the LC cohort.
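The matching step can be sketched as follows, assuming propensity scores have already been estimated by a logistic regression model. This is a simple greedy nearest-neighbor procedure without replacement, written for illustration only; it is not Pymatch's implementation, and the function name, identifiers, and toy scores are hypothetical. The toy example uses a loose threshold and 2 controls per case so that the behavior is easy to follow, whereas the study used a threshold of 1 × 10⁻⁴ and 4 controls per case.

```python
def match_controls(case_scores, control_scores, k=4, threshold=1e-4):
    """Greedily match up to k controls to each case within a propensity score threshold.

    Each control is used at most once. Returns {case_id: [control_id, ...]}.
    """
    available = dict(control_scores)
    matches = {}
    for case_id, score in case_scores.items():
        # Controls whose score lies within the threshold, nearest first.
        candidates = sorted(
            (cid for cid, s in available.items() if abs(s - score) <= threshold),
            key=lambda cid: abs(available[cid] - score),
        )
        chosen = candidates[:k]
        for cid in chosen:
            del available[cid]  # matching without replacement
        matches[case_id] = chosen
    return matches


# Hypothetical propensity scores.
case_scores = {"case1": 0.30, "case2": 0.70}
control_scores = {"c1": 0.28, "c2": 0.31, "c3": 0.34, "c4": 0.69, "c5": 0.72, "c6": 0.10}
matches = match_controls(case_scores, control_scores, k=2, threshold=0.05)
print(matches)
```

A tighter threshold yields closer matches but can leave cases with fewer than k controls, which is the trade-off behind choosing the smallest threshold that still retains the case cohort.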

2.4. Exploring genetic contributions to toxicity using genetic programming

We ran HIBACHI (available on GitHub at https://github.com/EpistasisLab/hibachi) on an input model consisting of patients in the PSM-balanced case and control cohorts constructed using the method described above. Specifically, we retrieved the LC SNPs of interest (described above) for each of the patients in the balanced case and control cohorts and used those to construct a feature matrix (in the 0,1,2 format, representing an additive or ordinal genetic model) with LC outcome being the binary target variable. We then trained HIBACHI on this LC dataset, with algorithm metaparameters of 100 generations of evolution and a population size of 100. Since HIBACHI outputs both a synthetic dataset with the characteristics of the training dataset as well as the generative model used to construct that dataset, we inspected both in order to explore genetic mechanisms that may govern susceptibility to LC following toxic exposures. To account for potential linkage disequilibrium (LD) between implicated SNPs, we computed pairwise R² and D’ values between all implicated SNPs (i.e., showing up more than once in the learned generative models) using the LDpair module in the National Cancer Institute’s LDlink toolkit [20]. Any pair of SNPs in statistically significant LD should be treated as suspect if they occur in the same generative model.
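To make the (μ + λ) strategy concrete, the toy sketch below evolves small Boolean expression trees over SNP features. It is not HIBACHI's code: the primitive set is reduced to three logical operations, genotypes are collapsed to carrier status (an assumption), fitness is plain classification accuracy, and all function names and the synthetic data are hypothetical. The defining property it does share with (μ + λ) is that parents and offspring compete in one pool, so the best fitness never decreases between generations.

```python
import random

OPS = {
    "XOR": lambda a, b: a ^ b,
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
}


def random_tree(rng, n_features, depth=2):
    """Build a random expression tree; leaves are feature indices."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(n_features)
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, n_features, depth - 1), random_tree(rng, n_features, depth - 1))


def evaluate(tree, row):
    """Evaluate a tree on one genotype row, returning 0 or 1."""
    if isinstance(tree, int):
        return 1 if row[tree] > 0 else 0  # assumption: any minor allele copy counts as 1
    op, left, right = tree
    return OPS[op](evaluate(left, row), evaluate(right, row))


def fitness(tree, X, y):
    """Fraction of rows whose predicted label equals the observed label."""
    return sum(evaluate(tree, row) == label for row, label in zip(X, y)) / len(y)


def mutate(rng, tree, n_features):
    """Replace a randomly chosen subtree with a fresh random subtree."""
    if isinstance(tree, int) or rng.random() < 0.3:
        return random_tree(rng, n_features)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(rng, left, n_features), right)
    return (op, left, mutate(rng, right, n_features))


def mu_plus_lambda(X, y, mu=10, lam=20, generations=15, seed=0):
    """Run a (mu + lambda) loop and return the best tree and per-generation best fitness."""
    rng = random.Random(seed)
    n_features = len(X[0])
    population = [random_tree(rng, n_features) for _ in range(mu)]
    history = []
    for _ in range(generations):
        offspring = [mutate(rng, rng.choice(population), n_features) for _ in range(lam)]
        pool = population + offspring  # parents compete with their offspring
        pool.sort(key=lambda t: fitness(t, X, y), reverse=True)
        population = pool[:mu]
        history.append(fitness(population[0], X, y))
    return population[0], history


# Synthetic genotypes (0/1/2) with a planted interaction between features 0, 1, and 2.
data_rng = random.Random(1)
X = [[data_rng.randrange(3) for _ in range(4)] for _ in range(200)]
y = [int(((r[0] > 0) ^ (r[1] > 0)) and (r[2] > 0)) for r in X]
best, history = mu_plus_lambda(X, y)
print(best, history[-1])
```

The study's settings (100 generations, population size 100) are much larger than the defaults used here, which are kept small so the sketch runs in well under a second.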

3. Results

3.1. AOPs and SNPs associated with liver cancer

Our initial query for LC AOPs finds 16 liver-related AOPs and 189 SNPs associated with these AOPs. AOPs 1, 37, 41, 46, 107, 108, and 117 are specific, describing a particular etiology of LC or hepatocellular carcinoma, while the other AOPs describe LC in a more general context. Interestingly, the AOPs describing liver fibrosis, hepatotoxicity, and liver injury contain no SNP associations, although a number of these AOPs are still under development. The AOPs that feature SNP associations often specify a primary gene, inhibitor, or activator that plays a key role in the AOP, such as ABCB11 for AOP 27, PPARα for AOP 37, AHR for AOP 41, and AFB1 for AOP 46. Five AOPs in this list were derived from rodent experimental data (AOPs 37, 41, 107, 108, and 117), while the rest are based upon human-derived evidence. A full list of LC AOPs and their associated SNPs (including omitted AOPs related to hepatic steatosis) is given in Supplemental Information (Table S1). See also Figs. 1 and 2.

Fig. 1. — Building balanced cohorts for learning interactions between AOP key events using HIBACHI.

Fig. 2. — Network diagram highlighting the SNPs found in HIBACHI’s most fit models, along with the network context of their associated genes and AOPs. Note that SNPs not highlighted by the HIBACHI models are omitted. A full network of all LC AOPs along with their full sets of associated genes and SNPs can be found in Supplemental Figure S1.

3.2. UKBB liver cancer cohort characteristics

Of the 189 SNPs identified within the AOPs, 25 are represented in the UKBB genotype data (Table 1). The remaining 164 SNPs may be missing due to limited coverage of genotyping panels, semantic inconsistencies between the AOP-DB and UKBB variant nomenclature, or other issues. We identified 488,377 patients with genotypes available for these SNPs. Of these patients, 580 had an LC diagnosis. We then generated balanced case and control cohorts using the propensity score matching method described above. To ensure that the matching procedure was effective, we generated plots for case/control covariate ratios both before and after the matching and used the chi-square test for independence to verify that these ratios are significantly different. For every covariate included in PSM, the difference before and after matching was highly significant, indicating that the dataset was highly unbalanced before PSM, and well-balanced after PSM. Recall that “balanced” in terms of PSM does not necessarily mean equal – rather, the counts of patients within each demographic group were sampled in a way that minimizes estimation bias contributed from each model covariate. For example, the most prevalent ethnicity in our dataset by far is “White – British”, in both the original and PSM-balanced datasets. Full details and visualizations of PSM are provided in Supplementary Information. The final, balanced dataset includes 2,895 patients (579 cases, 2,316 controls) with approximately equal distributions of all covariates in the two cohorts. We matched 4 controls to each case, to help compensate for the relative rarity of LC in the overall patient population.
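For readers unfamiliar with the chi-square test for independence, the sketch below computes the Pearson statistic for a contingency table of cohort membership against one covariate. The counts are hypothetical and unrelated to the study data, no continuity correction is applied, and the closed-form p-value shown is valid only for one degree of freedom (a 2 × 2 table).

```python
import math


def chi_square_statistic(table):
    """Return the Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical 2 x 2 table: rows are case/control, columns are two levels of a covariate.
table = [[30, 10], [20, 40]]
stat, dof = chi_square_statistic(table)
# For 1 degree of freedom the upper tail probability is erfc(sqrt(stat / 2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(stat, dof, p_value)
```

With these counts the expected cell values are 20, 20, 30, and 30, giving a statistic of 50/3 (about 16.67).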

Table 1.

Each relevant SNP, along with respective gene, associated AOP_id, and how each SNP was represented in the HIBACHI program.

SNP	Gene (HUGO)	AOP_id	HIBACHI identifier
rs2025516	NR1I3	107	X₁
rs4073054	NR1I3	107	X₂
rs115624142	NR1I3	107	X₃
rs116791819	NR1I3	107	X₄
rs12069336	NR1I3	107	X₅
rs72884586	ABCB11	27	X₆
rs563694	ABCB11	27	X₇
rs569805	ABCB11	27	X₈
rs16856247	ABCB11	27	X₉
rs552976	ABCB11	27	X₁₀
rs2287623	ABCB11	27	X₁₁
rs16856332	ABCB11	27	X₁₂
rs10172795	ABCB11	27	X₁₃
rs117263259	AHR	41	X₁₄
rs71540771	AHR	41	X₁₅
rs117132860	AHR	41	X₁₆
rs4476901	AHR	41	X₁₇
rs115256444	AHR	41	X₁₈
rs4410790	AHR	41	X₁₉
rs6968865	AHR	41	X₂₀
rs12670403	AHR	41	X₂₁
rs11109969	NR1H4	27	X₂₂
rs1625895	TP53	46	X₂₃
rs4253772	PPARA	37	X₂₄
rs5031002	AR	117	X₂₅


3.3. Using HIBACHI to infer AOP-related genetic interactions

The 7 best (i.e., having the highest fitness score on the balanced input dataset) models found by HIBACHI are shown in Table 2. Since our response variable (LC) is encoded as a binary target in the dataset (1 = LC, 0 = no LC), the models generally only produce binary outcomes. Note that specific motifs are replicated across several of the best models. This replication is indicative both of robust relationships between specific SNPs that influence risk and of the evolutionary nature of HIBACHI’s algorithm: the most fit models in each generation ‘survive’ and are subject to refinement by mutation in subsequent generations. Refer to the Discussion section for a more complete interpretation of the interactions suggested by the most fit models. The 4 SNPs that appear repeatedly in the 7 most fit models are X₇ (rs563694), X₁₀ (rs552976), X₁₉ (rs4410790), and X₉ (rs16856247).

Table 2.

Most fit generative models learned by HIBACHI, trained on the propensity score matched UKBB genotype dataset. Along with the model, HIBACHI produces a synthetic version of the training dataset constructed using that model. Higher fitness scores indicate better approximation of the training dataset. Arithmetic operations are applied to the values (0, 1, or 2) comprising the input dataset – for example, “X₁₀!” indicates “the factorial of the value representing SNP X₁₀”. “XOR” and “AND” are logical Boolean operations, and ‘mod’ is the modulo operation.


3.4. Estimating independence of identified SNPs via linkage disequilibrium

Of the 4 well-represented SNPs in the HIBACHI models, only X₇ (rs563694) and X₁₀ (rs552976) were found to be in linkage disequilibrium (LD), with an R² of 0.419 and a D’ of 0.868 [33]. Therefore, motifs involving both X₇ and X₁₀ (which are present in the top 4 models in Table 2) should be treated as suspect, since the SNPs will tend to segregate together. Nonetheless, their presence in the top 4 models is evidence that HIBACHI indeed detects meaningful patterns in the SNP data and incorporates those patterns into its learned generative models. Full results of the LD analysis are given in Supplemental Table S3.
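The two LD measures reported above have standard definitions in terms of allele and haplotype frequencies, sketched below. This is a textbook calculation rather than a description of how LDpair is implemented, the function name is hypothetical, the input frequencies are invented for illustration, and D’ is returned as an absolute value (some conventions keep the sign).

```python
def ld_stats(p_ab, p_a, p_b):
    """Compute |D'| and r^2 for two biallelic loci.

    p_a and p_b are the frequencies of allele A at locus 1 and allele B at locus 2;
    p_ab is the frequency of the haplotype carrying both A and B.
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r_squared


# Hypothetical frequencies: both alleles at 0.5, shared haplotype at 0.4.
d_prime, r_squared = ld_stats(p_ab=0.4, p_a=0.5, p_b=0.5)
print(d_prime, r_squared)
```

D’ normalizes D by its maximum attainable value given the allele frequencies, while r² is the squared correlation between alleles at the two loci, which is why a pair can show a high D’ alongside a moderate r² when allele frequencies differ.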

4. Discussion

The 4 SNPs implicated by HIBACHI are members of 2 AOPs: Cholestatic Liver Injury induced by Inhibition of the Bile Salt Export Pump (ABCB11), and Sustained AhR Activation leading to Rodent Liver Tumors. Since these two AOPs directly implicate key roles played by the Abcb11 and Ahr genes, these can be thought of as the central mediators of genetic risk to toxicity-induced LC. However, although these genes may be the most important in terms of disease etiology, the HIBACHI-identified SNPs may instead serve as regulatory mechanisms that influence the tendency of those genes to result in a disease state. The implications of this finding could impact many areas of research, including suggesting new therapeutic targets for treatment/prevention, previously unknown stressors, or even new subtypes of LC (e.g., hepatocellular carcinoma, cholangiocarcinoma, etc.) with different etiologies and progression of disease. This study shows how genetic programming can be leveraged to create new hypotheses for future targeted investigations. Although we focused particularly on LC, our approach should be easily generalizable to other diseases, given adequate AOP coverage for the disease and sufficient observational data to construct the respective cohorts.

Although in this study we demonstrate the ability of genetic programming and heuristic simulation to gain insights into the genetic mechanisms underlying toxicity risk, we have not yet explored the influence of specific stressors (i.e., toxicants) on these genetic mechanisms. For example, a genetic factor might influence whether an AOP is triggered by a certain stressor, but not other stressors. The key proteins involved in the two AOPs we describe above (ABCB11 and AHR) are well-studied and many ligands have been established. Currently known stressors for ABCB11 include cholestasis-inducing drugs (e.g., cyclosporine A, rifampicin, others) [30]. AHR has many known stressors, including the two diverse families known as the halogenated and polycyclic aromatic hydrocarbons [14]. As of now, no stressors are formally encoded for these two AOPs in either the AOP-DB or the AOP-Wiki, but we expect these data will be completed as data curation efforts for computational toxicology continue to mature.

4.1. Generative models are suggestive of epistatic interactions in conferring LC risk

As we discussed previously, the generative models produced by HIBACHI (see Table 2) are interpretable mathematical models that can be used to generate synthetic data with similar characteristics to the training data. These models can be likened to symbolic regression models, albeit computed using a brute-force search process with evolutionary refinement rather than via convex optimization. Therefore, specific operations in the highest ranked generative models should correspond to robust patterns that distinguish cases (LC) from non-cases (no LC) in the training dataset. The most common motif in the most fit models is (X₁₀ XOR X₇), which must be interpreted in light of our linkage disequilibrium analysis. Although these two SNPs are indeed in LD, the influential role they play suggests one or both could be highly significant in conferring LC risk, when considered in conjunction with the other LC AOP SNPs that appear in the most fit models.

Another SNP that is highly prevalent in the most fit models is X₁₉ (rs4410790; within the AHR gene). When taken into consideration with X₁₀ and X₇ (rs552976 and rs563694, respectively; both within ABCB11), the HIBACHI models suggest an epistatic interaction involving both AHR (the aryl hydrocarbon receptor; a transcription factor that plays a significant role in detecting and metabolizing xenobiotic chemicals in the liver) [14] and ABCB11 (which encodes the bile salt export pump protein, a key component in normal, healthy function of the liver) [34]. In each of the top 4 models, HIBACHI yields two motifs that are joined by the Boolean “AND” operation: one contains two SNPs within the ABCB11 gene, and the other contains at least one SNP from the AHR gene along with a variable number of other SNPs. Both motifs must therefore evaluate to “1” to result in an outcome of LC in the generative models. In other words, variation in multiple genes, drawn from different AOPs within the LC AOP network, is required to observe the disease phenotype.
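The two-motif structure described above can be illustrated with a purely hypothetical expression. The function below is not one of the models in Table 2: the second motif is reduced to a single AHR SNP as a stand-in, and the treatment of 0/1/2 genotype values as Boolean carrier status is an assumption, since the exact semantics of HIBACHI's logical primitives on ordinal values are not specified here.

```python
def toy_model(x7, x10, x19):
    """Hypothetical two-motif model: (X10 XOR X7) AND X19, on carrier status."""
    motif_abcb11 = bool(x10) != bool(x7)  # XOR of the two ABCB11 SNPs
    motif_ahr = bool(x19)                 # stand-in for the AHR-containing motif
    return int(motif_abcb11 and motif_ahr)


# Hypothetical genotype triples (X7, X10, X19) in the 0/1/2 encoding.
for genotype in [(0, 1, 2), (1, 1, 2), (0, 1, 0), (2, 0, 1)]:
    print(genotype, toy_model(*genotype))
```

Under these assumptions the model predicts LC only when exactly one of the two ABCB11 SNPs carries a minor allele and the AHR SNP also carries one, which is the sense in which the AND operation encodes a requirement for variation in both genes.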

Although these findings are not yet supported by robust experimental evidence, they are highly biologically plausible: AHR is a key player in the liver’s toxic response, and ABCB11 governs a central role of the liver; therefore, it would make sense that an interaction between both genes helps govern risk for toxicity-mediated liver outcomes. This association could be highly significant clinically, and merits further investigation, either by investigating larger sets of observational data, performing studies in animal models, or both.

Since population-specific prevalence of alleles can affect both the learning of the model (e.g., with populations that have high- or low-prevalence alleles acting as confounders) and the generalizability of results (the study having limited benefit for populations with a low presence of implicated alleles), it is critically important to inspect the prevalence across included populations when interpreting results [8,21]. The 3 SNPs we list above (rs4410790, rs552976, and rs563694) are generally consistent across populations in the 1000 Genomes Project Phase 3 dataset [4], with the exception of rs552976 in East Asian groups (overall variant allele frequency 0.748; East Asian VAF 0.993) and rs563694, also in East Asian groups (overall VAF 0.158; East Asian VAF 0.026). Therefore, these results may be of limited value to individuals from an East Asian background. However, since self-reported ethnicity was one of the confounders included in propensity score matching, the actual impact of this phenomenon on the HIBACHI model and the interpretation of our results should be minimal. We strongly encourage users of our methodology to carefully inspect population frequencies of alleles, particularly for SNPs that are present in the learned generative model. Whenever possible, users should also choose diverse datasets that are well-annotated with patient demographics and supported by rigorous previous analyses.

4.2. Adjusting cohorts using propensity score matching

Any statistical model learned on observational data is at risk of becoming biased due to the presence of confounding variables. We used propensity score matching, a well-established technique in health data research, to select case and control cohorts with similar (almost identical) distributions of a number of important covariates, and we show in the process that these covariates are significantly unbalanced before the PSM procedure is applied. This has two important consequences. First, there is minimal risk that the interacting SNPs discovered by HIBACHI reflect associations with the covariates rather than with the outcome of interest (in this case, LC). Second, the resulting models should generalize better across different clinical sub-populations. For example, since we included the Townsend deprivation score (a composite measure of material deprivation) and ethnicity in the set of PSM covariates, our case and control cohorts are more socioeconomically and racially balanced than they would be had we not matched on them. Historically, failure to do so has led to study results that do not generalize to underrepresented groups. Beyond these social justice implications, failure to adjust for these factors may lead to constructing a case cohort with less access to high-quality medical care, and therefore worse outcomes or clinical data quality.

4.3. Limitations

Our analysis has several limitations. First, the models reported by HIBACHI are not independent experiments, so some genetic motifs in the models may be artifacts of the evolutionary process rather than meaningful genetic interactions; future HIBACHI analyses will need to account for this. Second, validation experiments are needed to characterize the biological relationships between the SNPs more deeply. Finally, the underlying premise behind using HIBACHI to perform this analysis is that we hope to capture the ‘most fit’ models simply by random search followed by refinement using evolutionary algorithms, which can be considered a somewhat ‘brute force’ approach. A more computationally efficient approach would involve the use of symbolic regression instead of an evolutionary algorithm to explore the search space of all potential generative models. However, we consider our current approach to be both effective and appropriate for this new area of investigation, as the behavior of genetic phenomena (with possible hidden interactions between features) is poorly understood with respect to symbolic regression, and therefore symbolic regression algorithms may not be well-adapted to this task at the current time.

4.4. Future work

We want to explore how HIBACHI performs for other adverse outcomes of interest. This is the first time HIBACHI has been used to interpret biological relationships from its learned models, and this type of analysis needs to be repeated in other scenarios to fully characterize its ability to recognize meaningful biological relationships. The adverse outcome of interest for our future analysis is cardiovascular disease, because the heart and blood vessels are notably affected by COVID-19, the disease that has caused a global pandemic for over two years. A list of cardiovascular AOPs and SNPs has been established by querying the AOP-DB, using the same process described in the genotype section of the methods, for use with HIBACHI in a future study. Future analyses should also apply the phenotypic data gathering process to different populations. UKBB was our first population of analysis, but we want to apply HIBACHI to other populations as well, to explore how robust the genetic relationships are across patient populations. Another step for future studies is to examine the individual contribution of each SNP to the risk of liver cancer through statistical and experimental methods. One example would be to leverage CRISPR-Cas9 nickase technology to selectively edit candidate single nucleotides in cell culture [3] and evaluate the impact on downstream measurable key events that are predicted to be modulated. This method could therefore quantify the impact of SNPs identified by HIBACHI and provide validation for these computational predictions.

5. Conclusions

In this study, we show that genetic programming and graph data science can be leveraged to uncover patterns of genetic regulation in adverse outcome pathways using real-world observational data. Our approach provides one of the first concrete examples of using HIBACHI (an open-source software tool originally designed to create synthetic datasets with interactions between features) on a task that increases our understanding of biological phenomena. We describe a novel association between variants in the AHR and ABCB11 genes that, when occurring simultaneously, seem to confer increased risk for liver cancer. As a side effect, we also provide a concrete example of using HIBACHI to generate synthetic versions of genetic data, which enables the sharing of genetic data without risks to patient privacy. Furthermore, the technique we use for balancing data with respect to a score of socioeconomic deprivation provides a means for improving social justice in epidemiological analyses of environmental health. We feel that this study represents the first in a larger body of work exploring how genetic programming can be used to improve our understanding of the genetic mechanisms underlying disease, as well as clinical phenomena resulting from toxic exposures.

Supplementary Material

Supplementary data 1

NIHMS1933008-supplement-Supplementary_data_1.docx (960.1 KB, docx)

Acknowledgements

This work is supported by the Environmental Protection Agency’s National Research Program in Chemical Safety and Sustainability, Adverse Outcome Pathway Discovery and Development (FY22 CSS AOPDD 4.3.2.2). This research has been conducted using data from UK Biobank, a major biomedical database. The work was additionally funded using grant support from the US National Institutes of Health: K99-LM013646 (PI: Romano), R01-AG066833, R01-LM010098, R01-LM013463 (PI: Moore), and P30-ES013508 (PI: Trevor Penning). We would like to thank Dr. Nisha Sipes and Dr. Brian Chorley for providing editorial review of the manuscript prior to submission.

Abbreviations:

AI: Artificial Intelligence
AOP: Adverse Outcome Pathway
AOP-DB: US Environmental Protection Agency’s Adverse Outcome Pathway Database
AOP-KB: Adverse Outcome Pathway Knowledge Base
COVID-19: Coronavirus Disease 2019
EAGMST: Extended Advisory Group on Molecular Screening and Toxicogenomics
EPA: US Environmental Protection Agency
eQTL: Expression Quantitative Trait Locus
GP: Genetic Programming
GTEx: Genotype-Tissue Expression project
GWAS: Genome-Wide Association Study
HIBACHI: Heuristic Identification of Biological Architectures for simulating Complex Hierarchical Interactions
KE: Key Event
LC: Liver Cancer
MIE: Molecular Initiating Event
OECD: Organisation for Economic Co-operation and Development
PSM: Propensity Score Matching
SNP: Single Nucleotide Polymorphism
UK: United Kingdom
UKBB: UK Biobank

Footnotes

CRediT authorship contribution statement

Joseph D. Romano: Visualization. Liang Mei: Visualization. Jonathan Senn: Visualization. Jason H. Moore: Software. Holly M. Mortensen: Conceptualized the study methods.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

EPA Disclaimer

This manuscript has been reviewed by the Center for Public Health and Environmental Assessment, United States Environmental Protection Agency and approved for publication. Approval does not signify that the contents necessarily reflect the views and policies of the Agency nor does mention of trade names or commercial products constitute endorsement or recommendation for use. The authors declare no conflict of interest.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.comtox.2023.100261.

Data availability

Data will be made available on request.

References

[1].Aittokallio T, Graph-based methods for analysing networks in cell biology, Briefings in Bioinformatics 7 (3) (2006) 243–255, 10.1093/bib/bbl022.
[2].Ankley GT, Bennett RS, Erickson RJ, Hoff DJ, Hornung MW, Johnson RD, Mount DR, Nichols JW, Russom CL, Schmieder PK, Serrano JA, Tietge JE, Villeneuve DL, Adverse outcome pathways: A conceptual framework to support ecotoxicology research and risk assessment, Environmental Toxicology and Chemistry 29 (3) (2010) 730–741, 10.1002/etc.34.
[3].Anzalone AV, Randolph PB, Davis JR, Sousa AA, Koblan LW, Levy JM, Chen PJ, Wilson C, Newby GA, Raguram A, Liu DR, Search-and-replace genome editing without double-strand breaks or donor DNA, Nature 576 (7785) (2019) 149–157, 10.1038/s41586-019-1711-4.
[4].Auton A, Abecasis GR, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, Clark AG, Donnelly P, Eichler EE, Flicek P, Gabriel SB, Gibbs RA, Green ED, Hurles ME, Knoppers BM, Korbel JO, Lander ES, Lee C, Lehrach H, Mardis ER, Marth GT, McVean GA, Nickerson DA, Schmidt JP, Sherry ST, Wang J, Wilson RK, Gibbs RA, Boerwinkle E, Doddapaneni H, Han Y. i., Korchina V, Kovar C, Lee S, Muzny D, Reid JG, Zhu Y, Wang J, Chang Y, Feng Q, Fang X, Guo X, Jian M, Jiang H, Jin X, Lan T, Li G, Li J, Li Y, Liu S, Liu X, Lu Y, Ma X, Tang M, Wang B.o., Wang G, Wu H, Wu R, Xu X, Yin Y.e., Zhang D, Zhang W, Zhao J, Zhao M, Zheng X, Lander ES, Altshuler DM, Gabriel SB, Gupta N, Gharani N, Toji LH, Gerry NP, Resch AM, Flicek P, Barker J, Clarke L, Gil L, Hunt SE, Kelman G, Kulesha E, Leinonen R, McLaren WM, Radhakrishnan R, Roa A, Smirnov D, Smith RE, Streeter I, Thormann A, Toneva I, Vaughan B, Zheng-Bradley X, Bentley DR, Grocock R, Humphray S, James T, Kingsbury Z, Lehrach H, Sudbrak R, Albrecht MW, Amstislavskiy VS, Borodina TA, Lienhard M, Mertes F, Sultan M, Timmermann B, Yaspo M-L, Mardis ER, Wilson RK, Fulton L, Fulton R, Sherry ST, Ananiev V, Belaia Z, Beloslyudtsev D, Bouk N, Chen C, Church D, Cohen R, Cook C, Garner J, Hefferon T, Kimelman M, Liu C, Lopez J, Meric P, O’Sullivan C, Ostapchuk Y, Phan L, Ponomarov S, Schneider V, Shekhtman E, Sirotkin K, Slotta D, Zhang H, McVean GA, Durbin RM, Balasubramaniam S, Burton J, Danecek P, Keane TM, Kolb-Kokocinski A, McCarthy S, Stalker J, Quail M, Schmidt JP, Davies CJ, Gollub J, Webster T, Wong B, Zhan Y, Auton A, Campbell CL, Kong Y.u., Marcketta A, Gibbs RA, Yu F, Antunes L, Bainbridge M, Muzny D, Sabo A, Huang Z, Wang J, Coin LJM, Fang L, Guo X, Jin X, Li G, Li Q, Li Y, Li Z, Lin H, Liu B, Luo R, Shao H, Xie Y, Ye C, Yu C, Zhang F, Zheng H, Zhu H, Alkan C, Dal E, Kahveci F, Marth GT, Garrison EP, Kural D, Lee W-P, Fung Leong W, Stromberg M, Ward AN, Wu J, Zhang M, Daly MJ, DePristo MA, Handsaker RE, Altshuler DM, 
Banks E, Bhatia G, del Angel G, Gabriel SB, Genovese G, Gupta N, Li H, Kashin S, Lander ES, McCarroll SA, Nemesh JC, Poplin RE, Yoon SC, Lihm J, Makarov V, Clark AG, Gottipati S, Keinan A, Rodriguez-Flores JL, Korbel JO, Rausch T, Fritz MH, Stütz AM, Flicek P, Beal K, Clarke L, Datta A, Herrero J, McLaren WM, Ritchie GRS, Smith RE, Zerbino D, Zheng-Bradley X, Sabeti PC, Shlyakhter I, Schaffner SF, Vitti J, Cooper DN, Ball EV, Stenson PD, Bentley DR, Barnes B, Bauer M, Keira Cheetham R, Cox A, Eberle M, Humphray S, Kahn S, Murray L, Peden J, Shaw R, Kenny EE, Batzer MA, Konkel MK, Walker JA, MacArthur DG, Lek M, Sudbrak R, Amstislavskiy VS, Herwig R, Mardis ER, Ding L. i., Koboldt DC, Larson D, Ye K, Gravel S, Swaroop A, Chew E, Lappalainen T, Erlich Y, Gymrek M, Frederick Willems T, Simpson JT, Shriver MD, Rosenfeld JA, Bustamante CD, Montgomery SB, De La Vega FM, Byrnes JK, Carroll AW, DeGorter MK, Lacroute P, Maples BK, Martin AR, Moreno-Estrada A, Shringarpure SS, Zakharia F, Halperin E, Baran Y, Lee C, Cerveira E, Hwang J, Malhotra A, Plewczynski D, Radew K, Romanovitch M, Zhang C, Hyland FCL, Craig DW, Christoforides A, Homer N, Izatt T, Kurdoglu AA, Sinari SA, Squire K, Sherry ST, Xiao C, Sebat J, Antaki D, Gujral M, Noor A, Ye K, Burchard EG, Hernandez RD, Gignoux CR, Haussler D, Katzman SJ, James Kent W, Howie B, Ruiz-Linares A, Dermitzakis ET, Devine SE, Abecasis GR, Min Kang H, Kidd JM, Blackwell T, Caron S, Chen W, Emery S, Fritsche L, Fuchsberger C, Jun G, Li B, Lyons R, Scheller C, Sidore C, Song S, Sliwerska E, Taliun D, Tan A, Welch R, Kate Wing M, Zhan X, Awadalla P, Hodgkinson A, Li Y, Shi X, Quitadamo A, Lunter G, McVean GA, Marchini JL, Myers S, Churchhouse C, Delaneau O, Gupta-Hinch A, Kretzschmar W, Iqbal Z, Mathieson I, Menelaou A, Rimmer A, Xifara DK, Oleksyk TK, Fu Y, Liu X, Xiong M, Jorde L, Witherspoon D, Xing J, Eichler EE, Browning BL, Browning SR, Hormozdiari F, Sudmant PH, Khurana E, Durbin RM, Hurles ME, Tyler-Smith C, Albers CA, Ayub 
Q, Balasubramaniam S, Chen Y, Colonna V, Danecek P, Jostins L, Keane TM, McCarthy S, Walter K, Xue Y, Gerstein MB, Abyzov A, Balasubramanian S, Chen J, Clarke D, Fu Y, Harmanci AO, Jin M, Lee D, Liu J, Jasmine Mu X, Zhang J, Zhang Y, Li Y, Luo R, Zhu H, Alkan C, Dal E, Kahveci F, Marth GT, Garrison EP, Kural D, Lee W-P, Ward AN, Wu J, Zhang M, McCarroll SA, Handsaker RE, Altshuler DM, Banks E, del Angel G, Genovese G, Hartl C, Li H, Kashin S, Nemesh JC, Shakir K, Yoon SC, Lihm J, Makarov V, Degenhardt J, Korbel JO, Fritz MH, Meiers S, Raeder B, Rausch T, Stütz AM, Flicek P, Paolo Casale F, Clarke L, Smith RE, Stegle O, Zheng-Bradley X, Bentley DR, Barnes B, Keira Cheetham R, Eberle M, Humphray S, Kahn S, Murray L, Shaw R, Lameijer EW, Batzer MA, Konkel MK, Walker JA, Ding L.i., Hall I, Ye K, Lacroute P, Lee C, Cerveira E, Malhotra A, Hwang J, Plewczynski D, Radew K, Romanovitch M, Zhang C, Craig DW, Homer N, Church D, Xiao C, Sebat J, Antaki D, Bafna V, Michaelson J, Ye K, Devine SE, Gardner EJ, Abecasis GR, Kidd JM, Mills RE, Dayama G, Emery S, Jun G, Shi X, Quitadamo A, Lunter G, McVean GA, Chen K, Fan X, Chong Z, Chen T, Witherspoon D, Xing J, Eichler EE, Chaisson MJ, Hormozdiari F, Huddleston J, Malig M, Nelson BJ, Sudmant PH, Parrish NF, Khurana E, Hurles ME, Blackburne B, Lindsay SJ, Ning Z, Walter K, Zhang Y, Gerstein MB, Abyzov A, Chen J, Clarke D, Lam H, Jasmine Mu X, Sisu C, Zhang J, Zhang Y, Gibbs RA, Yu F, Bainbridge M, Challis D, Evani US, Kovar C, Lu J, Muzny D, Nagaswamy U, Reid JG, Sabo A, Yu J, Guo X, Li W, Li Y, Wu R, Marth GT, Garrison EP, Fung Leong W, Ward AN, del Angel G, DePristo MA, Gabriel SB, Gupta N, Hartl C, Poplin RE, Clark AG, Rodriguez-Flores JL, Flicek P, Clarke L, Smith RE, Zheng-Bradley X, MacArthur DG, Mardis ER, Fulton R, Koboldt DC, Gravel S, Bustamante CD, Craig DW, Christoforides A, Homer N, Izatt T, Sherry ST, Xiao C, Dermitzakis ET, Abecasis GR, Min Kang H, McVean GA, Gerstein MB, Balasubramanian S, Habegger L, Yu H, Flicek 
P, Clarke L, Cunningham F, Dunham I, Zerbino D, Zheng-Bradley X, Lage K, Berg Jespersen J, Horn H, Montgomery SB, DeGorter MK, Khurana E, Tyler-Smith C, Chen Y, Colonna V, Xue Y, Gerstein MB, Balasubramanian S, Fu Y, Kim D, Auton A, Marcketta A, Desalle R, Narechania A, Wilson Sayres MA, Garrison EP, Handsaker RE, Kashin S, McCarroll SA, Rodriguez-Flores JL, Flicek P, Clarke L, Zheng-Bradley X, Erlich Y, Gymrek M, Frederick Willems T, Bustamante CD, Mendez FL, David Poznik G, Underhill PA, Lee C, Cerveira E, Malhotra A, Romanovitch M, Zhang C, Abecasis GR, Coin L, Shao H, Mittelman D, Tyler-Smith C, Ayub Q, Banerjee R, Cerezo M, Chen Y, Fitzgerald TW, Louzada S, Massaia A, McCarthy S, Ritchie GR, Xue Y, Yang F, Gibbs RA, Kovar C, Kalra D, Hale W, Muzny D, Reid JG, Wang J, Dan X.u., Guo X, Li G, Li Y, Ye C, Zheng X, Altshuler DM, Flicek P, Clarke L, Zheng-Bradley X, Bentley DR, Cox A, Humphray S, Kahn S, Sudbrak R, Albrecht MW, Lienhard M, Larson D, Craig DW, Izatt T, Kurdoglu AA, Sherry ST, Xiao C, Haussler D, Abecasis GR, McVean GA, Durbin RM, Balasubramaniam S, Keane TM, McCarthy S, Stalker J, Chakravarti A, Knoppers BM, Abecasis GR, Barnes KC, Beiswanger C, Burchard EG, Bustamante CD, Cai H, Cao H, Durbin RM, Gerry NP, Gharani N, Gibbs RA, Gignoux CR, Gravel S, Henn B, Jones D, Jorde L, Kaye JS, Keinan A, Kent A, Kerasidou A, Li Y, Mathias R, McVean GA, Moreno-Estrada A, Ossorio PN, Parker M, Resch AM, Rotimi CN, Royal CD, Sandoval K, Su Y, Sudbrak R, Tian Z, Tishkoff S, Toji LH, Tyler-Smith C, Via M, Wang Y, Yang H, Yang L, Zhu J, Bodmer W, Bedoya G, Ruiz-Linares A, Cai Z, Gao Y, Chu J, Peltonen L, Garcia-Montero A, Orfao A, Dutil J, Martinez-Cruzado JC, Oleksyk TK, Barnes KC, Mathias RA, Hennis A, Watson H, McKenzie C, Qadri F, LaRocque R, Sabeti PC, Zhu J, Deng X, Sabeti PC, Asogun D, Folarin O, Happi C, Omoniwa O, Stremlau M, Tariyal R, Jallow M, Sisay Joof F, Corrah T, Rockett K, Kwiatkowski D, Kooner J, Hiê’n Trâ’n Tịnh, Dunstan SJ, Thuy Hang N, Fonnie R, 
Garry R, Kanneh L, Moses L, Sabeti PC, Schieffelin J, Grant DS, Gallo C, Poletti G, Saleheen D, Rasheed A, Brooks LD, Felsenfeld AL, McEwen JE, Vaydylevich Y, Green ED, Duncanson A, Dunn M, Schloss JA, Wang J, Yang H, Auton A, Brooks LD, Durbin RM, Garrison EP, Min Kang H, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR, A global reference for human genetic variation, Nature 526 (7571) (2015) 68–74.
[5].Benedetto U, Head SJ, Angelini GD, Blackstone EH, Statistical primer: Propensity score matching and its alternatives, European Journal of Cardio-Thoracic Surgery: Official Journal of the European Association for Cardio-Thoracic Surgery 53 (6) (2018) 1112–1117, 10.1093/ejcts/ezy167.
[6].Beyer H-G, Schwefel H-P, Evolution strategies – A comprehensive introduction, Natural Computing 1 (1) (2002) 3–52, 10.1023/A:1015059928466.
[7].Bollobás B, Modern Graph Theory, Springer, 1998.
[8].Clayton DG, Walker NM, Smyth DJ, Pask R, Cooper JD, Maier LM, Smink LJ, Lam AC, Ovington NR, Stevens HE, Nutland S, Howson JMM, Faham M, Moorhead M, Jones HB, Falkowski M, Hardenbol P, Willis TD, Todd JA, Population structure, differential bias and genomic control in a large-scale, case-control association study, Nature Genetics 37 (11) (2005) 1243–1246.
[9].Fortunato S, Community detection in graphs, Physics Reports 486 (3–5) (2010) 75–174, 10.1016/j.physrep.2009.11.002.
[10].François P, Siggia ED, A case study of evolutionary computation of biochemical adaptation, Physical Biology 5 (2) (2008) 026009, 10.1088/1478-3975/5/2/026009.
[11].Fraser AS, Monte Carlo analyses of genetic models, Nature 181 (4603) (1958) 208–209, 10.1038/181208a0.
[12].Gallo G, Pallottino S, Shortest path algorithms, Annals of Operations Research 13 (1) (1988) 1–79, 10.1007/BF02288320.
[13].GTEx Consortium, Gamazon ER, Segrè AV, van de Bunt M, Wen X, Xi HS, Hormozdiari F, Ongen H, Konkashbaev A, Derks EM, Aguet F, Quan J, Nicolae DL, Eskin E, Kellis M, Getz G, McCarthy MI, Dermitzakis ET, Cox NJ, Ardlie KG, Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation, Nature Genetics 50 (7) (2018) 956–967, 10.1038/s41588-018-0154-4.
[14].Hankinson O, The aryl hydrocarbon receptor complex, Annual Review of Pharmacology and Toxicology 35 (1995) 307–340, 10.1146/annurev.pa.35.040195.001515.
[15].Huber W, Carey VJ, Long L, Falcon S, Gentleman R, Graphs in molecular biology, BMC Bioinformatics 8 (S6) (2007) S8, 10.1186/1471-2105-8-S6-S8.
[16].MacFarlane AGJ, Jamshidi M.o., Tools for intelligent control: Fuzzy controllers, neural networks and genetic algorithms, Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 361 (1809) (2003) 1781–1808.
[17].Kavlock RJ, Ankley G, Blancato J, Breen M, Conolly R, Dix D, Houck K, Hubal E, Judson R, Rabinowitz J, Richard A, Setzer RW, Shah I, Villeneuve D, Weber E, Computational Toxicology – A State of the Science Mini Review, Toxicological Sciences 103 (1) (2008) 14–27, 10.1093/toxsci/kfm297.
[18].La Cava William, Orzechowski Patryk, Burlacu Bogdan, de Franca Fabricio Olivetti, Virgolin Marco, Ying Jin, Kommenda Michael, & Moore Jason H. (2021, June 6). Contemporary Symbolic Regression Methods and their Relative Performance. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). https://openreview.net/forum?id=xVQMrDLyGst.
[19].Li X, Wang X, Gao P, Diabetes Mellitus and Risk of Hepatocellular Carcinoma, BioMed Research International 2017 (2017) 5202684, 10.1155/2017/5202684.
[20].Machiela MJ, Chanock SJ, LDlink: A web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants, Bioinformatics (Oxford, England) 31 (21) (2015) 3555–3557, 10.1093/bioinformatics/btv402.
[21].Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A, A Survey on Bias and Fairness in Machine Learning, ACM Computing Surveys 54 (6) (2022) 1–35.
[22].Miikkulainen R, Forrest S, A biological perspective on evolutionary computation, Nature Machine Intelligence 3 (1) (2021) 9–15, 10.1038/s42256-020-00278-8.
[23].Miller JF, Cartesian Genetic Programming, in: Miller JF (Ed.), Cartesian Genetic Programming, Springer, Berlin Heidelberg, 2011, pp. 17–34, 10.1007/978-3-642-17310-3_2.
[24].Moore JH, Olson RS, Schmitt P, Chen Y, & Manduchi E (2018). How computational thought experiments can improve our understanding of the genetic architecture of common human diseases. The 2018 Conference on Artificial Life, 23–30. 10.1162/isal_a_00012.
[25].Moore JH, Shestov M, Schmitt P, Olson RS, A heuristic method for simulating open-data of arbitrary complexity that can be used to compare and evaluate machine learning methods, Pacific Symposium on Biocomputing 23 (2018) 259–267.
[26].Mortensen HM, Senn J, Levey T, Langley P, Williams AJ, The 2021 update of the EPA’s adverse outcome pathway database, Scientific Data 8 (1) (2021) 169, 10.1038/s41597-021-00962-3.
[27].Needham M, Hodler AE, Graph Algorithms: Practical Examples in Apache Spark and Neo4j, 1st Ed., O’Reilly Media, 2019.
[28].Olson RS, Urbanowicz RJ, Andrews PC, Lavender NA, Kidd LC, Moore JH, Automating Biomedical Data Science Through Tree-Based Pipeline Optimization, in: Squillero G, Burelli P (Eds.), Applications of Evolutionary Computation, Springer International Publishing, 2016, pp. 123–137, 10.1007/978-3-319-31204-0_9.
[29].Palmer LJ, UK Biobank: Bank on it, Lancet (London, England) 369 (9578) (2007) 1980–1982, 10.1016/S0140-6736(07)60924-6.
[30].Pedersen JM, Matsson P, Bergström CAS, Hoogstraate J, Norén A, LeCluyse EL, Artursson P, Early identification of clinically relevant drug interactions with the human bile salt export pump (BSEP/ABCB11), Toxicological Sciences: An Official Journal of the Society of Toxicology 136 (2) (2013) 328–343, 10.1093/toxsci/kft197.
[31].Romano JD, Hao Y, Moore JH, Penning TM, Automating Predictive Toxicology Using ComptoxAI, Chemical Research in Toxicology 35 (8) (2022) 1370–1382, 10.1021/acs.chemrestox.2c00074.
[32].Sarkar IN, Butte AJ, Lussier YA, Tarczy-Hornoch P, Ohno-Machado L, Translational bioinformatics: Linking knowledge across biological and clinical realms, Journal of the American Medical Informatics Association 18 (4) (2011) 354–357, 10.1136/amiajnl-2011-000245.
[33].Slatkin M, Linkage disequilibrium – Understanding the evolutionary past and mapping the medical future, Nature Reviews Genetics 9 (6) (2008) 477–485, 10.1038/nrg2361.
[34].Stieger B, Meier Y, Meier PJ, The bile salt export pump, Pflügers Archiv - European Journal of Physiology 453 (5) (2007) 611–620, 10.1007/s00424-006-0152-8.
[35].Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green J, Landray M, Liu B, Matthews P, Ong G, Pell J, Silman A, Young A, Sprosen T, Peakman T, Collins R, UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age, PLOS Medicine 12 (3) (2015) e1001779.
[36].Tetko IV, Klambauer G, Clevert D-A, Shah I, Benfenati E, Artificial Intelligence Meets Toxicology, Chemical Research in Toxicology 35 (8) (2022) 1289–1290, 10.1021/acs.chemrestox.2c00196.
[37].The PRACTICAL consortium, Law, Timofeeva M, Fernandez-Rozadilla C, Broderick P, Studd J, Fernandez-Tajes J, Farrington S, Svinti V, Palles C, Orlando G, Sud A, Holroyd A, Penegar S, Theodoratou E, Vaughan-Shaw P, Campbell H, Zgaga L, Hayward C, Campbell A, Harris S, Deary IJ, Starr J, Gatcombe L, Pinna M, Briggs S, Martin L, Jaeger E, Sharma-Oates A, East J, Leedham S, Arnold R, Johnstone E, Wang H, Kerr D, Kerr R, Maughan T, Kaplan R, Al-Tassan N, Palin K, Hänninen UA, Cajuso T, Tanskanen T, Kondelin J, Kaasinen E, Sarin A-P, Eriksson JG, Rissanen H, Knekt P, Pukkala E, Jousilahti P, Salomaa V, Ripatti S, Palotie A, Renkonen-Sinisalo L, p A, Böhm J, Mecklin J-P, Buchanan DD, Win A-K, Hopper J, Jenkins ME, Lindor NM, Newcomb PA, Gallinger S, Duggan D, Casey G, Hoffmann P, Ntöhen MM, Jöckel K-H, Easton DF, Pharoah PDP, Peto J, Canzian F, Swerdlow A, Eeles RA, Kote-Jarai Z, Muir K, Pashayan N, Henderson BE, Haiman CA, Schumacher FR, Al Olama AA, Benlloch S, Berndt SI, Conti DV, Wiklund F, Chanock S, Gapstur S, Stevens VL, Tangen CM, Batra J, Clements J, Gronberg H, Schleutker J, Albanes D, Wolk A, West C, Mucci L, Cancel-Tassin G, Koutros S, Sorensen KD, Grindedal EM, Neal DE, Hamdy FC, Donovan JL, Travis RC, Hamilton RJ, Ingles SA, Rosenstein BS, Lu Y-J, Giles GG, Kibel AS, Vega A, Kogevinas M, Penney KL, Park JY, Stanford JL, Cybulski C, Nordestgaard BG, Maier C, Kim J, John EM, Teixeira MR, Neuhausen SL, De Ruyck K, Razack A, Newcomb LF, Gamulin M, Kaneva R, Usmani N, Claessens F, Townsend PA, Gago-Dominguez M, Roobol MJ, Menegaux F, Khaw K-T, Cannon-Albright L, Pandha H, Thibodeau SN, Harkin A, Allan K, McQueen J, Paul J, Iveson T, Saunders M, Butterbach K, Chang-Claude J, Hoffmeister M, Brenner H, Kirac I, Matošević P, Hofer P, Brezina S, Gsur A, Cheadle JP, Aaltonen LA, Tomlinson I, Houlston RS, Dunlop MG, Association analyses identify 31 new risk loci for colorectal cancer susceptibility, Nature Communications 10 (1) (2019), 10.1038/s41467-019-09775-w. 
[38].Thomas RS, Bahadori T, Buckley TJ, Cowden J, Deisenroth C, Dionisio KL, Frithsen JB, Grulke CM, Gwinn MR, Harrill JA, Higuchi M, Houck KA, Hughes MF, Hunter ES, Isaacs KK, Judson RS, Knudsen TB, Lambert JC, Linnenbrink M, Martin TM, Newton SR, Padilla S, Patlewicz G, Paul-Friedman K, Phillips KA, Richard AM, Sams R, Shafer TJ, Setzer RW, Shah I, Simmons JE, Simmons SO, Singh A, Sobus JR, Strynar M, Swank A, Tornero-Valez R, Ulrich EM, Villeneuve DL, Wambaugh JF, Wetmore BA, Williams AJ, The Next Generation Blueprint of Computational Toxicology at the U.S. Environmental Protection Agency, Toxicological Sciences: An Official Journal of the Society of Toxicology 169 (2) (2019) 317–332.
[39].Townsend P, Deprivation, Journal of Social Policy 16 (2) (1987) 125–146, 10.1017/S0047279400020341.
[40].Urbanowicz RJ, Barney N, White BC, & Moore JH (2008). Mask functions for the symbolic modeling of epistasis using genetic programming. Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation - GECCO ‘08, 339. 10.1145/1389095.1389154.
[41].van den Heuvel MP, Sporns O, Network hubs in the human brain, Trends in Cognitive Sciences 17 (12) (2013) 683–696, 10.1016/j.tics.2013.09.012.


[R1] [1].Aittokallio T, Graph-based methods for analysing networks in cell biology, Briefings in Bioinformatics 7 (3) (2006) 243–255, 10.1093/bib/bbl022. [DOI] [PubMed] [Google Scholar]

[R2] [2].Ankley GT, Bennett RS, Erickson RJ, Hoff DJ, Hornung MW, Johnson RD, Mount DR, Nichols JW, Russom CL, Schmieder PK, Serrrano JA, Tietge JE, Villeneuve DL, Adverse outcome pathways: A conceptual framework to support ecotoxicology research and risk assessment, Environmental Toxicology and Chemistry 29 (3) (2010) 730–741, 10.1002/etc.34. [DOI] [PubMed] [Google Scholar]

[R3] [3].Anzalone AV, Randolph PB, Davis JR, Sousa AA, Koblan LW, Levy JM, Chen PJ, Wilson C, Newby GA, Raguram A, Liu DR, Search-and-replace genome editing without double-strand breaks or donor DNA, Nature 576 (7785) (2019) 149–157, 10.1038/s41586-019-1711-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] [5].Benedetto U, Head SJ, Angelini GD, Blackstone EH, Statistical primer: Propensity score matching and its alternatives, European Journal of Cardio-Thoracic Surgery: Official Journal of the European Association for Cardio-Thoracic Surgery 53 (6) (2018) 1112–1117, 10.1093/ejcts/ezy167. [DOI] [PubMed] [Google Scholar]

[R6] [6].Beyer H-G, Schwefel H-P, Evolution strategies—A comprehensive introduction, Natural Computing 1 (1) (2002) 3–52, 10.1023/A:1015059928466. [DOI] [Google Scholar]

[R7] [7].Bollobás B, Modern Graph Theory, Springer, 1998. [Google Scholar]

[R8] [8].Clayton DG, Walker NM, Smyth DJ, Pask R, Cooper JD, Maier LM, Smink LJ, Lam AC, Ovington NR, Stevens HE, Nutland S, Howson JMM, Faham M, Moorhead M, Jones HB, Falkowski M, Hardenbol P, Willis TD, Todd JA, Population structure, differential bias and genomic control in a large-scale, case-control association study, Nature Genetics 37 (11) (2005) 1243–1246. [DOI] [PubMed] [Google Scholar]

[R9] [9].Fortunato S, Community detection in graphs, Physics Reports 486 (3–5) (2010) 75–174, 10.1016/j.physrep.2009.11.002. [DOI] [Google Scholar]

[R10] [10].François P, Siggia ED, A case study of evolutionary computation of biochemical adaptation, Physical Biology 5 (2) (2008), 026009, 10.1088/1478-3975/5/2/026009. [DOI] [PubMed] [Google Scholar]

[R11] [11].Fraser AS, Monte Carlo analyses of genetic models, Nature 181 (4603) (1958) 208–209, 10.1038/181208a0. [DOI] [PubMed] [Google Scholar]

[R12] [12].Gallo G, Pallottino S, Shortest path algorithms, Annals of Operations Research 13 (1) (1988) 1–79, 10.1007/BF02288320. [DOI] [Google Scholar]

[R13] [13].GTEx Consortium, Gamazon ER, Segrè AV, van de Bunt M, Wen X, Xi HS, Hormozdiari F, Ongen H, Konkashbaev A, Derks EM, Aguet F, Quan J, Nicolae DL, Eskin E, Kellis M, Getz G, McCarthy MI, Dermitzakis ET, Cox NJ, Ardlie KG, Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation, Nature Genetics 50 (7) (2018) 956–967, 10.1038/s41588-018-0154-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] [14].Hankinson O, The aryl hydrocarbon receptor complex, Annual Review of Pharmacology and Toxicology 35 (1995) 307–340, 10.1146/annurev.pa.35.040195.001515. [DOI] [PubMed] [Google Scholar]

[R15] [15].Huber W, Carey VJ, Long L, Falcon S, Gentleman R, Graphs in molecular biology, BMC Bioinformatics 8 (S6) (2007) S8, 10.1186/1471-2105-8-S6-S8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] [16].MacFarlane AGJ, Jamshidi M.o., Tools for intelligent control: Fuzzy controllers, neural networks and genetic algorithms, Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 361 (1809) (2003) 1781–1808. [DOI] [PubMed] [Google Scholar]

[R17] [17].Kavlock RJ, Ankley G, Blancato J, Breen M, Conolly R, Dix D, Houck K, Hubal E, Judson R, Rabinowitz J, Richard A, Setzer RW, Shah I, Villeneuve D, Weber E, Computational Toxicology—A State of the Science Mini Review, Toxicological Sciences 103 (1) (2008) 14–27, 10.1093/toxsci/kfm297. [DOI] [PubMed] [Google Scholar]

[R18] [18].La Cava William, Orzechowski Patryk, Burlacu Bogdan, de Franca Fabricio Olivetti, Virgolin Marco, Ying Jin, Kommenda Michael, & Moore Jason H. (2021, June 6). Contemporary Symbolic Regression Methods and their Relative Performance. NeurIPS 2021 Track Datasets and Benchmarks (Round 1). Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). https://openreview.net/forum?id=xVQMrDLyGst. [PMC free article] [PubMed] [Google Scholar]

[R19] [19].Li X, Wang X, Gao P, Diabetes Mellitus and Risk of Hepatocellular Carcinoma, BioMed Research International 2017 (2017) 5202684, 10.1155/2017/5202684. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] [20].Machiela MJ, Chanock SJ, LDlink: A web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants, Bioinformatics (Oxford, England) 31 (21) (2015) 3555–3557, 10.1093/bioinformatics/btv402. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] [21].Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A, A Survey on Bias and Fairness in Machine Learning, ACM Computing Surveys 54 (6) (2022) 1–35. [Google Scholar]

[R22] [22].Miikkulainen R, Forrest S, A biological perspective on evolutionary computation, Nature Machine Intelligence 3 (1) (2021) 9–15, 10.1038/s42256-020-00278-8. [DOI] [Google Scholar]

[23] Miller JF, Cartesian Genetic Programming, in: Miller JF (Ed.), Cartesian Genetic Programming, Springer, Berlin Heidelberg, 2011, pp. 17–34, 10.1007/978-3-642-17310-3_2.

[24] Moore JH, Olson RS, Schmitt P, Chen Y, Manduchi E, How computational thought experiments can improve our understanding of the genetic architecture of common human diseases, The 2018 Conference on Artificial Life (2018) 23–30, 10.1162/isal_a_00012.

[25] Moore JH, Shestov M, Schmitt P, Olson RS, A heuristic method for simulating open-data of arbitrary complexity that can be used to compare and evaluate machine learning methods, Pacific Symposium on Biocomputing 23 (2018) 259–267.

[26] Mortensen HM, Senn J, Levey T, Langley P, Williams AJ, The 2021 update of the EPA’s adverse outcome pathway database, Scientific Data 8 (1) (2021) 169, 10.1038/s41597-021-00962-3.

[27] Needham M, Hodler AE, Graph Algorithms: Practical Examples in Apache Spark and Neo4j, 1st Ed., O’Reilly Media, 2019.

[28] Olson RS, Urbanowicz RJ, Andrews PC, Lavender NA, Kidd LC, Moore JH, Automating Biomedical Data Science Through Tree-Based Pipeline Optimization, in: Squillero G, Burelli P (Eds.), Applications of Evolutionary Computation, Springer International Publishing, 2016, pp. 123–137, 10.1007/978-3-319-31204-0_9.

[29] Palmer LJ, UK Biobank: Bank on it, Lancet (London, England) 369 (9578) (2007) 1980–1982, 10.1016/S0140-6736(07)60924-6.

[30] Pedersen JM, Matsson P, Bergström CAS, Hoogstraate J, Norén A, LeCluyse EL, Artursson P, Early identification of clinically relevant drug interactions with the human bile salt export pump (BSEP/ABCB11), Toxicological Sciences: An Official Journal of the Society of Toxicology 136 (2) (2013) 328–343, 10.1093/toxsci/kft197.

[31] Romano JD, Hao Y, Moore JH, Penning TM, Automating Predictive Toxicology Using ComptoxAI, Chemical Research in Toxicology 35 (8) (2022) 1370–1382, 10.1021/acs.chemrestox.2c00074.

[32] Sarkar IN, Butte AJ, Lussier YA, Tarczy-Hornoch P, Ohno-Machado L, Translational bioinformatics: Linking knowledge across biological and clinical realms, Journal of the American Medical Informatics Association 18 (4) (2011) 354–357, 10.1136/amiajnl-2011-000245.

[33] Slatkin M, Linkage disequilibrium—Understanding the evolutionary past and mapping the medical future, Nature Reviews Genetics 9 (6) (2008) 477–485, 10.1038/nrg2361.

[34] Stieger B, Meier Y, Meier PJ, The bile salt export pump, Pflügers Archiv - European Journal of Physiology 453 (5) (2007) 611–620, 10.1007/s00424-006-0152-8.

[35] Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green J, Landray M, Liu B, Matthews P, Ong G, Pell J, Silman A, Young A, Sprosen T, Peakman T, Collins R, UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age, PLOS Medicine 12 (3) (2015) e1001779.

[36] Tetko IV, Klambauer G, Clevert D-A, Shah I, Benfenati E, Artificial Intelligence Meets Toxicology, Chemical Research in Toxicology 35 (8) (2022) 1289–1290, 10.1021/acs.chemrestox.2c00196.

[38] Thomas RS, Bahadori T, Buckley TJ, Cowden J, Deisenroth C, Dionisio KL, Frithsen JB, Grulke CM, Gwinn MR, Harrill JA, Higuchi M, Houck KA, Hughes MF, Hunter ES, Isaacs KK, Judson RS, Knudsen TB, Lambert JC, Linnenbrink M, Martin TM, Newton SR, Padilla S, Patlewicz G, Paul-Friedman K, Phillips KA, Richard AM, Sams R, Shafer TJ, Setzer RW, Shah I, Simmons JE, Simmons SO, Singh A, Sobus JR, Strynar M, Swank A, Tornero-Velez R, Ulrich EM, Villeneuve DL, Wambaugh JF, Wetmore BA, Williams AJ, The Next Generation Blueprint of Computational Toxicology at the U.S. Environmental Protection Agency, Toxicological Sciences: An Official Journal of the Society of Toxicology 169 (2) (2019) 317–332.

[39] Townsend P, Deprivation, Journal of Social Policy 16 (2) (1987) 125–146, 10.1017/S0047279400020341.

[40] Urbanowicz RJ, Barney N, White BC, Moore JH, Mask functions for the symbolic modeling of epistasis using genetic programming, Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation - GECCO '08 (2008) 339, 10.1145/1389095.1389154.

[41] van den Heuvel MP, Sporns O, Network hubs in the human brain, Trends in Cognitive Sciences 17 (12) (2013) 683–696, 10.1016/j.tics.2013.09.012.
