Published in final edited form as: Comput Toxicol. 2024 Mar;29:1–14. doi: 10.1016/j.comtox.2024.100301

A Comparison of Machine Learning Approaches for Predicting Hepatotoxicity Potential Using Chemical Structure and Targeted Transcriptomic Data

Tia Tate 1, Grace Patlewicz 1, Imran Shah 1
PMCID: PMC11235188  NIHMSID: NIHMS1993325  PMID: 38993502

Abstract

Animal toxicity testing is time and resource intensive, making it difficult to keep pace with the number of substances requiring assessment. Machine learning (ML) models that use chemical structure information and high-throughput experimental data can be helpful in predicting potential toxicity. However, much of the toxicity data used to train ML models is biased, with an unequal balance of positives and negatives, primarily because substances selected for in vivo testing are expected to elicit some toxicity effect. To investigate the impact this bias had on predictive performance, various sampling approaches were used to balance in vivo toxicity data as part of a supervised ML workflow to predict hepatotoxicity outcomes from chemical structure and/or targeted transcriptomic data. From the chronic, subchronic, developmental, multigenerational reproductive, and subacute repeat-dose testing toxicity outcomes with a minimum of 50 positive and 50 negative substances, 18 different study-toxicity outcome combinations were evaluated in up to 7 ML models. These included Artificial Neural Network, Random Forest, Bernoulli Naïve Bayes, Gradient Boosting, and Support Vector classification algorithms, which were compared with a local approach, Generalised Read-Across (GenRA), a similarity-weighted k-Nearest Neighbour (k-NN) method. The mean CV F1 performance for unbalanced data across all classifiers and descriptors for chronic liver effects was 0.735 (0.0395 SD). Mean CV F1 performance dropped to 0.639 (0.073 SD) with over-sampling approaches, though the poorer performance of KNN approaches in some cases contributed to the observed decrease (mean CV F1 performance excluding KNN was 0.697 (0.072 SD)). With under-sampling approaches, the mean CV F1 was 0.523 (0.083 SD). For developmental liver effects, the mean CV F1 performance was much lower: 0.089 (0.111 SD) for unbalanced approaches and 0.149 (0.084 SD) for under-sampling. Over-sampling approaches led to an increase in mean CV F1 performance (0.234 (0.107 SD)) for developmental liver toxicity. Model performance was found to be dependent on dataset, model type, balancing approach and feature selection. Accordingly, tailoring ML workflows for predicting toxicity should consider class imbalance and rely on simpler classifiers first.

Keywords: Toxicity Reference Database (ToxRefDB), Generalised Read-across (GenRA), high throughput transcriptomics (HTTr), Machine Learning (ML)

1. Introduction

The Toxic Substances Control Act (TSCA) requires the US EPA to perform risk-based evaluations of existing chemicals [1]. If a chemical is on the TSCA chemical substance inventory (herein Inventory), then that substance is considered an “existing” chemical. The non-confidential portion of the Inventory as of February 2023 comprises 42,170 chemicals that are active in commerce, of which only a small proportion have been characterised with respect to their toxicological hazards. A need therefore exists to prioritise the remaining active chemicals for targeted testing and assessment. TSCA also specifies that an effort must be made to reduce testing in vertebrate animals, thereby encouraging the use of so-called New Approach Methods (NAMs) [2]. The term NAMs has emerged as a descriptive reference to any non-animal based approach (such as in vitro, in chemico, in silico) that can be used to provide information in the context of chemical hazard and risk assessment [2].

There is a growing awareness that NAMs can offer many opportunities for bridging the gap between chemical properties and the downstream biological consequences as part of an integrated approach to testing and assessment (IATA) [3]. High-throughput screening and high-content screening methodologies (HTS/HCS) [4], which incorporate targeted high-throughput transcriptomics data [5] and high-throughput phenotypic profiling (HTPP) [6], represent one type of NAM. Approaches such as (quantitative) structure-activity relationships ((Q)SARs) and read-across fall within the scope of in silico NAMs [7]. QSARs, i.e. classification and regression models that link molecular structural properties to physical, chemical, or biological traits (such as those generated from HTS/HCS assays), are increasingly being developed using an array of different machine learning techniques. A number of researchers have used a range of different machine learning (ML) and/or deep learning (DL) techniques to predict the toxicity of substances based on their chemical structural descriptors and/or bioactivity descriptors derived from various HTS testing datasets. Herein, we highlight several examples of efforts that have been undertaken to predict liver toxicity outcomes.

Liu et al. [8] predicted three hepatotoxicity categories: hypertrophy, injury, and proliferative lesions using bioactivity (bio), chemical structure (chm), and hybrid (chm-bio) descriptors in conjunction with several classifiers: Linear Discriminant Analysis (LDA), Naive Bayes (NB), Support Vector Machines (SVM), Classification and Regression Tree (CART), k-Nearest Neighbour (KNN), and an ensemble of all of them. Descriptors not evaluated in all the chemicals, or active in only 5% of all chemicals, were dropped from consideration. Under-sampling was performed to balance the datasets. Performance was evaluated using sensitivity, specificity and balanced accuracy (BA). Hybrid classifiers showed the best BA for all 3 categories. Models built on bioactivity features were more sensitive, whereas those built on chemical descriptors were more specific. CART, Ensemble, and SVM classifiers gave rise to the best performing models. Nuclear receptor activation and mitochondrial functions were often found to be highly predictive of hepatotoxicity.

Liu et al. [9] developed models to predict binary organ-level repeat dose toxicity outcomes using structural descriptors, biological HTS hitcalls (from ToxCast), or a combination of both. ML approaches investigated included NB, k-NN, Random Forests (RF) and Support Vector Classification (SVC). Descriptors were filtered using the ANOVA F-value and datasets were balanced randomly with fixed numbers of positives vs. negatives. Performance metrics evaluated with 5-fold CV were F1 score, sensitivity, specificity and accuracy. Combinations of chemical and biological features performed better at predicting a number of toxicity outcomes than chemical or biological features alone.

Deist et al. [10] investigated the utility of different ML and DL approaches to predict chemo-radiotherapy toxicity outcomes using clinical, dosimetric and blood descriptors. They found that RF and elastic net logistic regression showed the best overall discrimination across the datasets evaluated, based on the area under the curve (AUC) of the receiver operating characteristic (ROC).

He et al. [11] developed an ensemble model comprising eight different ML classifiers to predict drug-induced liver injury (DILI) using a dataset of 1254 chemicals that had been balanced using the Kennard-Stone algorithm. The best model attained a BA of 0.783, a sensitivity of 0.818, a specificity of 0.748 and an AUC-ROC of 0.859.

Ancuceanu et al. [12] aimed to predict liver toxicity as captured in the DILIrank dataset using Dragon 7 molecular descriptors together with a variety of feature selection and ML algorithms (including RF, decision trees, SVM). Of the 165 models developed, 79 models with reasonable performance (balanced accuracies greater than 70%) were identified and stacked using several approaches, including the building of multiple meta-models. The performance of the stacked models was slightly superior to that of the other models developed (BA of ~74% compared with ~72%).

Xu et al. [13] built prediction models for 14 human toxicity outcomes pertaining to the vascular, kidney, ureter and bladder, and liver organ systems using chemical structure and Tox21 in vitro quantitative high-throughput screening (qHTS) bioactivity assay data. Several supervised ML algorithms were applied (NB, SVM, RF, extreme gradient boosting and Neural Networks). Model performance was evaluated using the following metrics: AUC-ROC, BA, and the Matthews correlation coefficient (MCC). Feature selection was performed as a preprocessing step prior to modelling using three methods: Fisher’s exact test p-values, as well as importance scores from the RF and extreme gradient boosting algorithms. The top four models, with AUC-ROC values >0.8, were derived for endocrine (0.90 ± 0.00), musculoskeletal (0.88 ± 0.02), peripheral nerve and sensation (0.85 ± 0.01), and brain and coverings (0.83 ± 0.02) toxicities, whereas the best model AUC-ROC values were >0.7 for the remaining 10 toxicities. For the musculoskeletal endpoint, an investigation was performed to test whether data balancing approaches could further improve model performance. Four approaches were evaluated: down-sampling, up-sampling, Random Over-Sampling Examples (ROSE), and the Synthetic Minority Oversampling Technique (SMOTE). Note that down-sampling (or downsampling) is otherwise referred to as under-sampling, a technique that reduces the number of majority-class samples to balance the class labels. Similarly, up-sampling (upsampling), which creates artificial or duplicate data points of the minority class, is also known as over-sampling. Data balancing did not yield any model that outperformed the original model based on the unbalanced dataset. AUC-ROC values for the balancing approaches for the musculoskeletal endpoint ranged from 0.77 to 0.85, whereas the original model had an AUC-ROC of 0.88. Model performance was found to be dependent on the specific data set, model type, and feature selection method used.

In this study, targeted high-throughput transcriptomic (HTTr) descriptors, chemical structure descriptors, and a hybrid of both were used to assess the performance of several supervised classification models, including Generalised Read-Across (GenRA), an algorithmic read-across approach [14]; [15], for liver and selected organ-level toxicity outcomes. The performance of the models derived was compared using different balancing approaches and feature selection techniques to better understand their impact on performance metrics. Figure 1 provides a brief summary of the main components of the study.

Figure 1:

Workflow describing the main components of the study.

2. Methods

2.1. Data Sources

2.1.1. In Vivo Data

The Toxicity Reference DataBase (ToxRefDB) version 2.0 [16], accessible at ftp://newftp.epa.gov/comptox/High Throughput Screening Data/Animal Tox Data/current/, was used to retrieve in vivo animal toxicity data. ToxRefDB captures in vivo effect information from repeat-dose studies for several hundred chemicals across a wide range of species and target organs. A key refinement in version 2 of the database was the manner in which endpoints were annotated (e.g. required or not required) according to EPA’s Office of Chemical Safety and Pollution Prevention (OCSPP) and/or OECD guidelines for subacute (sac), subchronic (sub), chronic (chr), developmental (dev) and multigenerational reproductive (mgr) designs, as well as the ability to delineate negative responses from untested effects. Chemicals that produced adverse effects for an endpoint were labelled as positive (1), whereas those endpoints that were required and tested but resulted in no adverse effects were labelled as negative (0). In ToxRefDB v2, 935 compounds were identified with such positive or negative toxicity designations for 252 target organs and effects across 10 guideline repeat dose study categories, namely: chronic toxicity (chr), subchronic toxicity (sub), subacute toxicity (sac), developmental toxicity (dev), multigenerational reproductive toxicity (mgr), reproductive toxicity (rep), developmental neurotoxicity (dnt), acute toxicity (acu), and neurological toxicity (neu). Toxicity studies that did not follow one of the aforementioned study types were labelled ”other” (oth). Toxicity effects were grouped by organ and study type, from which endpoints could be selected that met the minimum criterion of having at least 50 negative and 50 positive chemicals. The toxicity data used were taken from the supplementary information provided in [17].
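As an illustration, this endpoint selection step can be expressed in a few lines of pandas. The long-format table and its column names below are assumptions made for the sketch, not the actual ToxRefDB schema:

```python
import pandas as pd

# tox is an assumed long-format table with one row per chemical-endpoint pair:
# columns: chemical, study_type, endpoint, outcome (1 = positive, 0 = negative)
counts = (tox.groupby(["study_type", "endpoint"])["outcome"]
             .agg(pos="sum", n="count")
             .assign(neg=lambda d: d["n"] - d["pos"]))

# keep only endpoints with at least 50 positive and 50 negative chemicals
selected = counts[(counts["pos"] >= 50) & (counts["neg"] >= 50)]
```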

2.1.2. Transcriptomic Hitcall Data

The transcriptomic data used in this study had been previously generated for 1,065 chemicals using the Life Technologies/Expression Analysis (LTEA) assay, as part of the ToxCast HTS data collection [4]; [18], and is described in detail by [19]. Briefly, the LTEA assay measured expression levels of 95 genes using quantitative reverse transcription-polymerase chain reaction (qRT-PCR) in HepaRG™ cells after they were treated with 8 concentrations of 1,065 compounds for 24 hours. The ToxCast analysis pipeline package (R/tcpl) was used to evaluate concentration-response data for each of the 95 transcripts as well as cytotoxicity (as measured by the lactate dehydrogenase (LDH) test) [20]. Curve-fitting was performed twice for each gene, once for up-regulation and once for down-regulation. After curve-fitting, the efficacy, potency (AC50), and hitcall were recorded in a MySQL database (invitroDB v3.00) (USEPA, 2018). In brief, data processed through the R/tcpl package are used to generate concentration-response curves and estimate potency and hitcall values. Each concentration-response data set is fit to constant, Hill, and gain-loss models. The curve with the lowest Akaike Information Criterion (AIC) is selected as the winner. If the winning curve is either a Hill or gain-loss model and the top of the curve exceeds a specified noise threshold, the hitcall for the curve is set to 1 for active and 0 for inactive. The resulting data are referred to as level 5 data per tcpl nomenclature. The hitcall for each chemical and transcript was assigned a binary active (1) or inactive (0) value based on tcpl level 5 data [19]. For the remainder of this manuscript, the LTEA hitcall data for the 1,065 chemicals and 95 transcripts are referred to as the transcriptomics data set (gene). The dataset was taken from the supplementary information reported in [17].
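The winner-take-all curve selection and hitcall logic described above can be summarised in a short sketch. This is a conceptual illustration only, not the tcpl implementation; the function and argument names are invented for clarity:

```python
def assign_hitcall(aic_by_model: dict, curve_top: float, noise_cutoff: float) -> int:
    """Pick the model with the lowest AIC; call a hit (1) only if the winning
    curve is Hill or gain-loss and its top exceeds the noise threshold."""
    # e.g. aic_by_model = {"constant": 101.2, "hill": 95.7, "gain_loss": 98.3}
    winner = min(aic_by_model, key=aic_by_model.get)
    if winner in ("hill", "gain_loss") and abs(curve_top) > noise_cutoff:
        return 1  # active
    return 0      # inactive
```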

2.1.3. Chemical Structure Descriptors

Chemical structural descriptors could be computed for 1,017 of the 1,065 chemicals with associated transcriptomics data. Descriptors computed included 2048 Morgan fingerprints (mrgn) [21], 2048 Topological Torsion fingerprints (tptr) [22], and 729 ToxPrint (toxp) [23] chemotypes. These descriptors were encoded as binary (bit) vectors, with 1 and 0 representing the presence or absence of each structural element, respectively. Morgan and hashed Topological Torsion fingerprints were generated using the python library RDKit [24], whereas ToxPrints were downloaded from the EPA CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) using the batch search [25].

The chemical fingerprints selected spanned three different types: circular, topological and dictionary-based. Morgan circular fingerprints, also known as extended-connectivity fingerprints, encode the presence of specific circular substructures around each atom in a molecule. The approach starts by assigning each atom an integer label (atomic number) and updating these labels by aggregating them with the labels of their immediate neighbouring atoms using a hash function. The process captures the neighbourhood of the atom and is repeated over many iterations. Once complete, each atom has an identifier that contains substructural information from all parts of the molecule. Topological Torsion fingerprints, on the other hand, aim to represent short-range information contained within the torsion angles of a molecule. ToxPrint chemotypes, a dictionary-based fingerprint, comprise a fixed set of substructural features that include atoms, functional bonds, chains, rings, ligands, and scaffolds.
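For readers wanting to reproduce the fingerprint generation, a minimal RDKit sketch is shown below. The example SMILES and the Morgan radius of 2 are assumptions (the study does not state the radius used); ToxPrints are not computed locally but downloaded from the Dashboard as noted above:

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.rdMolDescriptors import GetHashedTopologicalTorsionFingerprintAsBitVect

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an arbitrary example

# 2048-bit Morgan (circular) fingerprint; radius 2 is a common default choice
mrgn = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# 2048-bit hashed Topological Torsion fingerprint
tptr = GetHashedTopologicalTorsionFingerprintAsBitVect(mol, nBits=2048)
```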

2.1.4. Hybrid Descriptors

Three sets of hybrid descriptors were created for each chemical by combining transcriptomic and chemical structure information together. In the simplest case, the transcriptomic (gene) and Morgan chemical descriptors were combined to generate a hybrid descriptor denoted “cb”. A more complex hybrid comprising all chemical information generated from all three chemical structure descriptors (Morgan, Torsion Topological, ToxPrint chemotype) was used to form a hybrid termed “ca”. Finally, the in vitro hitcall (gene) descriptor was combined with the three chemical structure descriptors (ca) to produce the hybrid descriptor “cba”.
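Constructing the hybrids amounts to concatenating the per-chemical binary descriptor blocks. A sketch assuming aligned numpy arrays (one row per chemical; the variable names follow the abbreviations above) is:

```python
import numpy as np

# gene, mrgn, tptr, toxp: assumed binary matrices with one row per chemical,
# aligned on the same chemical ordering
cb = np.hstack([gene, mrgn])               # transcriptomic + Morgan
ca = np.hstack([mrgn, tptr, toxp])         # all three structural blocks
cba = np.hstack([gene, mrgn, tptr, toxp])  # transcriptomic + all structural
```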

2.2. Supervised Machine Learning

Supervised machine learning (ML) was used to classify the toxicity outcomes using the chemical, bioactivity, and hybrid descriptors generated. A variety of classification algorithms (with default hyperparameters) were employed (described in more detail in [26] and [27]), including Random Forest (RF), Bernoulli Naïve Bayes (NB), Gradient Boosting (GB), Support Vector machine classifier (SVC) with a radial basis kernel, K-Nearest Neighbours (KNN), Logistic regression (LR), Artificial Neural Networks (ANN), and the Generalised read-across (GenRA) approach.

Random Forest is an ensemble learning method that works by constructing a multitude of decision trees. For classification tasks, the output of the random forest is the class selected by most trees (majority vote). The NB algorithm is a probabilistic strategy based on Bayes’ theorem that assumes all attributes are independent. The GB classification approach consists of an ensemble of ”weaker” models that may be used to optimise classification tasks by learning these models and then combining them to create a stronger model. SVC techniques seek decision boundaries that maximise the margin between positive and negative classes of chemicals. The SVC was trained using a radial basis kernel. The KNN technique labels an observation with the label of its nearest neighbours and decides the class using a majority vote. The defaults within the sklearn package were used, where the number of neighbours was set to 5, the Minkowski metric was used as the distance metric and uniform weights were applied to all neighbours. LR is an extension of linear regression that models the probabilities of classification problems with binary outcomes. ANNs are a type of machine learning that uses specific input, output, and hidden layers to create probability-weighted associations between input and output layers. The Multi-layer Perceptron classifier, where the log-loss function was optimised using a stochastic gradient-based solver (the default setting), was used.
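Given the default hyperparameters described, the classifier battery can be instantiated as follows. This is a sketch; any settings beyond those stated above are the library defaults, not the paper's configuration:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

classifiers = {
    "RF": RandomForestClassifier(),
    "NB": BernoulliNB(),
    "GB": GradientBoostingClassifier(),
    "SVC": SVC(kernel="rbf"),                    # radial basis kernel
    "KNN": KNeighborsClassifier(n_neighbors=5),  # Minkowski metric, uniform weights (defaults)
    "LR": LogisticRegression(),
    "ANN": MLPClassifier(),                      # multi-layer perceptron, default solver
}
```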

2.2.1. Summary of the GenRA Approach

GenRA, first described in [14], is an algorithmic read-across approach that computes the similarity weighted activity of nearest neighbours. In [14], the baseline performance of read-across predictions was evaluated using binary in vivo toxicity outcomes in repeated dose toxicity studies as extracted from ToxRefDB v1. Read-across performance (as measured by AUC-ROC) was compared and contrasted using chemical descriptors in conjunction with ToxCast bioactivity hitcall [18] information. Subsequent work has aimed to refine the approach by investigating the impact that other contexts of similarity (such as physical properties [28]) play in improving read-across performance. A more recent GenRA study attempted to explore the impact of mechanistic information from targeted transcriptomic data relative to chemical structure information [17].

GenRA is available as a web-based tool (https://ccte-genra.epa.gov/genra/; [15]) and as a standalone python library, genra-py ([29]; https://pypi.org/project/genra/). In this study, genra-py was used to predict toxicity effects using the Jaccard similarity metric and a default of 10 nearest neighbours.
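Conceptually, the GenRA prediction is the similarity-weighted mean activity of the k most similar training chemicals. A minimal numpy sketch of this idea (not the genra-py implementation; function names are illustrative) is:

```python
import numpy as np

def jaccard_similarity(x, X):
    """Jaccard similarity between a binary fingerprint x and each row of X."""
    x, X = x.astype(bool), X.astype(bool)
    intersection = (X & x).sum(axis=1)
    union = (X | x).sum(axis=1)
    return intersection / np.maximum(union, 1)  # empty union -> similarity 0

def genra_predict(x, X_train, y_train, k=10):
    """Similarity-weighted activity of the k nearest neighbours of x."""
    s = jaccard_similarity(x, X_train)
    nn = np.argsort(s)[::-1][:k]  # indices of the k most similar chemicals
    w = s[nn]
    return np.dot(w, y_train[nn]) / w.sum() if w.sum() > 0 else y_train.mean()
```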

2.3. Creating Balanced Data

The ratios of positive and negative chemicals were highly skewed, with fewer chemicals producing adverse effects (positives) than those without adverse effects (negatives) (Figure 2). Two endpoints (chronic liver and developmental liver) were used as representative examples where the number of positives or the number of negatives was skewed. The imbalanced-learn (https://imbalanced-learn.org/stable/) python package was used to examine multiple under- and over-sampling balancing approaches to eliminate any bias towards the majority outcome. A number of techniques for under-sampling were attempted, including selecting examples to keep in the majority class (condensed nearest neighbours (CNN), near miss (NM)) and removing examples from the majority class (random, Tomek links (TK), edited nearest neighbours (ENN)). For over-sampling, the synthetic minority over-sampling technique (SMOTE) was used. CNN aims to produce a condensed version of the dataset that can still correctly classify all substances in the original dataset. It relies on a k-nearest neighbour rule to iteratively decide whether a substance should be removed or not. In brief, for each substance in the dataset, CNN checks its proximity to other substances. If the substance is too similar to another already chosen, it might be excluded. The process involves going through the dataset and iteratively selecting substances that best represent each class. The concept is to retain a ‘condensed’ version of the dataset whilst still capturing the diversity and essential characteristics of each class. NM under-samples the majority class based on the distance of its points to points in the minority class. One NM variant was used, NM-3, which selects samples of the majority class with the smallest average distances to the k(=3) closest instances of the minority class.

Figure 2:

Positive and negative chemical distribution across in vivo guideline toxicity testing studies and target endpoints. These bar graphs indicate the number of negative (neg, blue) and positive (pos, red) compounds found in chronic (CHR), subchronic (SUB), developmental (DEV), multigenerational (MGR), and subacute (SAC) studies from left to right. On the y-axis, the target organs/endpoints are identified, and the number of positive and negative chemicals is labelled on the x-axis.

The random under-sampling method involves selecting and deleting examples from the majority class at random until a more balanced distribution is achieved. TK is a modification of CNN that removes instances forming Tomek links, i.e. pairs of instances from opposite classes that are in close proximity to one another. ENN under-samples a class by deleting objects whose class labels differ from the majority of their k nearest neighbours, i.e. if the number of neighbours from the minority class is predominant, the example from the majority class is deleted. SMOTE tries to achieve a more balanced distribution of classes by synthesising new minority class examples. It combines existing minority instances to create “synthetic” minority instances using linear interpolation. For each example in the minority class, these synthetic training records are constructed by randomly picking one or more of its k-nearest neighbours.
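All of the samplers described are available in imbalanced-learn. A sketch of how they might be instantiated (default parameters assumed except where the text specifies otherwise; X and y denote an assumed feature matrix and label vector) is:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import (
    CondensedNearestNeighbour,
    EditedNearestNeighbours,
    NearMiss,
    RandomUnderSampler,
    TomekLinks,
)

samplers = {
    "CNN": CondensedNearestNeighbour(),
    "NM": NearMiss(version=3),  # the NM-3 variant described above
    "Random": RandomUnderSampler(),
    "TK": TomekLinks(),
    "ENN": EditedNearestNeighbours(),
    "SMOTE": SMOTE(),
}

# resample a dataset with the chosen strategy
X_res, y_res = samplers["SMOTE"].fit_resample(X, y)
```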

2.4. Cross-Validation Testing

Each model’s performance was evaluated using 5-fold stratified cross-validation testing, which randomly divides a data set into five equal subsets with the same proportion of negative and positive examples. In each iteration, four subsets are used for training while the fifth is used for testing, so that each subset serves as the test set exactly once.

For the over- and under-sampling approaches, a pipeline was constructed with the imbalanced-learn python package so that the under-/over-sampling was performed on the training set within each fold separately during CV testing, thereby avoiding the data leakage that would occur if the over-/under-sampling were performed prior to the cross-validation.
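A minimal sketch of this leakage-free arrangement, using imbalanced-learn's pipeline with synthetic data standing in for the study's descriptors, is:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in for a skewed binary toxicity dataset (80:20 class split)
X, y = make_classification(n_samples=500, n_features=50, weights=[0.8], random_state=0)

# the sampler is re-fit on the training folds only at each CV split,
# so no synthetic points leak into the test folds
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", RandomForestClassifier(random_state=0))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
```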

The impact of the sampling approach for random under-sampling was investigated for the chronic liver endpoint to see whether the number of positives vs. number of negatives had any impact on F1 performance. Starting with at least 50 positives, the sampled majority class was incrementally increased by 10 up to the maximum possible while matching the minority class.

For the feature selection analysis, the top 10, 20, 30, 40, 60, 70, and 80 descriptors were selected to build classifiers using the chi-squared value within the sklearn python package to measure the association between the toxicity endpoint and the descriptor.

The entire cross-validation process was repeated 5 times for each data set. This strategy allowed an assessment of the impacts of the number of features in a data set, the descriptor type, the feature selection technique, the classification algorithm, and the data balancing approach to be made on the predictive performance for each toxicity outcome.

2.5. Evaluation of Model Performance

F1 score, precision, sensitivity and specificity were used to assess model performance. The performance metrics were summarised by the mean and standard deviation of performance aggregated across the 5-fold stratified cross-validation testing iterations. Analysis of variance (ANOVA) was used to evaluate the influence of target organ toxicities, ML classification algorithm, balancing approach, and descriptor types on F1 scores, precision, sensitivity, and specificity. First, the influence of machine learning model and descriptor type on performance measure scores was compared using one-way ANOVA. For repeated comparisons of differences between means, Tukey’s honest significant difference (HSD) test [30] was used.
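For reference, with TP, TN, FP and FN denoting true positives, true negatives, false positives and false negatives, these metrics are defined as:

```latex
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{specificity} = \frac{TN}{TN + FP}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{sensitivity}}{\mathrm{precision} + \mathrm{sensitivity}}
```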

2.6. Feature Selection

The performance of the classification algorithms was first evaluated with no feature selection for each dataset, and subsequently, for selected toxicity outcomes, using chi-squared filter feature selection. Before selecting features for use in the cross-validation loop, low variance features (coefficient of variation threshold of 0.01) were excluded. The SelectKBest method (from sklearn) scores the features against the toxicity endpoint and then keeps the most significant features according to the chi-squared statistic. A range of 10–80 features was used to build classifiers so that the relationship between the number of features and the mean CV F1 performance could be evaluated.
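A sketch of this two-step filter is shown below; sklearn's VarianceThreshold is used here as a stand-in for the low-variance filter described, and in the study the chi-squared selection was performed inside each CV training fold rather than on the whole dataset:

```python
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

# X, y: assumed binary descriptor matrix and toxicity labels
X_pruned = VarianceThreshold(threshold=0.01).fit_transform(X)

# keep the k features most associated with the outcome by chi-squared score
X_selected = SelectKBest(score_func=chi2, k=20).fit_transform(X_pruned, y)
```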

2.7. Data Analysis and Code Availability

The R (Version 3.6.1) and Python (Version 3.9) programming languages were used to process and analyse the data. Chemical fingerprints were created using the RDKit package [24], and genra-py was used to calculate GenRA similarity-weighted activity estimates [29] (10 neighbours and the Jaccard similarity). The transcriptomic data were retrieved from the EPA Center for Computational Toxicology MySQL invitrodb (Version 3.0) and pre-processed for analysis in Python using the tcpl package in R [20]. Machine learning and performance assessment were handled using the scikit-learn python package [31], whilst data balancing was handled by the imbalanced-learn python package [32]. The completed workflow is documented in Jupyter notebooks [33] available publicly on GitHub [https://github.com/patlewig/liver-ml]. The chemical, transcriptomic and toxicity data files are provided as supplementary information. Table 1 provides a list of common abbreviations.

Table 1:

List of abbreviations

Acronym Full Name

CASRN Chemical Abstract Services Registry Number
CV Cross-Validation
DSSTox Distributed Structure-Searchable Toxicity
DTXSID DSSTox substance identifier
HCS High Content Screening
HTS High Throughput Screening
HTTr High Throughput Transcriptomics
LTEA Life Technologies/Expression Analysis
OCSPP Office of Chemical Safety and Pollution Prevention
NAM New Approach Methodology/Method
NTP National Toxicology Program
ML Machine Learning
QSAR Quantitative Structure-Activity Relationship
ToxCast Toxicity ForeCaster

3. Results and Discussion

3.1. Datasets

There were 92 toxicity outcomes across five guideline study types (chronic, subchronic, developmental, multigenerational reproductive and subacute) that contained a minimum of 50 positive and 50 negative substances (Figure 2). The ratios of positive to negative substances for most effects were skewed, with negative chemicals outnumbering positives by a substantial amount (4.17:1 on average). Among the few target effects with more prevalent positive chemicals were chronic body weight (5.3:1), subchronic body weight (5.3:1), subchronic liver (2.25:1), chronic liver (1.7:1), subacute liver (1.58:1), and subchronic kidney (1.43:1). Due to the unequal numbers of positives and negatives for most data sets, initial efforts concentrated on analysing how the unbalanced data affected classifier performance. Given that the transcriptomic data had been generated in HepaRG™ cells, a human hepatic progenitor cell line, subsequent analyses concentrated on establishing the optimal strategy for reducing classification bias in the majority class of this highly skewed data utilising liver-specific endpoint targets only. Positive to negative chemical ratios (absolute numbers in brackets) for liver-specific outcomes were 2.25:1 (356:158) for subchronic, 1.7:1 (325:191) for chronic, 1.58:1 (82:52) for subacute, 1:1 (149:142) for multigenerational reproductive, and 0.2:1 (66:331) for developmental study types.

3.2. Predictive Performance using Unbalanced Datasets

Classification performance was aggregated across all descriptors using 18 unbalanced datasets with at least 50 positive and 50 negative compounds to evaluate the association between class imbalance and classifier performance. Body weight effects as well as effects in liver, kidney, spleen and lung were considered. The developmental neurotoxicity clinical signs endpoint was also included in this analysis, though there were only 49 positives and 44 negatives. Figure 3 outlines the main workflow used for this initial analysis.

All classifiers performed better with respect to their mean CV F1 score when the number of positive chemicals exceeded the number of negative chemicals; since the F1 score rewards correct prediction of the positive (toxic) class, this is a key consideration. The converse was not true; when negatives outnumbered positives, all classifiers performed poorly (see Figure 4). There was no discernible difference in performance if mean CV BA was used as the metric. Spearman’s correlation analysis (denoted by r) of the number of positive compounds against mean CV F1 performance (Figure 5) demonstrated bias in all classifiers towards majority positive classes. Whilst each classifier demonstrated this bias, GenRA, Gradient Boosting, Logistic Regression, and Random Forest classifiers were the most strongly influenced by this imbalance, producing significant r values (number of positives to mean CV F1) of 0.726, 0.744, 0.714, and 0.712, respectively.

The developmental neurotoxicity clinical signs outcome (which did not meet the baseline criteria for numbers of compounds) appeared to be similarly biased, with more positive than negative chemicals (49:44), though this was also a better-balanced target (mean CV F1 for classifiers ranged from 0.633 for Gradient Boosting to 0.757 for ANN; mean CV BA ranged from 0.49 to 0.58). Indeed, Figure 6 shows how structurally similar the chemicals are within this endpoint dataset. Chemicals that were positively acting (marked with red labels) clustered together, which might account for this particular dataset being amenable to modelling. The multigenerational reproductive liver outcome (142:149) fared rather well for each classifier, with an average CV F1 performance of 0.666 (range: 0.527–0.73). The subacute liver outcome fared best for each classifier aside from KNN if BA was considered as the performance metric, with a mean BA of 0.67 (range: 0.37–0.74).

Classifier performance on the basis of mean CV F1 on unaggregated and unbalanced data sets was also investigated to determine whether the association between class imbalance and prediction performance was potentially impacted by descriptor type. The type of descriptor (i.e. Morgan vs. gene vs. hybrid) and the size of the descriptor sets in terms of number of features appeared to have minimal effect on mean CV F1 performance. Indeed, one-way ANOVA testing on all outcomes grouped by classifier and descriptor showed no significant differences. Note: the sample size for the classifier (between groups) was 8, whereas the sample size for the residual (within groups) was 137.

Figure 3:

Workflow describing the datasets created for the initial unbalanced evaluations.

Figure 4:

Relationship Between Predictive Performance and Class Imbalance: Chr: Chronic, Dev: Developmental, Dnt: Developmental Neurotoxicity, Mgr: Multigenerational, Sac: Subacute, Sub: Subchronic, N+: Number of positive chemicals, N-: Number of negative chemicals, ANN: Artificial Neural Networks, KNN: K-Nearest Neighbour, LR: Logistic Regression, NB: Naïve Bayes, SVC: Support Vector Classifier.

Figure 5:

Spearman’s correlation of mean 5-fold stratified CV F1 score and the number of positive chemicals.

Figure 6:

Dendrogram following clustering, on the basis of Morgan fingerprints, of the substances with developmental neurotoxicity data, to explore the similarity within positively acting and negatively acting chemicals. Substance labels marked in red correspond to those chemicals that were positive.

Next, attention was focused on liver-specific toxicity endpoints, given the hepatic origin of the cell type used in this data set. Mean CV F1, sensitivity and precision scores were optimal for all classifiers except KNN when predicting liver toxicity outcomes with more positive chemicals. The mean CV F1, sensitivity, specificity and precision aggregated across all classifiers for outcomes in chronic, subchronic and subacute studies were 0.746, 0.823, 0.310 and 0.711, respectively. These performance scores were not affected by descriptor type or by the particular liver toxicity outcome. The predictive performance of the KNN classifier for liver toxicity endpoints with more positive chemicals was context specific: when compared to the other descriptors, the KNN classifier performed particularly well with the transcriptomic hitcall (gene) descriptors. The mean CV F1, sensitivity, specificity and precision for KNN leveraging gene descriptors were 0.765, 0.809, 0.394 and 0.729, respectively. KNN mean CV F1 scores for all other descriptors were as low as 0.433.

All classifiers showed the highest performance across mean CV F1, sensitivity and precision for subchronic liver effects. The mean performance of each classifier for the developmental liver toxicity outcome, which comprised more negative chemicals than positive, was poor, with mean CV F1, sensitivity, specificity and precision values of 0.04, 0.035, 0.96 and 0.08, respectively. The mean CV F1, sensitivity, specificity and precision were also examined across all imbalanced liver endpoints by descriptor type. Descriptor types had no impact on the performance of classifiers, though it should be noted that the mean CV F1 and sensitivity performance of the KNN was highest with transcriptomic (biological) descriptors relative to the other combinations (Figure 7). One-way ANOVA testing on chronic, subchronic and subacute liver effects grouped by classifier showed no significant differences between descriptor types (note: the sample size for the classifier (between groups) was 8 and the sample size for the residual (within groups) was 17).

Figure 7:

Mean 5-fold stratified CV test performance for liver effects aggregated by classifier. Descriptor types had no impact on the performance of classifiers, though it should be noted that the performance of the KNN was highest with transcriptomic (biological) descriptors relative to the other combinations.

3.3. Predictive Performance using Balanced Data Sets

Several approaches to balancing toxicity data were evaluated to determine the unbiased performance of each classifier for predicting in vivo toxicity. The following under-sampling procedures: condensed nearest neighbours (CNN), near miss (NM), random, Tomek links (TK) and edited nearest neighbours (ENN), were compared with the over-sampling approach, the synthetic minority over-sampling technique (SMOTE), relative to the original unbalanced approach. To evaluate each approach, two different use-case examples were used: the chronic liver toxicity (chr-liver) endpoint (236 positive, 128 negative) and the developmental liver (dev-liver) endpoint (43 positive, 219 negative), in which the majority class was under-sampled or the minority class was over-sampled to create balanced datasets. Mean CV performance scores were aggregated by descriptor type and classifier. Figure 8 summarises the aggregated performance for these two toxicity effects. Overall, there was more consistency across performance metrics (mean CV F1, sensitivity, specificity, precision) for the chronic liver endpoint, in particular for the SMOTE, Random, and CNN approaches. Specificity was lower in the TK and unbalanced approaches (average CV specificity of 0.396 and 0.337, respectively) relative to the other performance measures, whereas specificity was higher in both the ENN (0.923) and NM (0.823) approaches. For the prediction of the developmental liver endpoint (which had more negative chemicals), NM, Random and SMOTE were amongst the most consistent and best balancing techniques for aggregated classifiers. There was substantial variation in performance between the unbalanced approach, the over-sampling technique and the under-sampling approaches. Figure 9 shows the poor performance across the unbalanced datasets for the developmental liver effects irrespective of descriptor type. Under-sampling approaches were very variable for both chronic liver and developmental liver effects. The over-sampling SMOTE approach was most consistent between the two toxicity effects and across all performance metrics. Aggregating by classifier and descriptors for the chronic liver endpoint, significant differences were observed in mean CV F1 performance (Figure 10). Mean CV F1 performance when using ENN was very variable for all classifiers. The unbalanced approach tended to perform best of all (0.735), followed by TK (0.719) and SMOTE (0.640).

Figure 8:

Mean 5-fold stratified CV performance of all classifiers using multiple balancing approaches: chronic liver vs developmental liver toxicity. This figure summarises performance following 5-fold cross-validation testing after balancing the chronic (chr liver) and developmental (dev liver) liver toxicity endpoints with several under-sampling and over-sampling methodologies: F1, sensitivity, specificity, and precision for aggregated classifiers and descriptors using the Near Miss (NM), Tomek Links (TK), Condensed Nearest Neighbour (CNN), Edited Nearest Neighbour (ENN) and Random under-sampling approaches, compared with the Synthetic Minority Oversampling Technique (SMOTE) and the unbalanced approach. There was greater consistency across performance metrics for the chronic liver endpoint overall; the SMOTE, Random and unbalanced methods performed better overall. Specificity was lower and more variable when using the TK and unbalanced methods. For the prediction of the developmental liver endpoint (which was richer in negative chemicals), NM, Random and SMOTE were most consistent for aggregated classifiers, with SMOTE seeing a loss in specificity.

Figure 9:

Summary of Chronic and Developmental Liver Toxicity Predictive mean 5-fold stratified CV Performance across Classifiers aggregated by sampling approach. Overall, unbalanced approaches fared worse for the developmental liver toxicity endpoint. Under-sampling approaches were highly variable across and within performance metrics irrespective of toxicity endpoint.

Figure 10:

Summary of Chronic Liver Toxicity Predictive mean 5-fold stratified CV Performance aggregated by Classifier and descriptors. Significant differences were observed in mean CV F1 performance.

Aggregating by classifier and descriptors for the developmental liver endpoint, significant differences were observed in mean CV F1 performance (Figure 11). The unbalanced approach tended to perform poorly together with the CNN and TK. Random performed the best (0.267) followed by SMOTE (0.234) and NM (0.205).

Figure 11:

Summary of Developmental Liver Toxicity Predictive mean 5-fold stratified CV Performance aggregated by Classifier and descriptors. Significant differences were observed in mean CV F1 performance.

Depending on the toxicity endpoint's distribution of positive and negative chemicals, SMOTE showed the largest improvement relative to the unbalanced approach. In situations where there were more negative chemicals than positive, SMOTE and random sampling often improved performance metrics (mean CV F1 and sensitivity) (Figure 12). Where there were more positive chemicals than negative, SMOTE and random mean CV F1 performance was commensurate with the unbalanced approach and sensitivity also increased. The unbalanced approach appeared to favour sensitivity, as reflected by its lower specificity scores.

Figure 12:

Comparison of mean 5-fold stratified CV performance between SMOTE, random and unbalanced approaches.

3.4. Post hoc balancing

Random under-sampling was evaluated for the chronic liver endpoint by incrementally increasing the majority (positive) class to the maximum possible that still matched the minority class. Initially a minimum of 50 positive chemicals was used, which was increased in increments of 10 to the maximum whilst matching the number of negative chemicals. There was no significant difference in mean CV F1 performance with increasing numbers of positive chemicals when aggregating by classifier (see Figure 13). One-way ANOVA testing on chronic liver effects grouped by classifier and descriptor also showed no significant differences for this sampling strategy.

Figure 13:

Comparison of mean 5-fold stratified CV F1 performance between different random under-sampling strategies across different classifiers for the chronic liver endpoint.

3.5. Feature selection

For unbalanced data, the mean CV F1 performance across all descriptors and across the range of the top 10–80 features was 0.739 (0.043 SD) for chronic liver and 0.038 (0.054 SD) for developmental liver. This was barely different from using all descriptors (0.735, 0.0395 SD) for the chronic liver outcome, but poorer for developmental liver (cf. 0.089, 0.111 SD). For balanced data using an under-sampling approach, the mean CV F1 performance across all descriptors was slightly higher for chronic liver effects (0.607, 0.079 SD) but slightly lower for developmental liver (0.128, 0.092 SD). For balanced data using an over-sampling approach, the mean CV F1 performance across all descriptors was higher for chronic liver effects (0.649, 0.069 SD) but lower for developmental liver (0.192, 0.098 SD). Figure 14 summarises all the performance metrics. Unbalanced approaches led to higher performance overall for the chronic liver effect, whereas over-sampling approaches led to better performance for the developmental liver toxicity outcome. Variation between descriptor types was smaller for the over-sampling and unbalanced approaches, whereas there was a significant difference in mean CV F1 performance for under-sampling approaches for the chronic liver endpoint. The hybrid descriptors gave rise to higher mean CV F1 performance scores compared with chemical-only features (cba hybrid mean CV F1 performance was 0.655 vs. 0.55 for Morgan features only). For the developmental liver endpoint, there was a significant difference in CV F1 performance in the over-sampling and under-sampling approaches across the descriptors, with the transcriptomic-only descriptors giving rise to the highest mean F1 score (0.23). When exploring the number of top features as a function of mean CV F1 performance for the two endpoints aggregated by classifier, descriptors and balancing strategy, no change in performance was noted for the chronic liver endpoint, whereas for the developmental liver endpoint there was a slight increase with 20 features (see Figure 15).

Figure 14:

Comparison of mean 5-fold stratified CV performance using feature selection between balancing approaches across different classifiers for the chronic liver and developmental liver endpoints.

Figure 15:

Comparison of mean stratified CV F1 performance as a function of the top k features, aggregated across different classifiers, descriptors and sampling strategies for the chronic liver and developmental liver endpoints.

3.6. Prediction performance

In the case of the chronic liver endpoint, where the number of positives outweighed the negatives, overall mean CV F1 performance was highest using an unbalanced strategy and feature selection, though practically there was little difference if no feature selection was performed. Over- and under-sampling approaches served to decrease mean CV F1 performance. On the other hand, for the developmental liver toxicity outcome, where the number of negatives exceeded the number of positives, SMOTE over-sampling strategies with all features resulted in the highest mean CV F1 performance. SMOTE appears to remove bias mainly by generating examples similar to the existing minority points, creating larger and less specific decision boundaries that increase the generalisation capabilities of classifiers and therefore their performance. However, SMOTE has some associated issues, such as overgeneralisation (the new synthetic examples may be generated in overlapping areas) and the possibility of augmenting noisy regions (since no distinction between different types of minority examples is made). Despite this, its ability to generate larger decision boundaries remains a major strength.

3.7. Conclusions

In this study, a systematic analysis of the impact of descriptor type, machine learning technique, balancing approach and feature selection on mean CV F1 performance was undertaken. After evaluating overall performance across a selection of toxicity outcomes where there were at least 50 positives and 50 negatives, a more focused study was undertaken for the chronic and developmental liver endpoints. These endpoints were selected based on their dataset size and because they reflected the two extremes of the number of positives outweighing negatives and vice versa. Unbalanced approaches appeared to perform best for the chronic liver endpoint vs. over-sampling approaches for the developmental liver endpoint. Feature selection offered no tangible advantages for either endpoint in terms of performance, though it is recognised that fewer features can offer advantages in model interpretability. Simpler ML approaches such as GenRA fared well against more complex algorithms such as Random Forest or Gradient Boosted Trees for both endpoints. Overall, this study has highlighted the importance of considering class imbalance in tailoring ML workflows for predicting toxicity.

Supplementary Material

Supplement1

Highlights.

  • Toxicity data is often biased with more positive outcomes

  • Various machine learning (ML) approaches were used to predict in vivo toxicity outcomes

  • Various balancing approaches were systematically evaluated

  • For toxicity effects with more negative results, over-sampling tended to result in better performance

  • Tailoring ML workflows for toxicity should consider class imbalance and simpler classifiers first

Footnotes


Disclaimer

The views expressed in this article are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency.


Competing financial interests

The authors declare they have no actual or potential competing financial interests. Grace Patlewicz has served as an editor for special issues of this journal.

References

  • [1] US EPA, “Lautenberg Chemical Safety Act,” 2016.
  • [2] US EPA, “New Approach Methods Work Plan,” 2021.
  • [3] Patlewicz G and Fitzpatrick JM, “Current and future perspectives on the development, evaluation, and application of in silico approaches for predicting toxicity,” Chem Res Toxicol, vol. 29, no. 4, pp. 438–451, 2016.
  • [4] Houck KA, Richard AM, Judson RS, Martin MT, Reif DM, and Shah I, “ToxCast: Predicting toxicity potential through high-throughput bioactivity profiling,” 2013.
  • [5] Harrill JA, Everett LJ, Haggard DE, Sheffield T, Bundy JL, Willis CM, Thomas RS, Shah I, and Judson RS, “High-throughput transcriptomics platform for screening environmental chemicals,” Toxicol Sci, vol. 181, no. 1, pp. 68–89, 2021.
  • [6] Nyffeler J, Willis C, Lougee R, Richard A, Paul-Friedman K, and Harrill JA, “Bioactivity screening of environmental chemicals using imaging-based high-throughput phenotypic profiling,” Toxicol Appl Pharmacol, vol. 389, p. 114876, 2020.
  • [7] Thomas RS, Bahadori T, Buckley TJ, Cowden J, Deisenroth C, Dionisio KL, Frithsen JB, Grulke CM, Gwinn MR, Harrill JA, Higuchi M, Houck KA, Hughes MF, Hunter ES, Isaacs KK, Judson RS, Knudsen TB, Lambert JC, Linnenbrink M, Martin TM, Newton SR, Padilla S, Patlewicz G, Paul-Friedman K, Phillips KA, Richard AM, Sams R, Shafer TJ, Setzer RW, Shah I, Simmons JE, Simmons SO, Singh A, Sobus JR, Strynar M, Swank A, Tornero-Valez R, Ulrich EM, Villeneuve DL, Wambaugh JF, Wetmore BA, and Williams AJ, “The next generation blueprint of computational toxicology at the U.S. Environmental Protection Agency,” Toxicol Sci, vol. 169, no. 2, pp. 317–332, 2019.
  • [8] Liu J, Mansouri K, Judson RS, Martin MT, Hong H, Chen M, Xu X, Thomas RS, and Shah I, “Predicting hepatotoxicity using ToxCast in vitro bioactivity and chemical structure,” Chem Res Toxicol, vol. 28, no. 4, pp. 738–751, 2015.
  • [9] Liu J, Patlewicz G, Williams AJ, Thomas RS, and Shah I, “Predicting organ toxicity using in vitro bioactivity data and chemical structure,” Chem Res Toxicol, vol. 30, no. 11, pp. 2046–2059, 2017.
  • [10] Deist TM, Dankers FJWM, Valdes G, Wijsman R, Hsu I-C, Oberije C, Lustberg T, van Soest J, Hoebers F, Jochems A, El Naqa I, Wee L, Morin O, Raleigh DR, Bots W, Kaanders JH, Belderbos J, Kwint M, Solberg T, Monshouwer R, Bussink J, Dekker A, and Lambin P, “Machine learning algorithms for outcome prediction in (chemo)radiotherapy: An empirical comparison of classifiers,” Medical Physics, vol. 45, no. 7, pp. 3449–3459, 2018.
  • [11] He S, Ye T, Wang R, Zhang C, Zhang X, Sun G, and Sun X, “An in silico model for predicting drug-induced hepatotoxicity,” International Journal of Molecular Sciences, vol. 20, no. 8, p. 1897, 2019.
  • [12] Ancuceanu R, Hovanet MV, Anghel AI, Furtunescu F, Neagu M, Constantin C, and Dinu M, “Computational models using multiple machine learning algorithms for predicting drug hepatotoxicity with the DILIrank dataset,” Int J Mol Sci, vol. 21, no. 6, p. 2114, 2020.
  • [13] Xu T, Ngan DK, Ye L, Xia M, Xie HQ, Zhao B, Simeonov A, and Huang R, “Predictive models for human organ toxicity based on in vitro bioactivity data and chemical structure,” Chem Res Toxicol, vol. 33, no. 3, pp. 731–741, 2020.
  • [14] Shah I, Liu J, Judson RS, Thomas RS, and Patlewicz G, “Systematically evaluating read-across prediction and performance using a local validity approach characterized by chemical structure and bioactivity information,” Regul Toxicol Pharmacol, vol. 79, pp. 12–24, 2016.
  • [15] Patlewicz G and Shah I, “Towards systematic read-across using generalised read-across (GenRA),” Computational Toxicology, vol. 25, p. 100258, 2023.
  • [16] Watford S, Ly Pham L, Wignall J, Shin R, Martin MT, and Friedman KP, “ToxRefDB version 2.0: Improved utility for predictive and retrospective toxicology analyses,” Reprod Toxicol, vol. 89, pp. 145–158, 2019.
  • [17] Tate T, Wambaugh J, Patlewicz G, and Shah I, “Repeat-dose toxicity prediction with generalized read-across (GenRA) using targeted transcriptomic data: A proof-of-concept case study,” Comput Toxicol, vol. 19, pp. 1–12, 2021.
  • [18] Richard AM, Judson RS, Houck KA, Grulke CM, Volarath P, Thillainadarajah I, Yang C, Rathman J, Martin MT, Wambaugh JF, Knudsen TB, Kancherla J, Mansouri K, Patlewicz G, Williams AJ, Little SB, Crofton KM, and Thomas RS, “ToxCast chemical landscape: Paving the road to 21st century toxicology,” Chem Res Toxicol, vol. 29, no. 8, pp. 1225–1251, 2016.
  • [19] Franzosa JA, Bonzo JA, Jack J, Baker NC, Kothiya P, Witek RP, Hurban P, Siferd S, Hester S, Shah I, Ferguson SS, Houck KA, and Wambaugh JF, “High-throughput toxicogenomic screening of chemicals in the environment using metabolically competent hepatic cell cultures,” NPJ Syst Biol Appl, vol. 7, no. 1, p. 7, 2021.
  • [20] Filer DL, Kothiya P, Setzer RW, Judson RS, and Martin MT, “tcpl: the ToxCast pipeline for high-throughput screening data,” Bioinformatics, vol. 33, no. 4, pp. 618–620, 2017.
  • [21] Rogers D and Hahn M, “Extended-connectivity fingerprints,” J Chem Inf Model, vol. 50, no. 5, pp. 742–754, 2010.
  • [22] Nilakantan R, Bauman N, Dixon JS, and Venkataraghavan R, “Topological torsion: a new molecular descriptor for SAR applications. Comparison with other descriptors,” J Chem Inf Comput Sci, vol. 27, no. 2, pp. 82–85, 1987.
  • [23] Yang C, Tarkhov A, Marusczyk J, Bienfait B, Gasteiger J, Kleinoeder T, Magdziarz T, Sacher O, Schwab CH, Schwoebel J, Terfloth L, Arvidson K, Richard A, Worth A, and Rathman J, “New publicly available chemical query language, CSRML, to support chemotype representations for application to data mining and modeling,” J Chem Inf Model, vol. 55, no. 3, pp. 510–528, 2015.
  • [24] Landrum GL, “RDKit: Open-source cheminformatics.”
  • [25] Williams AJ, Grulke CM, Edwards J, McEachran AD, Mansouri K, Baker NC, Patlewicz G, Shah I, Wambaugh JF, Judson RS, and Richard AM, “The CompTox chemistry dashboard: a community data resource for environmental chemistry,” J Cheminform, vol. 9, no. 1, p. 61, 2017.
  • [26] Müller A and Guido S, Introduction to Machine Learning with Python: A Guide for Data Scientists. O’Reilly Media, 1st ed., 2016.
  • [27] Géron A, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media, 1st ed., 2017.
  • [28] Helman G, Shah I, and Patlewicz G, “Extending the generalised read-across approach (GenRA): A systematic analysis of the impact of physicochemical property information on read-across performance,” Comput Toxicol, vol. 8, pp. 34–50, 2018.
  • [29] Shah I, Tate T, and Patlewicz G, “Generalized read-across prediction using genra-py,” Bioinformatics, vol. 37, no. 19, pp. 3380–3381, 2021.
  • [30] Tukey JW, “Comparing individual means in the analysis of variance,” Biometrics, vol. 5, no. 2, pp. 99–114, 1949.
  • [31] Pedregosa F, Varoquaux G, et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [32] Lemaître G, Nogueira F, and Aridas CK, “Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning,” Journal of Machine Learning Research, vol. 18, no. 17, pp. 1–5, 2017.
  • [33] Kluyver T, Ragan-Kelley B, Pérez F, Granger B, Bussonnier M, Frederic J, Kelley K, Hamrick J, Grout J, Corlay S, Ivanov P, Avila D, Abdalla S, and Willing C, “Jupyter notebooks – a publishing format for reproducible computational workflows,” in Positioning and Power in Academic Publishing: Players, Agents and Agendas (Loizides F and Schmidt B, eds.), pp. 87–90, IOS Press, 2016.
