Abstract
Hits from high-throughput screening (HTS) of chemical libraries are often false positives due to their interference with the assay detection technology. In response, we generated the largest publicly available library of chemical liabilities and developed “Liability Predictor,” a free webtool to predict HTS artifacts. More specifically, we generated, curated, and integrated HTS datasets for thiol reactivity, redox activity, and luciferase (firefly and nano) activity and developed and validated Quantitative Structure-Interference Relationship (QSIR) models to predict these nuisance behaviors. The resulting models showed 58–78% external balanced accuracy for 256 external compounds per assay. QSIR models developed and validated herein identify nuisance compounds among experimental hits more reliably than popular PAINS filters. Both the models and the curated datasets were implemented in “Liability Predictor,” publicly available at https://liability.mml.unc.edu/. “Liability Predictor” may be used as part of chemical library design or for triaging HTS hits.
Introduction
High-throughput screening (HTS) technology has enabled the routine testing of large chemical libraries in an effort to discover novel hit compounds.1 HTS campaigns, however, are stymied by the presence of false positives, or assay artifacts, which are compounds that appear active in primary screens but show no activity in confirmatory assays.2 False positives mimic a desired biological response but do not interact with the assayed target(s) of interest either specifically or, often, at all. This phenomenon is known as assay interference, and it can unfortunately persist into hit-to-lead optimization,3 resulting in a significant waste of resources. Effective identification and elimination of false positives is thus a crucial component of triaging HTS hits. Assay interference mechanisms include but are not limited to (a) chemical reactivity; (b) interference with luciferase reporter enzymes; (c) aggregation; (d) compound-mediated assay interferences in homogeneous proximity assays; and (e) interference with fluorescence and absorbance (for which no model is needed, as it can be traced by changing the fluorophore and red-shifting the spectral window).4 Assays and their interference mechanisms are summarized in Table 1 and described in further detail in the following paragraphs.
Table 1.
Most frequent assay interference mechanisms and common assays.
| Type | Assays |
|---|---|
| Assay Interference by Chemical Reactivity | (E)-2-(4-mercaptostyryl)-1,3,3-trimethyl-3H-indol-1-ium (MSTI) fluorescence reactivity assay5 and redox activity assay. |
| Interference with Luciferase Reporter Enzymes | Luciferase firefly, luciferase nano. |
| Assay Interference by Aggregation | AmpC β-lactamase inhibition, cruzain inhibition. |
| Interference with Fluorescence and Absorbance | Transcreener (BellBrook Labs), FP assay for peptidyl prolyl isomerase 1 (Pin1), TR-FRET, Differential Scanning Fluorimetry (DSF), DTNB, malate dehydrogenase (MDH), HTS assay for inhibitors of TGR/PRX2 |
| Compound-Mediated Assay Interferences in Homogeneous Proximity Assays | Amplified luminescent proximity homogeneous assays (ALPHA, which is trademarked by PerkinElmer), Förster/fluorescence resonance energy transfer (FRET), time-resolved FRET (TR-FRET), and homogeneous time-resolved fluorescence (HTRF, which is trademarked by CisBio), bioluminescence resonance energy transfer (BRET), and scintillation proximity assays (SPA).6 |
Nonspecific chemical reactivity, caused by thiol-reactive compounds (TRCs) and redox cycling compounds (RCCs), occurs when the compound of interest undergoes unwanted chemical reactions with target biomolecules or assay reagents.7 TRCs, as their name suggests, covalently modify cysteine residues by exploiting the nucleophilicity of thiol side chains. This leads to nonspecific interactions in cell-based assays and/or on-target modifications in biochemical assays.8
RCCs, on the other hand, are far more insidious and less likely than TRCs to result in an actionable hit, regardless of the associated liabilities. RCCs produce hydrogen peroxide (H2O2) in the presence of strong reducing agents found in buffers that are used to maintain the structural integrity and catalytic activity of target proteins in HTS.8 The generated H2O2 can oxidize accessible (seleno)cysteine, histidine, methionine, and/or tryptophan residues of the target protein, thereby indirectly modulating activity. Moreover, considering the importance of H2O2 as a secondary messenger in many signaling pathways, RCCs are particularly problematic and confounding for cell-based phenotypic HTS campaigns.7
Assay artifacts may also elicit spurious activity by inhibiting the reporter protein. One common reporter protein is luciferase, which is an enzyme that catalyzes the oxidation of a substrate, in turn producing bioluminescence. Luciferase is widely used as a reporter in studies that investigate gene regulation and function, as well as in those that aim to measure the bioactivity of chemicals.9 Several drug targets, such as GPCRs and nuclear receptors, are associated with the regulation of gene transcription. When the luciferase gene sequence binds to the sequence of a gene of interest, the cellular event can be detected by a luminescent signal. For this reason, luciferases have been widely employed in HTS.10 However, several compounds are known to inhibit luciferases leading to a false positive readout.11
Some compounds present poor solubility and aggregate at screening concentrations above the critical aggregation concentration.12 Compounds that form aggregates in situ can nonspecifically perturb biomolecules in biochemical and cell-based assays.13 In fact, aggregation is the most common cause of assay artifacts in HTS campaigns,14,15 and we previously coined the term “small, colloidally aggregating molecules” (SCAMs) to refer to those compounds.16 Many biochemical and cell-based assays utilize fluorescence and, to a lesser extent, absorbance readouts.16 Given that there are many available fluorophores that span a wide energy spectrum, it is important to select the appropriate fluorophore conditions for a given assay to minimize assay interference.16 This can help minimize the number of false positives that occur as a result of small molecules in screening libraries being fluorescent themselves. Because multiple factors contribute to the fluorescence signal independently of compound structure, QSIR models might not be suitable for modeling fluorescence artifacts. Nevertheless, as we showed previously,4 utilizing readouts in the far-red spectrum for HTS assays leads to a dramatic reduction in interference. Similarly, in absorbance assays, colored compounds can interfere with the detection method depending on the concentration and extinction coefficient.4
Lastly, some assay technologies are susceptible to a variety of technology-related, compound-mediated interferences, most notably signal attenuation (e.g., via quenching, inner-filter effects, light scattering), signal emission (e.g., via auto-fluorescence), and disruption of affinity capture components such as affinity tags and antibodies. These assays are also susceptible to more generalized compound-mediated interferences such as nonspecific reactivity and aggregation.6
Assay interference, regardless of the HTS platform, can potentially inundate HTS hit lists with false positives and hinder drug discovery efforts if not properly triaged.7,8 Computational methods have been developed to assist in the detection and removal of these interference compounds from HTS hit lists and screening libraries. The most widely used computational tools for flagging suspected false positives are Pan-Assay INterference compoundS (PAINS) filters, a set of 480 substructural alerts associated with an array of assay interference mechanisms, including thiol reactivity and redox cycling.17
We18 and others8,19 previously observed that PAINS filters are oversensitive and disproportionately flag compounds as interference compounds, i.e., potential false positives, while failing to identify a majority of truly interfering compounds. This occurs because chemical fragments do not act independently from their respective structural surroundings; it is the interplay between chemical structure and its surroundings that affects the properties and activity of a compound.20 Moreover, we previously observed that more than half of the original list of PAINS alerts were derived from only one or two compounds and more than 30% of them were single compounds with “pan-assay” activity.18
In recent years, there have been efforts to model specific mechanisms of assay interference while avoiding the limitations of PAINS filters and other substructural alerts. These models aim to predict assay interference endpoints with higher predictive power than substructural alerts. Examples include Luciferase Advisor,21 which predicts luciferase inhibitors in luciferase-based assays; SCAM Detective,16 which predicts colloidal aggregators, the most common source of false positives in HTS campaigns; and InterPred,22 which predicts compounds that exhibit autofluorescence and luminescence interference.
Despite recent progress, curated chemical datasets of the most common assay interference protocols, including thiol reactivity, redox activity, luciferase interference, and colloidal aggregation (see Table 1) do not exist. Using experimental HTS assay results, we have developed QSIR models for three of the most prevalent and vexing mechanisms of assay interference: fluorescence-based thiol-reactivity, redox-activity, and interference with reporter enzymes (luciferase firefly and nano). Moreover, we provide an alternative solution to the widely used but frequently inaccurate structural alert-based approaches that attempt to identify putative PAINS compounds. Specifically, we (i) generated the largest publicly available HTS datasets for chemical interference liabilities; (ii) curated and integrated the interference data; (iii) developed QSIR models to predict compounds exhibiting luciferase (firefly and nano) inhibitory activity, thiol reactivity, and redox activity; (iv) virtually profiled the in-house NCATS library of 63,941 compounds; (v) selected and experimentally tested 256 virtual hits for each assay; and (vi) developed a web application (Liability Predictor) to predict artifacts for each assay.
Materials and methods
The overarching study design is depicted in Figure 1. Briefly, data were first experimentally collected from several HTS assays. The resulting data were curated and analyzed prior to being used for training QSIR models. These models were then validated experimentally.
Figure 1.
Study design.
HTS data generation
The NCATS Pharmacologically Active Chemical Toolbox (NPACT) dataset,23 which contains more than 11,000 compounds, was selected for qHTS to generate interference data. Due to limited compound availability, we only screened 5,098 compounds through four qHTS campaigns with three underlying mechanisms of assay interference: thiol reactivity, redox activity, and luciferase interference (firefly and nano). The compounds were subjected to quality control by LC/UV, LC/MS, or Hi-res MS. All compounds exhibited >90% purity by peak area or m/z. All generated experimental data, including assigned class-curves, are publicly available and can be found in the PubChem database (see Table S1 for PubChem IDs).
Fluorescence-based thiol-reactive assay
Assays determining (E)-2-(4-mercaptostyryl)-1,3,3-trimethyl-3H-indol-1-ium (MSTI) reactivity were performed as previously described in the literature5 with modifications for the quick identification of compounds reacting with thiol. In brief, 4 μL of MSTI or positive control Acetyl-MSTI (final concentrations of 4 μM and 4 μM, respectively) in assay buffer (2% DMSO in 1X PBS pH 7.4) was dispensed into all wells of a 1536-well assay plate (solid-bottom black plate; Greiner Bio One, Monroe, NC). A Wako Pin-tool (Wako Automation, Richmond, VA) was used to transfer 16 nL of compounds (consisting of 7-plates in qHTS format, final concentration range 18.3 nM to 114 μM) or control MLS001163887 (final concentration range 1.22 nM to 40 μM). Plates were centrifuged (121xg, 15 seconds) and incubated for 1 hour at room temperature protected from light. Samples were read for fluorescence (Ex/Em = 525(20)/598(25) nm) on a ViewLux CCD imager (PerkinElmer). Data were normalized to Acetyl-MSTI (0% activity) and MSTI (100% activity) and the resulting percent inhibition data were fitted to a 4-parameter Hill equation using NCATS in-house software.
Redox activity assay
In short, 3 μL of a mixture containing Amplex Red (25 μM final concentration) and horseradish peroxidase (HRP; 250 mU/mL final concentration) in assay buffer (HBSS) was dispensed into all wells of a 1536-well assay plate (solid-bottom black plate; Greiner Bio One, Monroe, NC). A Wako Pin-tool (Wako Automation, Richmond, VA) was used to transfer 16 nL of compounds (consisting of 7-plates in qHTS format, final concentration range 18.3 nM to 114 μM) or control DA3003 (final concentration range 1.22 nM to 40 μM). Following a 15-min incubation at room temperature protected from light, 1 μL of DTT (prepared fresh; final concentration 250 μM) was dispensed for a final assay volume of 4 μL. Plates were centrifuged (121xg, 15 seconds) and incubated for 30 minutes at room temperature protected from light. Samples were read for fluorescence (Ex/Em = 525(20)/598(25) nm) on a ViewLux CCD imager (PerkinElmer). Data were normalized to DMSO (0% activity) and DA3003 (10 μM final concentration; 100% activity) and the resulting percent inhibition data were fitted to a 4-parameter Hill equation using NCATS in-house software.
Firefly Luciferase (FLuc) assay
In brief, 3 μL of luciferase substrate mixture containing D-luciferin and ATP (final concentrations of 10 μM and 10 μM, respectively) in assay buffer (10 mM Mg-acetate, 0.01% Tween-20, 0.05% bovine serum albumin, 50 mM Tris acetate, pH 7.6) was dispensed into all wells of a 1536-well assay plate (solid-bottom white plate; Greiner Bio One, Monroe, NC). A Wako Pin-tool (Wako Automation, Richmond, VA) was used to transfer 16 nL of compounds (consisting of 7-plates in qHTS format, final concentration range 18.3 nM to 114 μM) or control PTC124 (final concentration range 1.22 nM to 40 μM). Following a 15-min incubation at room temperature protected from light, 1 μL of luciferase enzyme (final concentration 10 nM, Photinus pyralis luciferase) or buffer was dispensed for a final assay volume of 4 μL. Plates were centrifuged (121xg, 15 seconds) and incubated for 5 minutes. Samples were read for luminescence on a ViewLux CCD imager (PerkinElmer). Data were normalized to no-enzyme (0% activity) and no-inhibitor (DMSO; 100%) controls and the resulting percent inhibition data were fitted to a 4-parameter Hill equation using NCATS in-house software.
Nanoluciferase (NLuc) assay
Medium was collected from the culture flasks of NLuc-expressing S16 cells (80% confluent) prior to cell plating, filtered through a 0.22 μm filter, and frozen at −20°C prior to use. Media was diluted (1:55) and 2 μL was dispensed into all wells of a 1536-well assay plate (solid-bottom white plate; Greiner Bio One, Monroe, NC). A Wako Pin-tool (Wako Automation, Richmond, VA) was used to transfer 16 nL of compounds (consisting of 7-plates in qHTS format, final concentration range 18.3 nM to 114 μM) or control Cilnidipine (final concentration range 623 nM to 80 μM). Following a 15-min incubation at room temperature protected from light, 2 μL of NLuc substrate (1:100 in Nano-Glo® luciferase buffer; Nano-Glo® luminescence assay, Promega, Madison, WI) was dispensed for a final assay volume of 4 μL. Plates were centrifuged (121xg, 15 seconds) and incubated for 5 minutes. Samples were read for luminescence on a ViewLux CCD imager (PerkinElmer). Data were normalized to no-control (0% inhibition) and control Cilnidipine (80 μM; 100%) and the resulting percent inhibition data were fitted to a 4-parameter Hill equation using NCATS in-house software.
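The control-based normalization and the four-parameter Hill model shared by the four assays above can be illustrated with a minimal sketch. The function names and the particular Hill parameterization (bottom, top, AC50, slope) are assumptions made for illustration; the NCATS in-house fitting software is not described in this paper, and no curve fitting is attempted here.

```python
def normalize_percent(raw, zero_control, full_control):
    """Rescale a raw well signal so the 0% control maps to 0 and the 100% control to 100."""
    span = full_control - zero_control
    if span == 0:
        raise ValueError("0% and 100% controls must differ")
    return 100.0 * (raw - zero_control) / span


def hill4(conc, bottom, top, ac50, slope):
    """Four-parameter Hill response at a positive concentration (same units as ac50)."""
    if conc <= 0 or ac50 <= 0:
        raise ValueError("concentrations must be positive")
    return bottom + (top - bottom) / (1.0 + (ac50 / conc) ** slope)


# Hypothetical raw signals: the controls read 200 (0%) and 1200 (100%).
print(normalize_percent(700, 200, 1200))
# At conc == ac50 the response lies halfway between bottom and top.
print(hill4(10.0, bottom=0.0, top=100.0, ac50=10.0, slope=1.5))
```

In this parameterization, the efficacy used later for classification would correspond to the fitted span between bottom and top, which is again an assumption about how the in-house software reports it.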
Small Colloidally Aggregating Molecule (SCAM) datasets.
Previously, we reported the development of SCAM Detective, a machine learning algorithm and web application (https://scamdetective.mml.unc.edu/) to identify putative SCAMs.16 Since aggregation is the most common type of assay interference in HTS campaigns,24 we reimplemented the SCAM Detective models into Liability Predictor. Details regarding the collection, curation, and integration of two SCAM datasets are available elsewhere.16 In short, HTS aggregation campaigns against β-lactamase and cruzain conducted both in the presence and absence of buffer detergent were identified and downloaded from PubChem (https://pubchem.ncbi.nlm.nih.gov/bioassay). SCAMs exhibit false-positive bioactivity (due to colloidal aggregation) that disappears in the presence of detergent.7,25 We employed this experimental observation to classify the compounds as putative aggregators or non-aggregators. The curated datasets consist of 272,611 compounds (29,983 putative aggregators and 242,628 non-aggregators) for β-lactamase (containing both AID 485341/485294 and AID 585/584)26 and 187,464 compounds (24,574 putative aggregators and 162,890 non-aggregators) for cruzain (AID 1476/1478).
Overview of the experimental dataset
Data used for QSIR modeling
Experimental data for 5,098 compounds tested for assay interference via thiol reactivity, redox activity, and firefly and nano luciferase mechanisms, which were generated as described in the previous section, were employed for computational modeling (Table 2). For the FLuc, NLuc, and thiol reactivity models, the activity threshold for interference was defined as follows: if the efficacy was not equal to zero and the class-curve was equal to −1.1, −1.2, −2.1, or −2.2, the compound was regarded as interfering; otherwise, it was classified as non-interfering. For redox activity, if the efficacy was not equal to zero and the class-curve was equal to 1.1, 1.2, 2.1, or 2.2, the compound was regarded as interfering; otherwise, it was regarded as non-interfering. Descriptions of the class curves can be found in Figure S1.
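The labeling rule above can be written as a short function. This is a sketch only: the assay labels ("thiol", "redox", "fluc", "nluc") and the function name are hypothetical, while the curve-class sets are those listed in the text.

```python
# Curve classes treated as interference, as listed in the text.
INHIBITION_CLASSES = {-1.1, -1.2, -2.1, -2.2}   # FLuc, NLuc, thiol reactivity
ACTIVATION_CLASSES = {1.1, 1.2, 2.1, 2.2}       # redox activity

# Hypothetical assay labels used only in this sketch.
ASSAYS = {"thiol", "redox", "fluc", "nluc"}


def label_interference(assay, curve_class, efficacy):
    """Return 1 (interfering) or 0 (non-interfering) for one compound in one assay."""
    if assay not in ASSAYS:
        raise ValueError(f"unknown assay: {assay}")
    allowed = ACTIVATION_CLASSES if assay == "redox" else INHIBITION_CLASSES
    # Round to one decimal so values such as -1.1000001 still match.
    return int(efficacy != 0 and round(curve_class, 1) in allowed)


print(label_interference("fluc", -1.1, -92.0))   # inhibition curve in FLuc
print(label_interference("redox", -1.1, -92.0))  # same curve is not a redox hit
```

The same curve class can therefore yield different labels in different assays, because the redox assay reads signal gain while the other three read signal loss.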
Table 2.
Dataset overview for all assays.
| Main steps | Thiol reactivity | Redox activity | FLuc | NLuc |
|---|---|---|---|---|
| Original data | 5,098 | 5,098 | 5,098 | 5,098 |
| Problematic data points (removed) | 7 | 7 | 7 | 7 |
| Duplicates (removed) | 283 | 223 | 186 | 193 |
| Total | 4,808 | 4,868 | 4,905 | 4,898 |
| Interfering (Class 1) | 1,009 | 142 | 118 | 97 |
| Non-interfering (Class 0) | 3,799 | 4,726 | 4,787 | 4,801 |
The original chemical dataset contained seven compounds that RDKit could not read and 184 compounds with replicate entries. The outcomes of most duplicates agreed (cf. Data Curation section). There were two chemicals with disagreeing binary outcomes for FLuc, nine for NLuc, 99 for thiol-reactivity, and 39 for redox activity.27 The final datasets consisted of 4,905 compounds (118 interfering and 4,787 non-interfering) for FLuc, 4,898 compounds (97 interfering and 4,801 non-interfering) for NLuc, 4,808 compounds (1,009 interfering and 3,799 non-interfering) for thiol reactivity, and 4,868 compounds (142 interfering and 4,726 non-interfering) for redox activity. The curated SCAMs datasets consist of 272,611 compounds (29,983 interfering and 242,628 non-interfering) for β-lactamase and 187,464 compounds (24,574 interfering and 162,890 non-interfering) for cruzain. All the curated data are available in the Supplementary Materials.
Data curation
The classified data were curated following the protocol previously developed by our group.28-30 Salts and solvents were stripped from all compounds, and large organic mixtures and inorganic compounds were removed. Chemotypes were standardized using the ChemAxon Standardizer software (https://chemaxon.com/). Compounds with replicate runs in the individual assays of each campaign were analyzed. The final output of replicates was based on majority voting. When the median was equal to 0.5, i.e., complete disagreement, all entries associated with that chemical were removed.
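The replicate-handling step can be sketched as follows. For binary calls, the median equals the majority vote and takes the value 0.5 only when the replicates are evenly split. The compound identifiers and function names are hypothetical.

```python
from statistics import median


def resolve_replicates(calls):
    """Collapse replicate 0/1 calls for one compound; None means discard."""
    m = median(calls)
    if m == 0.5:  # even split between interfering and non-interfering
        return None
    return int(m)


def curate(replicate_calls):
    """Map compound ID -> list of 0/1 calls to compound ID -> consensus call."""
    consensus = {}
    for compound_id, calls in replicate_calls.items():
        label = resolve_replicates(calls)
        if label is not None:
            consensus[compound_id] = label
    return consensus


# Hypothetical compound IDs and replicate outcomes.
example = {"cmpd_A": [1, 1, 0], "cmpd_B": [0, 1], "cmpd_C": [0]}
print(curate(example))  # cmpd_B is dropped because its replicates disagree
```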
Molecular Descriptors
Whole-molecule RDKit descriptors and ECFP6-like circular fingerprints (Morgan) with 2,048 bits and atom radius of 3 were calculated using RDKit (http://www.rdkit.org) and implemented in KNIME.31 Both descriptor sets were merged to develop the models.
Chemical space analysis of interference data
In this section, we explore the chemical space of current interference data employing two analyses: (i) barycentric coordinates to determine whether compounds interfering in different assays share the same chemical space and (ii) structure-interference relationships (SIR) with comments on the most interesting cases. We plotted the barycentric coordinates of all the unique structures for thiol, redox, FLuc, and NLuc defined by Morgan fingerprints. Barycentric coordinates correspond to the location of the points of a simplex (a triangle, tetrahedron, etc.) in the space, defined by the vertices.32 In this case, a simplex is defined by all the fingerprints of a particular compound. Barycentric coordinates were determined using the Methods of Data Analysis module in the HiT QSAR software.33
Substructure and PAINS searching
The 480 PAINS alerts in SMARTS format were collected from RDKit. PAINS searching was carried out using RDKit after adding hydrogens to query molecules. A compound was considered as interfering by PAINS if one or more PAINS alerts were found in the compound. Frequent subgraph mining was carried out via the gSpan algorithm34 searching for substructures that were present in at least 10% of interfering compounds for a given dataset. Resulting substructures were converted to SMARTS and searched with the same protocol as PAINS. Substructure enrichment was defined as the number of matches in interfering compounds divided by the total number of matches for a given substructure.
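Once substructure matches are available, the enrichment defined above (and the fraction of interfering compounds that contain the substructure, i.e., its support) reduces to simple counting. The sketch below assumes the matching itself has already been done and takes precomputed per-compound flags; the function name and the toy data are hypothetical.

```python
def substructure_stats(has_substructure, is_interfering):
    """Return (found, support, enrichment) from parallel per-compound flags."""
    if len(has_substructure) != len(is_interfering):
        raise ValueError("inputs must have the same length")
    found = sum(1 for h in has_substructure if h)
    n_interfering = sum(1 for y in is_interfering if y)
    hits = sum(1 for h, y in zip(has_substructure, is_interfering) if h and y)
    # Support: fraction of interfering compounds that contain the substructure.
    support = hits / n_interfering if n_interfering else 0.0
    # Enrichment: matches in interfering compounds over all matches.
    enrichment = hits / found if found else 0.0
    return found, support, enrichment


# Toy data: 10 compounds, 4 interfering; the substructure occurs in
# 3 interfering compounds and 1 non-interfering compound.
has_sub = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(substructure_stats(has_sub, labels))
```

Under this definition, a substructure that is equally frequent in both classes of a balanced dataset has an enrichment of 0.5, and values near 1 indicate that almost every match is an interfering compound.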
QSIR modeling
QSIR models were built following the best practices for model development and validation.35 For model development, we employed the Random Forest (RF) algorithm36 implemented in KNIME.31 Trees were decorrelated by randomly bootstrapping compound instances used in modeling with replacement and selecting a random sample of root(N)-many features for each tree, where N is the total number of features available. Trees were configured to evaluate features on classification accuracy at the median value and to use the information gain ratio as the split criterion. Trees were not pruned.
A 5-fold external cross-validation procedure was employed using the following protocol.37 The full set of compounds with known experimental activity is randomly divided into five subsets of equal size. One of these subsets (20% of all compounds) is set aside as the external validation set, while the remaining four sets form the modeling set (80% of all compounds). This procedure is repeated five times, allowing each of the five subsets to be used as an external validation set.
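The splitting protocol described above can be sketched with the standard library alone. The function name and the fixed random seed are illustrative assumptions; when the number of compounds is not divisible by five, the folds differ in size by at most one compound.

```python
import random


def external_cv_splits(compound_ids, n_folds=5, seed=0):
    """Yield (modeling_set, external_set) pairs, one per fold."""
    ids = list(compound_ids)
    random.Random(seed).shuffle(ids)
    # Deal the shuffled compounds into n_folds nearly equal subsets.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for i, external in enumerate(folds):
        modeling = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield modeling, external


for modeling, external in external_cv_splits(range(100)):
    print(len(modeling), len(external))
```

Each compound appears in exactly one external set, so every prediction used for external validation comes from a model that never saw that compound during training.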
The applicability domain (AD) of the models was calculated as $D_{\mathrm{cutoff}} = \langle D \rangle + Zs$, where $Z$ is a similarity threshold parameter defined by the user (0.5 in this study) and $\langle D \rangle$ and $s$ are the average and standard deviation, respectively, of all Euclidean distances in the normalized multidimensional descriptor space between each compound and its nearest neighbors for all compounds in the training set.38 The AD defined the coverage of the models. The following statistical metrics were used to assess different aspects of the model performance (Equations 1–3):
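Balanced Accuracy (BA):
$$\mathrm{BA} = \frac{\mathrm{SE} + \mathrm{SP}}{2} \qquad (1)$$
Sensitivity (SE):
$$\mathrm{SE} = \frac{TP}{TP + FN} \qquad (2)$$
Specificity (SP):
$$\mathrm{SP} = \frac{TN}{TN + FP} \qquad (3)$$
where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives, respectively, with interfering compounds taken as the positive class.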
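As a worked illustration, the three metrics can be computed from binary labels using the standard confusion-matrix definitions, SE = TP/(TP + FN), SP = TN/(TN + FP), and BA = (SE + SP)/2, with interfering compounds (label 1) as the positive class. The function names and the example predictions are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 = interfering."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def balanced_accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2


# Hypothetical experimental outcomes and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(confusion_counts(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

Because BA averages the per-class recalls, it is not inflated by the large non-interfering class, which matters for datasets as imbalanced as those used here.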
Virtual profiling of in-house library
For prospective validation of the developed QSIR models, we performed virtual profiling of the NCATS in-house library consisting of 63,941 unique chemical compounds. We made predictions for the entire library using all the models and, for each assay, we selected 256 compounds, containing 128 hits inside the AD (64 predicted to interfere with the assay and 64 predicted not to interfere) and 128 outside the AD (64 predicted to interfere with the assay and 64 predicted not to interfere).
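The inside/outside AD assignment used for this selection follows from the distance cutoff, defined as the mean nearest-neighbor distance in the training set plus Z times its standard deviation (Z = 0.5). The sketch below is a simplified illustration under several assumptions that the text does not specify: a single nearest neighbor (k = 1), the sample standard deviation, descriptors that are already normalized, and hypothetical function names and toy coordinates.

```python
import math
from statistics import mean, stdev


def nearest_neighbor_distances(train, k=1):
    """Distances from each training compound to its k nearest training neighbors."""
    distances = []
    for i, x in enumerate(train):
        d = sorted(math.dist(x, y) for j, y in enumerate(train) if j != i)
        distances.extend(d[:k])
    return distances


def ad_cutoff(train, z=0.5, k=1):
    """Distance cutoff: mean nearest-neighbor distance plus z standard deviations."""
    d = nearest_neighbor_distances(train, k)
    return mean(d) + z * stdev(d)  # sample SD is an assumption


def inside_ad(query, train, cutoff):
    """A query is inside the AD if its nearest training compound is within the cutoff."""
    return min(math.dist(query, x) for x in train) <= cutoff


# Toy two-dimensional descriptor vectors (assumed already normalized).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (3.0, 3.0)]
cutoff = ad_cutoff(train)
print(round(cutoff, 3))
print(inside_ad((0.5, 0.5), train, cutoff), inside_ad((10.0, 10.0), train, cutoff))
```

A smaller Z gives a stricter domain with lower coverage, which is why the number of library compounds falling inside the AD depends directly on this parameter.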
Experimental validation
Due to the limited availability of physical samples, we were not able to plate all the virtual hit compounds, and the final number of tested compounds for each assay differed. We experimentally tested the following compounds:
216 compounds (102 inside the AD and 114 outside the AD) for thiol-reactivity;
200 compounds (89 inside the AD and 111 outside the AD) for redox activity;
209 compounds (97 inside the AD and 112 outside the AD) for NLuc;
204 compounds (93 inside the AD and 111 outside the AD) for FLuc.
The experimental validation of these compounds was executed following the same protocol used for qHTS data generation (vide supra).
The Liability Predictor web application
The QSIR models developed in this study have been implemented as a web application termed Liability Predictor (https://liabilitypredictor.mml.unc.edu/). The models for SCAMs, previously developed and implemented in the SCAM Detective application (http://scamdetective.mml.unc.edu/), are also included in this new tool. Liability Predictor is implemented using Flask (http://flask.pocoo.org), uWSGI (https://uwsgi-docs.readthedocs.org), Nginx (http://nginx.org), Python 3.9 (https://www.python.org), RDKit (http://www.rdkit.org), scikit-learn (http://scikit-learn.org), and JavaScript (http://www.ecma-international.org). Liability Predictor also includes the JSME molecule editor39 written in JavaScript, which is supported by the most popular web browsers.
Results and discussion
Cheminformatics analysis of interference data
In this work, we generated the largest curated and publicly available dataset of assay interference. We screened 5,098 compounds in four distinct HTS assays covering three mechanisms of interference with biological assays: thiol reactivity, redox activity, and luciferase interference (FLuc and NLuc). In addition, we incorporated two SCAM datasets.16
Since the compounds in the qHTS dataset generated in this study were tested under four different protocols (two of which were for luciferase), we assessed the overlap of interfering compounds between assays. In this analysis, we considered only the 4,790 compounds that, after curation, had valid datapoints in the four assays (cf. Data Curation section). As shown in Figure 2A, only one compound (Walrycin B/NCGC00371145–02) interfered with all assays. The FLuc and thiol assays exhibited the largest overlap, with 34 of the same interfering compounds. Interestingly, FLuc and NLuc, two orthogonal assays related to luciferase, had a low overlap of only nine compounds. The majority of compounds interfere with only one assay (Figure 2B), further justifying the need to develop assay-specific models.
Figure 2.
Number of overlapping compounds between assays. A) Number of overlapping compounds that interfere at least once. B) Overlap of interfering and non-interfering compounds.
In addition, 113 compounds from this dataset were also present in the β-lactamase dataset and 93 in the cruzain dataset. As shown in Table 3, there is little overlap between interfering compounds in these assays. There were 2 (11%), 3 (17%), 2 (11%), and 0 interfering compounds for thiol, redox, FLuc, and NLuc, respectively, that were also SCAMs in β-lactamase (cf. Table S2) and 2 (13%), 1 (7%), 2 (13%), and 2 (13%) interfering compounds for thiol, redox, FLuc, and NLuc, respectively, that were also SCAMs in cruzain (cf. Table S3).
Table 3.
Number of overlapping compounds between new interference datasets and SCAMs datasets.
| Assay | Class | β-lactamase: Aggregator | β-lactamase: Non-aggregator | Cruzain: Aggregator | Cruzain: Non-aggregator |
|---|---|---|---|---|---|
| Thiol | Interfering | 2 | 16 | 2 | 12 |
| Thiol | Non-interfering | 16 | 79 | 13 | 66 |
| Redox | Interfering | 3 | 6 | 1 | 3 |
| Redox | Non-interfering | 15 | 89 | 14 | 75 |
| FLuc | Interfering | 2 | 7 | 2 | 8 |
| FLuc | Non-interfering | 16 | 88 | 13 | 70 |
| NLuc | Interfering | 0 | 7 | 2 | 4 |
| NLuc | Non-interfering | 18 | 88 | 13 | 74 |
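The percentages quoted for the overlap appear to be taken relative to the number of aggregators among the shared compounds (18 for β-lactamase and 15 for cruzain). The short check below recomputes them from the aggregator columns of Table 3; the dictionary layout and function name are illustrative only.

```python
# (aggregator and interfering, aggregator and non-interfering) counts from Table 3.
TABLE3_AGGREGATORS = {
    "beta-lactamase": {"thiol": (2, 16), "redox": (3, 15), "FLuc": (2, 16), "NLuc": (0, 18)},
    "cruzain": {"thiol": (2, 13), "redox": (1, 14), "FLuc": (2, 13), "NLuc": (2, 13)},
}


def percent_interfering_among_aggregators(dataset, assay):
    """Percentage of shared aggregators that also interfere in the given assay."""
    interfering, non_interfering = TABLE3_AGGREGATORS[dataset][assay]
    return 100.0 * interfering / (interfering + non_interfering)


for dataset, assays in TABLE3_AGGREGATORS.items():
    for assay in assays:
        pct = percent_interfering_among_aggregators(dataset, assay)
        print(f"{dataset:15s} {assay:6s} {pct:5.1f}%")
```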
The chemical space consisting of the 4,790 tested compounds was analyzed by plotting the barycentric coordinates of all unique structures, which were encoded as Morgan fingerprints. Compared with other algorithms for analyzing chemical space, barycentric coordinates better handle dimensionality reduction of data whose decision boundaries are described by a non-linear function. As shown in Figure 3A, the interfering compounds (represented in multiple colors and shapes) share the same chemical space with most non-interfering ones (black circles). In Figure 3B we show the same compounds along with a sample of 10,000 representative compounds of the SCAM datasets (5,000 compounds of β-lactamase and 5,000 compounds of cruzain). Since the SCAM datasets were too large, a stratified sample of clusters was taken, selecting 2,500 aggregators and 2,500 non-aggregators in each SCAM dataset. This method was applied to check the structural diversity of compounds in all the datasets and whether compounds from the new interference assay datasets shared similar chemical space with those from the SCAM datasets as well as non-interfering compounds.
Figure 3.
Chemical space of investigated compounds in barycentric coordinates obtained from Morgan fingerprints. A) 4,790 compounds tested in thiol, redox, FLuc, and NLuc and B) these datasets together with a sample of 10,000 representative compounds of the SCAM datasets (5,000 compounds of β-lactamase and 5,000 compounds of cruzain).
The analyses presented in Figures 2–3 show that, for the dataset of 4,790 compounds, the compounds tend not to interfere with all the assays (Figure 2). In addition, interfering and non-interfering compounds share the same chemical space (Figure 3). The sharing of chemical space may explain why generating QSAR models for these data has been challenging and, in addition, why structural alerts, like PAINS, tend to be oversensitive.
SIR analysis
Another important aspect of screening for interference compounds is understanding the mechanism behind the possible interference. While this is a challenging task, one approach is to determine whether there are structural similarities between interfering compounds. This works under the assumption that a substructure present in many compounds implies a consistent mechanism of interference, providing insight into what could be causing the interference.
Starting with the 480 substructures from PAINS, none show significant abundance in the FLuc, NLuc, or redox activity datasets. However, three are present in more than 10 compounds out of the 1,009 interfering compounds in the thiol reactivity dataset (Figure 4A). Of those, only catechol showed significant enrichment toward thiol reactivity interference, being present in 34 interfering compounds and in only 1 non-interfering compound, while the other two substructures were present in roughly equal numbers of interfering and non-interfering compounds. Both catechol and phenylenediamine derivatives may share similar mechanisms, as they can form electrophiles that react with nucleophilic free thiol groups. The reactivity may vary depending on the substituents attached around these moieties. In general, phenylenediamine derivatives are less prone to oxidation than catechol derivatives, which can shift their distribution between the interfering and non-interfering classes. In addition, certain indole derivatives with electrophilic warheads or masked electrophilic groups in the form of 5-hydroxy substitution (e.g., NCGC00015526) may exert thiol reactivity similar to that of catechols. In certain cases, indole derivatives containing free thiol groups (e.g., NCGC00092372) can form disulfides with the substrate. There are some cases where large, complex PAINS substructures are present in only a single interfering compound and in no non-interfering compounds. However, with the substructures being so large and the datasets being so small, this could easily be an artifact of noise in the dataset. Therefore, these also offer little insight into a common mechanism of interference. Looking beyond PAINS, frequent chemical substructures were mined from the list of interfering compounds using frequent subgraph mining. The most frequent substructures obtained via this method were basic building blocks of drug-like compounds: chains of carbon, alcohols, amines, and rings.
This result was expected, and these substructures were ignored (Figure 4B). Unsurprisingly, these substructures are not enriched toward interfering compounds and are therefore not useful in determining an interference mechanism. Catechol is the only substructure with both significant enrichment and a notable occurrence rate and thus may provide insight into a mechanism. However, it is present in only around 3% of the interfering compounds in the thiol reactivity dataset, so any investigated mechanism would not be common enough to help with the task of predicting interference for all chemicals. From this we conclude that any “interpretability” of substructure alerts is likely to be weak, if present at all, and sacrificing performance to gain such interpretability is not a desirable tradeoff. A more extensive study might investigate combinations of smaller frequent substructures to explain interference, at the risk of making a mechanism more complicated to extract, but we leave this experiment for future work.
Figure 4.
Examples of substructures from the thiol reactivity dataset. A) The three PAINS substructures found to occur in more than 1% of interfering compounds. B) Three examples of frequent subgraphs. “Found” is the number of compounds that contain the given substructure, “support” is the percentage of interfering compounds that contain that substructure, and “enrichment” is the number of confirmed interfering hits divided by the total number of hits for that substructure. An enrichment of 0.5 implies that the substructure is as frequent in non-interfering as in interfering compounds.
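The “support” and “enrichment” quantities defined in the caption can be illustrated with a minimal sketch. The function names are ours and purely illustrative; the catechol counts (34 interfering, 1 non-interfering) come from the text, while the counts passed to `support` are hypothetical because the total number of interfering compounds is not restated here.

```python
def support(n_interfering_with_sub: int, n_interfering_total: int) -> float:
    """Percentage of interfering compounds that contain the substructure."""
    return 100.0 * n_interfering_with_sub / n_interfering_total


def enrichment(n_interfering_with_sub: int, n_noninterfering_with_sub: int) -> float:
    """Fraction of the compounds containing the substructure that are interfering."""
    found = n_interfering_with_sub + n_noninterfering_with_sub
    return n_interfering_with_sub / found


# Catechol counts reported in the text: 34 interfering, 1 non-interfering.
print(f"catechol enrichment = {enrichment(34, 1):.2f}")  # 34 / 35, about 0.97

# A substructure split evenly between the two classes has enrichment 0.5.
print(enrichment(10, 10))  # 0.5

# Hypothetical counts: 5 of 200 interfering compounds carry the substructure.
print(support(5, 200))  # 2.5 (percent)
```

An enrichment close to 1 combined with a low support, as for catechol, describes a substructure that is informative when present but covers too few interfering compounds to serve as a general predictor.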
QSIR modeling and experimental validation
The statistical characteristics for the FLuc, NLuc, thiol reactivity, and redox activity interference models and their experimental validation are shown in Tables 4–7. Since the data for all endpoints except thiol reactivity were extremely imbalanced, we employed a multi-undersampling approach. We split the larger, non-interfering class into multiple folds of the same size as the interfering class. Each model was developed using 5-fold external cross-validation, and the remaining non-interfering compounds were used as a test set for validation. Results for the models are shown with a calculated standard deviation. All models except redox activity had high external predictive power when evaluated by both 5-fold external cross-validation and the external sets kept aside during each of the multi-undersampling rounds. In addition, models were submitted to 20 rounds of y-randomization to verify that their predictivity was not due to chance. Lastly, all models were validated experimentally, presenting similar or better BA (cf. Tables 4–7). As described in Materials and Methods, some selected compounds were not available as physical samples, so the final number of tested compounds was slightly smaller than the number of virtual hit compounds selected for testing.
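The splitting step of the multi-undersampling approach can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name, the random shuffling, the fixed seed, and the choice to hold out every non-interfering compound not used in a given round are all hypothetical, and the compound identifiers are toy data.

```python
import random


def multi_undersample(minority, majority, seed=0):
    """Pair the full minority class with disjoint, equally sized majority chunks.

    Returns one (minority, majority_chunk, held_out_majority) tuple per round.
    The held-out part contains only majority-class (non-interfering) compounds.
    """
    k = len(minority)
    if k == 0:
        raise ValueError("minority class must not be empty")
    shuffled = list(majority)
    random.Random(seed).shuffle(shuffled)
    rounds = []
    for i in range(len(shuffled) // k):
        chunk = shuffled[i * k:(i + 1) * k]
        # Assumption: everything outside the current chunk is kept aside.
        held_out = shuffled[:i * k] + shuffled[(i + 1) * k:]
        rounds.append((list(minority), chunk, held_out))
    return rounds


# Hypothetical toy data: 4 interfering vs. 14 non-interfering compound IDs.
interfering = [f"I{i}" for i in range(4)]
non_interfering = [f"N{i}" for i in range(14)]

rounds = multi_undersample(interfering, non_interfering)
print(len(rounds))        # 3 balanced rounds (14 // 4)
print(len(rounds[0][1]))  # 4 non-interfering compounds per balanced set
print(len(rounds[0][2]))  # 10 non-interfering compounds kept aside
```

Because the kept-aside compounds all belong to the non-interfering class, only specificity can be computed on them, which is consistent with the N/A entries for BA and SE in the “ext.” rows of the tables.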
Table 5.
Statistical characteristics of 27 multi-under sampling QSIR models developed for redox activity using 5-fold external cross-validation. Experimental validation was performed with 89 compounds within the AD of the model and 111 outside the AD.
| Model/Assay | BA | SE | SP |
|---|---|---|---|
| Redox activity (5-fold) | 0.62 | 0.55 | 0.70 |
| Redox activity (ext.)* | N/A | N/A | 0.70 |
| Redox activity (exp. validation AD) | 0.73 | 0.88 | 0.58 |
| Redox activity (exp. validation) | 0.78 | 1.00 | 0.55 |
*Remaining data after balancing in multi-undersampling approach. The data contains only non-interfering compounds.
Table 4.
Statistical characteristics of the balanced QSIR model developed for thiol reactivity following a 5-fold external cross-validation. Experimental validation was performed with 102 compounds within the AD of the model and 114 outside the AD.
| Model/Assay | BA | SE | SP |
|---|---|---|---|
| Thiol reactivity (5-fold) | 0.70 | 0.66 | 0.74 |
| Thiol reactivity (ext.)* | N/A | N/A | 0.84 |
| Thiol reactivity (exp. Validation AD) | 0.78 | 0.90 | 0.66 |
| Thiol reactivity (exp. Validation) | 0.64 | 0.71 | 0.56 |
*Remaining data after balancing in multi-undersampling approach. The data contains only non-interfering compounds.
Table 7.
Statistical characteristics of 47 multi-undersampling QSIR models developed for NLuc following a 5-fold external cross-validation. Experimental validation was performed with 97 compounds within the AD of the model and 112 outside the AD.
| Model/Assay | BA | SE | SP |
|---|---|---|---|
| NLuc (5-fold) | 0.75 | 0.87 | 0.63 |
| NLuc (ext.)* | N/A | N/A | 0.63 |
| NLuc (exp. Validation AD) | 0.58 | 0.75 | 0.41 |
| NLuc (exp. Validation) | 0.37 | 0.29 | 0.46 |
*Remaining data after balancing in multi-undersampling approach. The data contains only non-interfering compounds.
Thiol reactivity modeling and experimental validation
We did not employ the multi-undersampling approach for the thiol reactivity dataset as we did with the others, because the ratio between the two class sizes was 1:3. Instead, we followed a chemically rational undersampling approach described previously.40 This model presents a BA = 70%, SE = 66%, and SP = 74% (see Table 4). Experimental validation demonstrated that compounds within the AD were predicted with a higher BA of 78% and SE of 90%.
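For reference, the relationship between SE, SP, and BA used throughout the tables can be written as a short sketch. The confusion-matrix counts below are hypothetical and were chosen only so that they reproduce the reported SE = 66% and SP = 74%; they are not the actual counts from the thiol reactivity model.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of interfering compounds correctly predicted as interfering."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-interfering compounds correctly predicted as non-interfering."""
    return tn / (tn + fp)


def balanced_accuracy(se: float, sp: float) -> float:
    """Mean of sensitivity and specificity."""
    return (se + sp) / 2


# Hypothetical counts (not from the study) matching SE = 0.66 and SP = 0.74.
tp, fn, tn, fp = 66, 34, 74, 26
se = sensitivity(tp, fn)
sp = specificity(tn, fp)
ba = balanced_accuracy(se, sp)
print(f"SE={se:.2f} SP={sp:.2f} BA={ba:.2f}")  # SE=0.66 SP=0.74 BA=0.70
```

Because BA averages the two class-wise rates, it is not inflated by the size of the larger, non-interfering class.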
Redox-activity modeling and experimental validation
The RCC dataset was challenging to model, and our best models had an SE of 55%. However, since the BA was 62% and the SP was 70% (cf. Table 5), we proceeded with the experimental validation of these models. The models built using the multi-undersampling balancing approach demonstrated low SE, which can be explained by the dataset containing only a small number of interfering compounds. Experimental validation of 89 compounds within the AD showed a BA approximately 11 percentage points higher (73% vs. 62%). The SP was lower at 58% compared to the original 70%, but the SE increased to 88%.
FLuc modeling and experimental validation
The statistical characteristics for the FLuc models are shown in Table 6. The 5-fold CV model displayed BA = 78%, SE = 86%, and SP = 70%. The multi-undersampling approach allowed us to develop 38 predictive models without losing any information from the much larger, non-interfering class. The experimental validation for FLuc presented a similar BA. The SE of the experimental validation was 100%, which was higher than the 5-fold CV’s 86%. In contrast, the SP of the experimental validation was 44%, which was lower than the 5-fold CV’s 70%.
Table 6.
Statistical characteristics of 38 multi-undersampling QSIR models developed for FLuc following a 5-fold external cross-validation. Experimental validation was performed with 93 compounds within the AD of the model and 111 outside the AD.
| Model/Assay | BA | SE | SP |
|---|---|---|---|
| FLuc (5-fold) | 0.78 | 0.86 | 0.70 |
| FLuc (ext.)* | N/A | N/A | 0.72 |
| FLuc (exp. Validation AD) | 0.72 | 1.00 | 0.44 |
| FLuc (exp. Validation) | 0.59 | 0.67 | 0.52 |
*Remaining data after balancing in multi-undersampling approach. The data contains only non-interfering compounds.
NLuc modeling and experimental validation
The NLuc QSIR model displayed a 5-fold CV BA = 0.75 (Table 7) and was the only endpoint with an experimental validation BA lower than 60%. The training set for this endpoint had a smaller number of interfering compounds (n = 97). Although the multi-undersampling approach with 47 models showed high accuracy (BA = 75%, SE = 87%, and SP = 63%), the models did not generalize well for the experimental data, with an average BA = 58%.
SCAMs modeling
Previously, we reported the development of QSAR models to predict SCAMs.16 These models were implemented in a web application termed SCAM Detective (https://scamdetective.mml.unc.edu/). This web application will be maintained, but to provide users with a comprehensive assay liability application, we reimplemented these models in the Liability Predictor along with the other four models described in this manuscript. The SCAM Detective models showed balanced accuracies of 66–77% in distinguishing aggregators from non-aggregators. The complete details about these models are available elsewhere.16
Comparison of QSIR models and PAINS
To compare our QSIR models to structural alert predictions based on PAINS, we used PAINS fragments to screen all the datasets reported in this study as well as the much larger datasets for β-lactamase and cruzain assays used in SCAM Detective.16 The performance of PAINS was consistent across all six datasets, except for the BA for thiol reactivity (Table 8). Most significantly, PAINS alerts have an SP near 97% and an SE of ~7% in all cases, resulting in balanced accuracies of around 52%. Thus, PAINS filters correctly classified most non-interfering compounds but failed to detect 90% or more of truly interfering compounds. Virtual liability screening aims to detect as many truly interfering compounds as possible while minimizing the number of falsely detected artifacts. For the assays utilized in this study, PAINS failed to carry out both tasks successfully. Meanwhile, our QSIR models showed better performance, as their higher SE (achieved at the expense of SP) suggests they can detect around 55–80% of interfering compounds.
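The practical consequence of this SE/SP tradeoff can be illustrated with a small calculation. The hit list below (100 interfering and 900 non-interfering compounds) is hypothetical, and the function name is ours; the SE and SP values are the approximate PAINS figures quoted above and the FLuc 5-fold values from Table 6.

```python
def expected_triage(n_interfering: int, n_clean: int, se: float, sp: float):
    """Expected numbers of caught interferers, missed interferers, and false alarms."""
    caught = se * n_interfering
    missed = n_interfering - caught
    false_alarms = (1.0 - sp) * n_clean
    return caught, missed, false_alarms


# Hypothetical hit list: 100 interfering and 900 non-interfering compounds.
filters = [
    ("PAINS (approx.)", 0.07, 0.97),
    ("QSIR FLuc (5-fold)", 0.86, 0.70),
]
for name, se, sp in filters:
    caught, missed, false_alarms = expected_triage(100, 900, se, sp)
    print(f"{name}: caught={caught:.0f} missed={missed:.0f} "
          f"false alarms={false_alarms:.0f}")
```

Under these assumptions, the PAINS-like filter would miss about 93 of the 100 interfering compounds while raising about 27 false alarms, whereas the QSIR-like filter would miss about 14 at the cost of about 270 false alarms that secondary screening would need to resolve.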
Table 8.
BA of PAINS alerts vs. QSIR external performance (5-fold external cross validation). Bolded entries are at least 0.1 higher than their counterpart.
| Assay | PAINS (BA 5-fold) | QSIR (BA 5-fold) | PAINS (BA test set) | QSIR (BA test set) |
|---|---|---|---|---|
| Beta Lactamase | 0.50 | 0.72 | 0.94 | 0.73 |
| Cruzain | 0.51 | 0.70 | 0.95 | 0.69 |
| Thiol reactivity | 0.55 | 0.70 | 0.70 | 0.78 |
| Redox activity | 0.52 | 0.62 | 0.55 | 0.73 |
| FLuc | 0.51 | 0.78 | 0.52 | 0.72 |
| NLuc | 0.51 | 0.75 | 0.53 | 0.58 |
Challenges and limitations in modeling interference assays
A major limitation to using structural alerts, like PAINS, for assay interference screening is their “one size fits all” approach. Various assays often have different mechanisms that result in interference, and thus a compound containing a substructure which might cause interference in one assay will not necessarily cause interference in all others (Figure 2). Additionally, the 480 substructures in the PAINS dataset do not span all assays and could be biased towards a few selected assays. The benefit of QSIR is that the models are assay specific.
Experimental validation demonstrated that our models detected interference better than PAINS. However, while an improvement on PAINS, the QSIR models still have flaws. Like PAINS, the QSIR models overpredict compounds as interfering, which could preclude further development of promising candidates. QSIR still has an advantage here, though, as secondary screening of flagged compounds can reduce the number of false positives. It is much harder to fix the false negatives that PAINS suffers from without rescreening most of the original library with secondary assays. If false positives are not a concern, secondary screening of flagged compounds can be skipped. However, in general, it is good practice to maximize performance of interference screening, no matter the approach. As such, all flagged compounds should be further investigated for assay interference to help reduce the false positive rate. Any orthogonal assay meant to detect the interference in question can be used to carry out this secondary screening on flagged compounds. Any compounds being pursued beyond initial screening should be screened for interference regardless of model prediction. QSIR models, and interference prediction in general, are most useful in cases where detecting interference is a higher priority than missing true hits, as is the case for most high-throughput early-stage discovery campaigns. The QSIR models developed here are more effective than PAINS in accomplishing this goal.
Interpretability may be seen as a major benefit of substructural approaches, like PAINS, over machine learning approaches, like QSIR. As stated earlier, structural alerts imply a mechanistic link between the structure of a compound and its interference. No such implication is guaranteed to be present in the QSIR models developed by our group, or in the machine learning models developed by other groups that predict assay interference potential. While these models have shown better performance in classifying interference, this comes at the cost of interpretation. However, as discussed earlier, our results herein and in our previous work18 suggest that the interpretation that can come out of structural alert methods is typically weak. Thus, the true cost of losing “interpretability” by moving away from structural alerts and toward QSIR models is quite low, as all methods inherently struggle with interpretability. Moreover, some interference modeling approaches, like Luciferase Advisor,21 SCAM Detective,16 and InterPred,22 attempt to predict specific interference mechanisms, possibly nullifying the advantage of PAINS even further. Still, methods to make QSIR models more interpretable are difficult to implement and often lack robust experimental study and validation. Consequently, models like Hit Dexter 2.041 and those from this paper tend not to make predictions about specific mechanisms of assay interference. Developing methods that retain high classification performance and are also interpretable is an important next step for QSIR model development and for interference prediction in general.
Despite all the efforts to develop models and tools, predicting assay interference remains a difficult task. To the best of our knowledge, none of the tools published in the literature were validated in prospective experimental studies. The QSIR models we developed, except for NLuc, showed good accuracy during experimental validation and maintained high SE. Applying the AD reduced the coverage but increased the BA of the models by 5–21%. These results highlight the importance of defining the AD of QSAR models, which is corroborated by previous studies.42
The HTS interference datasets used in this study were difficult to model, both due to the small number of interfering compounds and the nature of interference itself. For example, thiol and redox activities are nonspecific, and compounds with diverse chemical structures (see the previous section) can interfere through these mechanisms. Furthermore, luciferase binding also seems promiscuous. Despite these challenges, the models developed and validated in this study offer the best current solution for interference prediction and should be used instead of alert-based approaches, such as PAINS filters, which are far more likely to miss interfering compounds. Our results show that, although not all the models pass the threshold of acceptable statistical characteristics (≥ 60%), they show high experimental BA and SE and therefore can be employed to identify non-interfering compounds (use of AD will increase the confidence of the prediction) with state-of-the-art reliability.
Liability Predictor implementation and usage
The user-friendly Liability Predictor web application implements the externally validated QSIR models developed in this study, as well as the QSIR models of SCAMs that we previously reported, allowing for the identification of assay interfering and non-interfering compounds.16 The web app does not require the user to have computational or programming skills. It provides binary classification predictions (interfering vs. non-interfering) for thiol reactivity, redox activity, FLuc and NLuc inhibition, and the SCAM profile for β-lactamase and cruzain activity. In addition, the user can visualize a map illustrating the contribution of chemical fragments to the predicted interference profile.43 The chemical fragments predicted to reduce the interference profile are shown in green, and those predicted to increase it are shown in magenta. The gray isolines separate positive and negative contributions. We successfully implemented this technology in our previous studies.44–47 To submit predictions, the user can either draw a molecule of interest in the JSME Molecule Editor box39 or paste the SMILES string into the appropriate area. The source code for the web app is available at https://github.com/jimmyjbling/LiabilityPredictor.
Conclusions
The root cause(s) of false positives, especially for nonspecific thiol reactivity and redox activity, remain undiscovered and may continue to plague drug discovery efforts well into the future. Promiscuity is certainly a hallmark of assay interference, and HTS hits predicted as assay-interfering compounds should be scrutinized as potential false positives. Developing computational models based on mechanisms of assay interference remains a critical challenge; however, mechanistic insight may be crucial for assay design, hit-to-lead optimization, and go/no-go decisions. Here, with experimentally validated QSIR models of thiol, redox, and luciferase interference, we have demonstrated the utility of predicting potential interference compounds based on their mechanism of interference, rather than the “all-or-none” approach taken by PAINS-like structural alerts. All the models and curated datasets developed in this study are implemented in the free Liability Predictor web portal, publicly available at https://liability.mml.unc.edu/.
Supplementary Material
Acknowledgments
The UNC team was supported in part by the NIH National Institute of General Medical Sciences of the National Institutes of Health under Award Numbers R01GM140154 and T32GM135122 and National Institute of Allergy and Infectious Diseases of the National Institutes of Health under Award Number U19AI171292, and the NCATS team was supported by the Intramural Research Program of the National Center for Advancing Translational Sciences (NCATS), NIH. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. VA is currently an employee at Takeda Pharmaceuticals, San Diego, CA. S.C. is currently an employee at Vertex Pharmaceuticals, Boston, MA.
Abbreviations
- AD: applicability domain
- BA: balanced accuracy
- CV: cross-validation
- DSF: differential scanning fluorimetry
- ECFP: extended connectivity fingerprints
- FLuc: firefly luciferase
- HTS: high-throughput screening
- NLuc: nanoluciferase
- PAINS: Pan-Assay INterference compoundS
- QSIR: Quantitative Structure-Interference Relationship
- RCCs: redox cycling compounds
- SE: sensitivity
- SP: specificity
- TRCs: thiol-reactive compounds
Footnotes
Conflicts of Interest
AT, VMA, and ENM are co-founders of Predictive, LLC, which develops computational methodologies and software for toxicity prediction. All other authors declare they have nothing to disclose.
Associated Content
Supporting information
Description of NCATS curves class (supplementary_figures), PubChem assay IDs for generated data, SCAM liabilities for interfering compounds (supplementary_tables). Curated SDF datasets of the training data and experimental results are also provided in dataset folder of Assay_liabilities_Supplementary_Info.zip.
References
- (1).Macarron R; Banks MN; Bojanic D; Burns DJ; Cirovic DA; Garyantes T; Green DVS; Hertzberg RP; Janzen WP; Paslay JW; Schopfer U; Sittampalam GS Impact of High-Throughput Screening in Biomedical Research. Nat Rev Drug Discov 2011, 10 (3), 188–195. 10.1038/nrd3368.
- (2).Thorne N; Auld DS; Inglese J. Apparent Activity in High-Throughput Screening: Origins of Compound-Dependent Assay Interference. Curr Opin Chem Biol 2010, 14 (3), 315–324. 10.1016/j.cbpa.2010.03.020.
- (3).Yang Z-Y; He J-H; Lu A-P; Hou T-J; Cao D-S Frequent Hitters: Nuisance Artifacts in High-Throughput Screening. Drug Discov Today 2020, 25 (4), 657–667. 10.1016/j.drudis.2020.01.014.
- (4).Simeonov A; Jadhav A; Thomas CJ; Wang Y; Huang R; Southall NT; Shinn P; Smith J; Austin CP; Auld DS; Inglese J. Fluorescence Spectroscopic Profiling of Compound Libraries. J Med Chem 2008, 51 (8), 2363–2371. 10.1021/jm701301m.
- (5).McCallum MM; Nandhikonda P; Temmer JJ; Eyermann C; Simeonov A; Jadhav A; Yasgar A; Maloney D; Arnold A. (Leggy). High-Throughput Identification of Promiscuous Inhibitors from Screening Libraries with the Use of a Thiol-Containing Fluorescent Probe. J Biomol Screen 2013, 18 (6), 705–713. 10.1177/1087057113476090.
- (6).Coussens NP; Auld D; Roby P; Walsh J; Baell JB; Kales S; Hadian K; Dahlin JL Compound-Mediated Assay Interferences in Homogeneous Proximity Assays. Assay Guidance Manual. https://www.ncbi.nlm.nih.gov/books/NBK553584/ (accessed 2022-04-26).
- (7).Dahlin JL; Baell J; Walters MA Assay Interference by Chemical Reactivity. Assay Guidance Manual [Internet]. http://www.ncbi.nlm.nih.gov/pubmed/26561694 (accessed 2022-04-28).
- (8).Dahlin JL; Nissink JWM; Strasser JM; Francis S; Higgins L; Zhou H; Zhang Z; Walters MA PAINS in the Assay: Chemical Mechanisms of Assay Interference and Promiscuous Enzymatic Inhibition Observed during a Sulfhydryl-Scavenging HTS. J Med Chem 2015, 58 (5), 2091–2113. 10.1021/jm5019093.
- (9).Golz S. Reporter Genes in Cell Based Ultra High Throughput Screening. In Bioengineering in Cell and Tissue Research; Artmann, G. M., Chien S, Eds.; Springer Berlin; Heidelberg: Berlin, Heidelberg, 2008; pp 3–22. 10.1007/978-3-540-75409-1.
- (10).Fan F; Wood KV Bioluminescent Assays for High-Throughput Screening. Assay Drug Dev Technol 2007, 5 (1), 127–136. 10.1089/adt.2006.053.
- (11).Yonchev D; Bajorath J. Inhibitor Bias in Luciferase-Based Luminescence Assays. Future Sci OA 2020, 6 (8), FSO594. 10.2144/fsoa-2020-0081.
- (12).McGovern SL; Helfand BT; Feng B; Shoichet BK A Specific Mechanism of Nonspecific Inhibition. J Med Chem 2003, 46 (20), 4265–4272. 10.1021/jm030266r.
- (13).Reker D; Bernardes GJL; Rodrigues T. Computational Advances in Combating Colloidal Aggregation in Drug Discovery. Nat Chem 2019, 11 (5), 402–418. 10.1038/s41557-019-0234-9.
- (14).Duan D; Torosyan H; Elnatan D; McLaughlin CK; Logie J; Shoichet MS; Agard DA; Shoichet BK Internal Structure and Preferential Protein Binding of Colloidal Aggregates. ACS Chem Biol 2017, 12 (1), 282–290. 10.1021/acschembio.6b00791.
- (15).Thorne N; Auld DS; Inglese J. Apparent Activity in High-Throughput Screening: Origins of Compound-Dependent Assay Interference. Curr Opin Chem Biol 2010, 14 (3), 315–324. 10.1016/j.cbpa.2010.03.020.
- (16).Alves VM; Capuzzi SJ; Braga RC; Korn D; Hochuli JE; Bowler KH; Yasgar A; Rai G; Simeonov A; Muratov EN; Zakharov A. v.; Tropsha A. SCAM Detective: Accurate Predictor of Small, Colloidally Aggregating Molecules. J Chem Inf Model 2020, 60 (8), 4056–4063. 10.1021/acs.jcim.0c00415.
- (17).Baell JB; Holloway GA New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays. J Med Chem 2010, 53 (7), 2719–2740. 10.1021/jm901137j.
- (18).Capuzzi SJ; Muratov EN; Tropsha A. Phantom PAINS: Problems with the Utility of Alerts for Pan-Assay INterference CompoundS. J Chem Inf Model 2017, 57 (3), 417–427. 10.1021/acs.jcim.6b00465.
- (19).Senger MR; Fraga CAM; Dantas RF; Silva FP Filtering Promiscuous Compounds in Early Drug Discovery: Is It a Good Idea? Drug Discov Today 2016, 00 (00), 1–5. 10.1016/j.drudis.2016.02.004.
- (20).Alves VM; Muratov EN; Capuzzi SJ; Politi R; Low Y; Braga RC; Zakharov A. v.; Sedykh A; Mokshyna E; Farag S; Andrade CH; Kuz’min VE; Fourches D; Tropsha A. Alarms about Structural Alerts. Green Chem 2016, 18 (16), 4348–4360. 10.1039/C6GC01492E.
- (21).Ghosh D; Koch U; Hadian K; Sattler M; Tetko IV Luciferase Advisor: High-Accuracy Model To Flag False Positive Hits in Luciferase HTS Assays. J Chem Inf Model 2018, 58 (5), 933–942. 10.1021/acs.jcim.7b00574.
- (22).Borrel A; Mansouri K; Nolte S; Saddler T; Conway M; Schmitt C; Kleinstreuer NC InterPred: A Webtool to Predict Chemical Autofluorescence and Luminescence Interference. Nucleic Acids Res 2020, Just accep. 10.1093/nar/gkaa378.
- (23).NCATS. NPACT Chemical Library — Innovative Chemical Biology Library for Translational Sciences | National Center for Advancing Translational Sciences. https://ncats.nih.gov/preclinical/core/compound/npact (accessed 2022-03-31).
- (24).Duan D; Torosyan H; Elnatan D; McLaughlin CK; Logie J; Shoichet MS; Agard DA; Shoichet BK Internal Structure and Preferential Protein Binding of Colloidal Aggregates. ACS Chem Biol 2017, 12 (1), 282–290. 10.1021/acschembio.6b00791.
- (25).Auld DS; Inglese J; Dahlin JL Assay Interference by Aggregation. Assay Guidance Manual. https://www.ncbi.nlm.nih.gov/books/NBK442297/.
- (26).Feng BY; Simeonov A; Jadhav A; Babaoglu K; Inglese J; Shoichet BK; Austin CP A High-Throughput Screen for Aggregation-Based Inhibition in a Large Compound Library. J Med Chem 2007, 50 (10), 2385–2390. 10.1021/jm061317y.
- (27).Jadhav A; Ferreira RS; Klumpp C; Mott BT; Austin CP; Inglese J; Thomas CJ; Maloney DJ; Shoichet BK; Simeonov A. Quantitative Analyses of Aggregation, Autofluorescence, and Reactivity Artifacts in a Screen for Inhibitors of a Thiol Protease. J Med Chem 2010, 53 (1), 37–51. 10.1021/jm901070c.
- (28).Fourches D; Muratov E; Tropsha A. Curation of Chemogenomics Data. Nat Chem Biol 2015, 11 (8), 535–535. 10.1038/nchembio.1881.
- (29).Fourches D; Muratov E; Tropsha A. Trust, but Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J Chem Inf Model 2010, 50 (7), 1189–1204. 10.1021/ci100176x.
- (30).Fourches D; Muratov E; Tropsha A. Trust, but Verify II: A Practical Guide to Chemogenomics Data Curation. J Chem Inf Model 2016, 56 (7), 1243–1252. 10.1021/acs.jcim.6b00129.
- (31).Berthold MR; Cebron N; Dill F; Gabriel TR; Kötter T; Meinl T; Ohl P; Sieb C; Thiel K; Wiswedel B. KNIME: The Konstanz Information Miner. In Studies in Classification, Data Analysis, and Knowledge Organization; Gaul W, Vichi M, Weihs C, Eds.; Springer, Berlin, Heidelberg, 2008; pp 319–326. 10.1007/978-3-540-78246-9_38.
- (32).Vityuk N; Voskresenskaja E; Kuz’min V. The Synergism of Methods Barycentric Coordinates and Trend-Vector for Solution ―Structure-Property Tasks. Pattern Recognition and Image Analysis 1999, 3, 521–528.
- (33).Kuz’min VE; Artemenko AG; Muratov EN Hierarchical QSAR Technology Based on the Simplex Representation of Molecular Structure. J Comput Aided Mol Des 2008, 22 (6–7), 403–421. 10.1007/s10822-008-9179-6.
- (34).Yan X; Han J. GSpan: Graph-Based Substructure Pattern Mining. Proceedings - IEEE International Conference on Data Mining, ICDM; 2002, 721–724. 10.1109/ICDM.2002.1184038.
- (35).Tropsha A. Best Practices for QSAR Model Development, Validation, and Exploitation. Mol Inform 2010, 29 (6–7), 476–488. 10.1002/minf.201000061.
- (36).Breiman L. Random Forests. Mach Learn 2001, 45 (1), 5–32. 10.1023/A:1010933404324.
- (37).Tropsha A; Gramatica P; Gombar V. The Importance of Being Earnest: Validation Is the Absolute Essential for Successful Application and Interpretation of QSPR Models. QSAR Comb Sci 2003, 22 (1), 69–77. 10.1002/qsar.200390007.
- (38).Golbraikh A; Shen M; Xiao Z; Xiao Y-D; Lee K-H; Tropsha A. Rational Selection of Training and Test Sets for the Development of Validated QSAR Models. J Comput Aided Mol Des 2003, 17 (2–4), 241–253. 10.1023/A:1025386326946.
- (39).Bienfait B; Ertl P. JSME: A Free Molecule Editor in JavaScript. J Cheminform 2013, 5 (5), 24. 10.1186/1758-2946-5-24.
- (40).Alves VM; Muratov E; Fourches D; Strickland J; Kleinstreuer N; Andrade CH; Tropsha A. Predicting Chemically-Induced Skin Reactions. Part I: QSAR Models of Skin Sensitization and Their Application to Identify Potentially Hazardous Compounds. Toxicol Appl Pharmacol 2015, 284 (2), 262–272. 10.1016/j.taap.2014.12.014.
- (41).Stork C; Chen Y; Šícho M; Kirchmair J. Hit Dexter 2.0: Machine-Learning Models for the Prediction of Frequent Hitters. J Chem Inf Model 2019, 59 (3), 1030–1043. 10.1021/acs.jcim.8b00677.
- (42).Alves VM; Muratov EN; Zakharov A; Muratov NN; Andrade CH; Tropsha A. Chemical Toxicity Prediction for Major Classes of Industrial Chemicals: Is It Possible to Develop Universal Models Covering Cosmetics, Drugs, and Pesticides? Food and Chemical Toxicology 2018, 112, 526–534. 10.1016/j.fct.2017.04.008.
- (43).Riniker S; Landrum G. a. Similarity Maps - A Visualization Strategy for Molecular Fingerprints and Machine-Learning Methods. J Cheminform 2013, 5 (9), 43. 10.1186/1758-2946-5-43.
- (44).Kuz’min VE; Muratov EN; Artemenko AG; Gorb L; Qasim M; Leszczynski J. The Effects of Characteristics of Substituents on Toxicity of the Nitroaromatics: HiT QSAR Study. J Comput Aided Mol Des 2008, 22 (10), 747–759. 10.1007/s10822-008-9211-x.
- (45).Capuzzi SJ; Kim IS-J; Lam WI; Thornton TE; Muratov EN; Pozefsky D; Tropsha A. Chembench: A Publicly Accessible, Integrated Cheminformatics Portal. J Chem Inf Model 2017, 57 (2), 105–108. 10.1021/acs.jcim.6b00462.
- (46).Kuz’min VE; Muratov EN; Artemenko AG; Varlamova EV; Gorb L; Wang J; Leszczynski J. Consensus QSAR Modeling of Phosphor-Containing Chiral AChE Inhibitors. QSAR Comb Sci 2009, 28 (6–7), 664–677. 10.1002/qsar.200860117.
- (47).Melo-Filho CC; Dantas RF; Braga RC; Neves BJ; Senger MR; Valente WCG; Rezende-Neto JM; Chaves WT; Muratov EN; Paveley RA; Furnham N; Kamentsky L; Carpenter AE; Silva-Junior FP; Andrade CH QSAR-Driven Discovery of Novel Chemical Scaffolds Active against Schistosoma Mansoni. J Chem Inf Model 2016, 56 (7), 1357–1372. 10.1021/acs.jcim.6b00055.