Abstract
Gene expression and cell morphology data are high-dimensional biological readouts of much recent interest for drug discovery. They are able to describe biological systems in different states (e.g., healthy and diseased), as well as biological systems before and after compound treatment, and they are hence useful for matching both spaces (e.g., for drug repurposing) as well as for characterizing compounds with respect to efficacy and safety endpoints. This Microperspective describes recent advances in this direction with a focus on applied drug discovery and drug repurposing, as well as outlining what else is needed to advance further, with a particular focus on better understanding the applicability domain of readouts and their relevance for decision making, which is currently often still unclear.
Keywords: drug discovery, transcriptomics, gene expression, cell painting, cell morphology, assay predictivity
Gene expression and cell morphology are high-dimensional biological readouts of much recent interest for drug discovery.1−3 One major reason for this is the unsatisfactory situation with respect to in vivo-relevant and translational early-stage assays with clinical predictivity for efficacy and safety endpoints.4−7 At the same time, advancements in assay technologies, from Sanger sequencing to microarrays and more recently RNA-Seq and derived technologies such as DRUG-Seq,8 TempO-Seq,9 and others, have made the generation of gene expression data much more facile on a large scale. On the cell morphology side, assays such as Cell Painting10 have achieved a similar state of progress (e.g., related to relative assay standardization and data analysis methods) ca. 10–15 years later.
The potential (but yet to be proven) advantage of the above technologies from the practical side is their general applicability: given that biological readouts are generated on a “systems” scale (with this term being admittedly somewhat undefined), the readouts obtained are potentially of interest for a wide variety of efficacy- and safety-related in vivo endpoints. However, such systems-wide readouts are generally obtained in a hypothesis-free (as opposed to hypothesis-driven) manner, hence describing general variance between systems instead of biological variables and endpoints of particular interest that have been defined from the outset. In practice, this means that the predictive validity4,5 of such wide endpoints is not defined ab initio, leading to the need to validate the predictive utility of a readout in a practically relevant drug discovery setting.
This Microperspective will give examples of the above readout domains, gene expression and cell morphology, and their recent applications in the drug discovery context. The following text will be structured by readout type, describing first gene expression readouts and then cell morphology readouts. Within each section, there will be first a brief discussion about technology developments of each readout type, followed by sections on efficacy and repurposing, safety, and mode of action analysis of compounds.11
Some general requirements for omics data to be used for efficacy- and safety-relevant decision making are summarized in a generic way in Table 1, with details depending on each particular case in practice.
Table 1. Requirements for Biological Endpoint Data to Make Practical Impact in Drug Discovery Projects.
| Endpoint property | Minimum threshold | Best-case scenario |
|---|---|---|
| Reproducibility | Intergroup variance is larger than intragroup variance | Consistently reproducible readout |
| Predictivity of signal to in vivo endpoints | Predictivity for subgroup of population | Predictivity across whole population (between and within subtypes) |
| Causality | Correlation-based marker, no causality | Causality present |
| Wide applicability/necessity of tailoring | Necessity of tailoring of assay/data processing setup for different in vivo endpoints | No tailoring needed for different in vivo endpoints |
| Standardized data processing pipeline | Data processing pipeline sufficiently standardized for decision making in some areas | Data processing pipeline standardized across all endpoints |
| Interpretability/allows inclusion of prior information | No inclusion of prior information possible | Inclusion of all desired prior information possible (e.g., pathway information) |
| Speed, cost | Practically feasible cost-wise for some projects, in plausible timeframes | Competitive in cost and time domain across all projects |
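The minimum reproducibility threshold in Table 1 (intergroup variance larger than intragroup variance) can be made concrete in several ways. The following is a minimal sketch that uses the one-way ANOVA F statistic as one possible formalization; the function name and the replicate values are hypothetical and only serve as an illustration.

```python
from statistics import mean

def variance_ratio(groups):
    """Between-group mean square divided by within-group mean square
    (the one-way ANOVA F statistic) for a single readout feature."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of treatment groups
    n = len(all_values)    # total number of measurements
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical replicate readouts of one feature for three treatments
treatments = [[1.0, 1.2, 0.9], [2.1, 1.9, 2.0], [3.0, 3.2, 2.9]]
ratio = variance_ratio(treatments)
print(f"between/within variance ratio: {ratio:.1f}")
```

A ratio well above 1 indicates that differences between treatments dominate replicate noise for this feature; which cutoff is acceptable in a given project is a separate decision.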
Transcriptomics Data
Technology
In more recent history of gene expression profiling, microarrays established themselves in the early 2000s as a sufficiently low-cost and high-enough-throughput tool,12,13 which has resulted in viable commercial offerings such as the Affymetrix gene expression chips in particular. However, for profiling large numbers of compounds, costs were still prohibitive, which led to several advances, such as the L1000 platform,14 which provided again a decrease in cost by about 2 orders of magnitude, although it only provided information on the expression of around 1000 predefined genes. A scientifically separate route taken somewhat later in time involved targeted technologies often based on RNA-Seq, which usually involve simplified sample preparation (not requiring RNA isolation) and multiplexing, and which were generally amenable to automation. These technologies include RASL-Seq,15 and more recently also TempO-Seq16 and DRUG-Seq.8 Different recent methods have different trade-offs. For example, PLATE-seq17 allows samples in 96-well plates to be profiled in high throughput; however, it requires lengthy RNA purification steps, and the 96-well format is not sufficiently high-throughput for many practical situations. Currently with techniques such as DRUG-Seq, profiling costs per sample (including sample preparation and sequencing) on the order of dollars can be achieved, assuming the setup is in place, allowing also high-throughput automation.8 Developments in technologies in some cases went hand-in-hand with the availability of public data, such as in the form of the Connectivity Map18 and L1000 data sets.14 However, for RNA-Seq data, until the current time, no such data set is available in the public domain.
New methods, of course, need to be validated, both in the absolute sense (whether the readout obtained is of use for the intended decision to be made in a drug discovery project) as well as in a relative sense (whether a new technology agrees with previous technologies), and given the pace with which technology develops, this is in practice often a difficult aim. This is an immense pain point in the pharmaceutical context, both in the fundamental sense (“Which readout now gives me a predictive signal?” 6,7) as well as the practical sense (in order to establish databases and workflows in a consistent manner, without the requirement to establish processes, and comparisons to existing data, every few years).
Many of the above technologies have hence been subject to extensive validation studies. The publication establishing DRUG-Seq8 concluded, based on a data set of 433 compounds, that transcription profiles successfully grouped compounds into mechanism-of-action clusters based on their intended targets, although in this analysis, mechanism-of-action was not defined on a selective on-target basis (which is also often too simplified a view of compound action) but rather a broader modulation of biological function category. A subsequent study on the same technology19 used an open-source analysis pipeline which also shows “high reproducibility and ability to resolve the mechanism(s) of action for a diverse set of compounds”, and which was applied to NMDA receptor modulators. It would have been interesting to show in that study whether clinically efficacious NMDA modulators (such as ketamine) could be distinguished from those that turn out to be not efficacious in the clinic, which would be the holy grail of such readouts.
Some comparisons have been made in the context of safety assessment, such as for the New Approach Methodologies (NAMs; i.e., non-animal)-based hazard characterization of environmental chemicals using TempO-Seq in an MCF7 cell model.20 The authors conclude that the data were reproducible, that aggregating genes into gene signatures is beneficial, and that this was the case in particular for concentration–response analyses, where previous reference data could be reproduced (i.e., the pathway gene expression signal corresponded to the most sensitive on-target assay activity from previous profiling). Taking concentration into account is probably an area where safety assessment is currently more advanced than efficacy assessment. TempO-Seq was validated in another study,16 which used 45 purified RNA samples from the livers of rats exposed to chemicals that fall into one of five modes of action. It was found that microarrays and TempO-Seq captured the most variability in terms of mode of action, while RNA-Seq had higher noise and larger differences between samples within a mode of action (which, however, is not necessarily bad, given that there is an individual aspect to compound pharmacology, beyond any annotated, labeled mode of action).
From the practical data analysis side, taking batch effects into account is crucial, since results may differ due to both biological variation and slight variations in assay protocol. Different methods have been compared recently,21 finding that the limma method correcting for two principal components showed the best performance.
More recent technologies include single-cell as well as spatially and time-resolved transcriptomics.22−25 Those technologies have led to pipelines for, e.g., drug repurposing based on single-cell data,26 a better understanding of the tumor microenvironment in lung cancer,27 and insights into early development of human embryos28 as well as other organisms. However, in addition to high cost and low throughput, there is often no clear path to practical deployment of many of those readouts in a decision-making context, since data generation, processing, and decision making have not yet converged on standardized pipelines to this end.
Efficacy/Repurposing
Gene expression data has been used for a number of years, on a large scale since the advent of Connectivity Map, for indication discovery and repurposing, including compounds tested in animal models.29 While it is difficult to pinpoint where exactly gene expression data generally (or Connectivity Map data specifically) has been used for drug discovery, those cases do exist—for example, in particular subtypes of acute lymphoblastic leukemia (ALL), gene expression data suggested HDAC inhibitors as a treatment,30 which were later found to be efficacious in in vivo mouse models.31 Given that many factors influence decisions in a drug discovery project, it is difficult to identify gene expression data as the true starting point for a project—but ample evidence exists that it is certainly one useful starting point.
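The matching idea behind Connectivity Map-style repurposing can be illustrated with a minimal sketch: compounds whose expression signature is anticorrelated with a disease signature are ranked first. The sketch below uses Spearman correlation as a simplified stand-in (the original Connectivity Map uses a different, enrichment-based connectivity score); all names and values are hypothetical.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def rank_reversers(disease_signature, compound_signatures):
    """Sort compounds from most anticorrelated to most correlated."""
    scores = {name: spearman(disease_signature, sig)
              for name, sig in compound_signatures.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical log fold changes over the same five genes
disease = [2.0, 1.5, -1.0, -2.0, 0.3]
compounds = {
    "cmpd_A": [-1.8, -1.2, 0.9, 2.2, -0.1],   # reverses the disease signature
    "cmpd_B": [1.9, 1.1, -0.8, -1.7, 0.2],    # mimics the disease signature
    "cmpd_C": [0.1, -0.3, 0.2, 0.0, -0.1],    # largely unrelated
}
ranking = rank_reversers(disease, compounds)
print(ranking[0][0])
```

In this toy setup the top-ranked compound is a repurposing hypothesis only; as discussed in the text, experimental validation remains necessary.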
More recent publications on the topic32 integrate human gene expression, drug perturbation, and clinical data, in this case for hyperlipidemia and hypertension. Based on an analysis of >21,000 compounds, this work was able to replicate 10 approved drugs and to identify 25 drugs approved for other indications with effects on clinically relevant biomarkers, which provides a first indication of efficacy.
Methodologically, e.g., Chemical-Induced Gene Expression Ranking (CIGER)33 has been a recent development, which has been validated in the pancreatic cancer context, albeit only on cell lines, and comparison to other methods has not been performed in this study. Utilizing a successor of the original Connectivity Map, the Library of Integrated Network-based Cellular Signatures (LINCS), as a database, and with the aim to provide better benchmarking, recent work34 identified homoharringtonine as a potential treatment for liver cancer, which was validated in vivo using xenograft and carbon tetrachloride-induced liver fibrosis models. Translation to humans remains an open question, but the aim to establish better benchmarking approaches is laudable.
One of the companies that explored using transcriptomics data for in-house discovery early on is Janssen, which recently used L1000 data for 31,000 compounds35 from Janssen’s primary screening deck to assess the potential of transcriptomics data to identify similar compounds and to generate target activity predictions and the scaffold-hopping potential of the resulting hits. For several targets, high-performing predictive models for on-target activity were obtained (balanced accuracy values ≥80%), which were also prospectively validated for novel scaffolds. One of the key questions is for which targets suitable predictive signals can be found in gene expression data, which is not yet entirely clear generally across endpoints, also due to differences in data generation and processing.
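Balanced accuracy, the metric quoted above, is the mean of sensitivity and specificity, which makes it more informative than plain accuracy when active compounds are rare. A minimal sketch with hypothetical labels (1 = active, 0 = inactive):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Hypothetical screen: 2 actives among 10 compounds, one active is missed
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
score = balanced_accuracy(y_true, y_pred)
print(score)
```

Plain accuracy would be 0.9 in this example, while balanced accuracy is 0.75 because half of the actives are missed.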
One of the key potential advantages of using gene expression data for compound characterization is that the signal is intrinsically biologically meaningful in that genes have a meaning with respect to their functions in pathways and in the cell as a whole. That this can be exploited practically has been shown when selecting agents for differentiation therapy in leukemia,36 where compounds were able to de-differentiate leukemia cells to granulocytes, which has also been the mechanistic selection criterion for compounds. Likewise, the application of using transcriptomics data to select small molecules that influence cellular differentiation has advantages, in that it enables therapy to move from transcription factors (with usually unsuitable pharmacokinetics (PK) profiles) to small molecules. That this is feasible in practice has been shown in a study37 which proposed and successfully validated several small molecules for differentiating stem cells to cardiomyocytes, which were profiled on the genetic and proteomics levels.
Compound Combinations
Gene expression data can also be used for combination drug discovery, with the caveat that combination treatment transcriptomics data is usually not available, and that hence some synergy (or other combination) hypothesis is needed to select suitable combinations. In practice, combination treatment is often different from just the “sum” of the monotreatments when it comes to gene expression responses,38 which has also been exemplified in more detail, e.g., in the area of adaptogens.39 This makes the selection of compound combinations with the desired effect based on single-compound data nontrivial; on the other hand, having some combination hypothesis, such as activity on the same pathway (or on complementary pathways), still provides a starting point for experimental study.
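What it means for a combination to differ from the “sum” of the monotreatments can be sketched as follows, under the assumption (made here for illustration only) that additivity is defined in log2 fold change space. Gene names, values, and the tolerance are hypothetical.

```python
def non_additive_effects(mono_a, mono_b, combo, tolerance=0.5):
    """Genes whose combination response deviates from the additive
    expectation (log2FC of A plus log2FC of B) by more than `tolerance`."""
    deviations = {}
    for gene, observed in combo.items():
        expected = mono_a.get(gene, 0.0) + mono_b.get(gene, 0.0)
        deviation = observed - expected
        if abs(deviation) > tolerance:
            deviations[gene] = deviation
    return deviations

# Hypothetical log2 fold changes versus vehicle control
mono_a = {"G1": 1.0, "G2": -0.5, "G3": 0.25}
mono_b = {"G1": 0.5, "G2": 0.0, "G3": 0.25}
combo = {"G1": 2.5, "G2": -0.5, "G3": -1.0}
result = non_additive_effects(mono_a, mono_b, combo)
print(result)
```

Here G1 responds more strongly than expected and G3 changes direction, while G2 behaves additively. Other null models than additivity are possible and would give different answers.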
On the cellular level, gene expression data for drug combination discovery has been used, e.g., in pancreatic cancer drug discovery,40 which was then validated prospectively by testing 30 compounds (and their combinations) on PANC-1 cells. Compounds suggested as combination agents with the standard therapy gemcitabine, based on the best-performing scoring system, showed on average 2.82–5.18 times higher synergies compared to compounds that were predicted to be active as single agents. Gene expression data has also been used successfully for explaining synergistic effects, such as for the combination of fucoxanthin and the phosphatidylinositol 3-kinase (PI3K) inhibitor LY-294002,41 where further mechanistic insight into compound action could be gained.
Prediction of Gene Expression
Gene expression data is not available for all compounds, and hence, understandably, recent approaches have also aimed to predict transcriptomics changes based on chemical structure. This is a nontrivial exercise, given the high-dimensional output space and the relatively small number of data points available, which in addition possess strong analogue bias.
One recent approach, termed MultiDCP,42 aimed to take differences between cell lines as well as dose into account when predicting gene expression changes as well as changes in cell viability upon compound treatment. The transformer-based method has shown, according to the authors, that predicted drug-induced gene expressions demonstrate a stronger predictive power than experimentally derived data itself for downstream tasks.
Generative models have become tremendously popular recently, and hence also generative models for generating compounds with a desired transcriptomics effect have been proposed.43 The method was validated by the similarity of generated molecules to known bioactives, which led to generally plausible results, and conditioning of the Generative Adversarial Network (GAN) employed on transcriptomics data was hence successful.
Safety
While the above applications of using gene expression data related to indication discovery and repurposing are plentiful, given the inherent multidimensionality of safety endpoints and the relative lack of availability of data, maybe applications in this field stand to gain even more when it comes to identifying the most suitable compounds to progress to the clinic. The experience of the authors, with colleagues, based on gene expression readouts44,45 has led to the conclusion that “things are not easy”—due to the multidimensional nature of transcriptomics readouts (and the unclear link to decisions), noise in the data, and the different nature of every project with many unknowns.
Using omics data for safety is driven by both technology advances and also the insight that broad profiling, across chemical and target space, is a huge endeavor, which still will be limited severely across both domains, and where omics data is possibly able to generate a broader view, at least on the biological response side. Hence also agencies changed their approach in the past years, such as the U.S. Environmental Protection Agency (EPA), moving toward the generation and working toward the regulatory acceptance of omics data in risk assessment.46 Challenges to be tackled are of technological, scientific, data analysis, standardization, and interpretation nature, and beyond, before methods will be able to advance public health and regulatory decisions (requirements for which have been discussed in ref (47)).
Given that animal testing has not been allowed for consumer products in Europe for a number of years, animal-free NAMs have been explored heavily by related companies; however, it is largely not clear which assays to utilize for decision-making related to which endpoint in practice. A recent study48 provides one of the few “complete” approaches in this area, for a hypothetical product containing 0.1% coumarin in face cream and body lotion and excluding existing animal and human data on coumarin in the process. Plasma Cmax was estimated using a physiologically based kinetic model for dermally applied coumarin, while systemic toxicity was assessed using a battery of in vitro NAMs to identify points of departure (PoDs) for a variety of biological effects and combined with ToxCast data, an in vitro cell stress panel, and high-throughput transcriptomics. The predicted Cmax values for face cream and body lotion were lower than all PoDs, with a margin of safety higher than 100; in addition, genotoxicity was ruled out and both receptor panel and immunomodulatory effects at consumer-relevant exposures were negative, hence showing a first successful comprehensive case study in the area. Applicability to future compounds and application contexts remains key to proceed further.
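The comparison of exposure and bioactivity described above amounts to simple arithmetic: the margin of safety is commonly taken as the ratio of the lowest PoD to the predicted Cmax. The sketch below assumes this definition; assay names and concentrations are hypothetical and are not taken from the cited study.

```python
def margin_of_safety(pods_um, cmax_um):
    """Ratio of the most sensitive (lowest) point of departure to the
    predicted plasma Cmax, both given in micromolar."""
    lowest_pod = min(pods_um.values())
    return lowest_pod / cmax_um

# Hypothetical PoDs (uM) from different in vitro readouts
pods = {"cell_stress_panel": 50.0, "transcriptomics": 8.0, "receptor_panel": 120.0}
cmax = 0.02  # hypothetical predicted plasma Cmax in uM
margin = margin_of_safety(pods, cmax)
print(f"margin of safety: {margin:.0f}")
```

Using the lowest PoD across all readouts is the conservative choice, since a single sensitive assay then determines the margin.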
Point of Departure Modeling
“Points of Departure” (PoDs) are a concept from safety assessment, describing the concentration at which a biological response (e.g., transcriptional changes) is observed, and some recent studies aimed to refine the state of the art in this area further. Basili et al.49 employed a method combining prior information with available data, using the Pathway-Level Information ExtractoR (PLIER) algorithm to identify latent variables (LVs) describing biological activity, and the authors then analyzed those LVs using the ToxCast pipeline. For 44 chemicals in MCF-7 cells, they showed that the workflow was able to discriminate between estrogenic and anti-estrogenic compounds. Approaches which combine data with prior knowledge will likely be key in the future, given the size of chemical and biological space and our inability to sample it properly.
A study in the ecotoxicology context50 comprising a total of 10 different compound interventions found that transcriptomics-based PoDs were mostly lower than apical PoDs; however, for extrapolations to some organisms this was not the case. One key finding was that the number of genes included in a PoD signature was significantly related to its robustness; in this particular study design, signatures with fewer than 15 differentially expressed genes were found likely to be unreliable for screening.
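The interplay between a transcriptomics-based PoD and the size of the underlying gene signature can be sketched in a strongly simplified form: the PoD is taken as the lowest tested concentration at which a minimum number of genes respond. This is an illustrative stand-in and not the benchmark-dose modeling used in the cited studies; the default of 15 genes mirrors the study-specific observation above and is not a general rule. All data are hypothetical.

```python
def transcriptomic_pod(log2fc_by_conc, fc_threshold=1.0, min_genes=15):
    """Lowest tested concentration at which at least `min_genes` genes
    reach |log2FC| >= fc_threshold; None if no concentration qualifies."""
    for conc in sorted(log2fc_by_conc):
        responsive = sum(1 for value in log2fc_by_conc[conc].values()
                         if abs(value) >= fc_threshold)
        if responsive >= min_genes:
            return conc
    return None

# Hypothetical log2 fold changes at three concentrations (uM)
data = {
    0.1: {"G1": 0.2, "G2": 0.1, "G3": 0.0},
    1.0: {"G1": 1.4, "G2": 0.3, "G3": 0.2},
    10.0: {"G1": 2.0, "G2": -1.6, "G3": 1.1},
}
print(transcriptomic_pod(data, min_genes=1))  # single-gene signature
print(transcriptomic_pod(data, min_genes=2))  # stricter signature size
```

Requiring more responsive genes shifts the PoD to a higher concentration in this toy example, which illustrates the trade-off between sensitivity and robustness.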
Time Series Analysis
While the above studies on PoD put transcriptomics changes in relation to concentration, for toxic endpoints also changes over time matter—in particular, genes can change transcription as a consequence of damage, or as a causal factor, which makes an important practical difference, such as in the construction of Adverse Outcome Pathways (AOPs). To investigate this in the context of non-alcoholic fatty liver disease (NAFLD), recent work51 used data for 28 steatotic chemicals with gene expression data measured at three time points and three doses to describe compound effect, and hence mechanisms leading to NAFLD, as a function of time.
Also in the drug-induced liver injury (DILI) area, a time-series analysis of transcriptomics data considering the Bradford–Hill criteria has been performed,52 which includes whether events are consistently observed in a certain temporal order, and hence this work introduces the concept of “first activation” as a data-driven means to generate hypotheses on potentially causal mechanisms. Data from TG-GATEs comprised time points from 3 h to 4 weeks post-treatment, and both known and potential novel mechanisms involving DILI were identified. In addition, transcription factor analysis was performed, also including prior information into the analysis of data. As with similar analyses of this type, data remains a challenge—whether inter-individual variation, the high dimensionality of readouts, lack of data across time points, or dependence of any results on the choice of specific parameters.
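One simplified reading of the “first activation” idea is to record, for each gene, the earliest time point at which its response exceeds a threshold, and then to order genes accordingly. The sketch below follows this reading only; it is not the published method, and gene names, threshold, and values are hypothetical.

```python
def first_activation(time_course, threshold=1.0):
    """Map each gene to the earliest time point (hours) at which
    |log2FC| >= threshold, or to None if it never responds."""
    result = {}
    for gene, series in time_course.items():
        hits = [t for t, value in sorted(series.items()) if abs(value) >= threshold]
        result[gene] = hits[0] if hits else None
    return result

def temporal_order(activation):
    """Responding genes sorted from earliest to latest first activation."""
    activated = [(t, gene) for gene, t in activation.items() if t is not None]
    return [gene for _, gene in sorted(activated)]

# Hypothetical log2 fold changes at 3 h, 24 h, and 4 weeks (672 h)
time_course = {
    "GENE_A": {3: 0.2, 24: 1.5, 672: 2.0},
    "GENE_B": {3: 1.2, 24: 1.1, 672: 0.4},
    "GENE_C": {3: 0.1, 24: 0.3, 672: 0.2},
}
activation = first_activation(time_course)
print(temporal_order(activation))
```

Genes that respond early are candidates for causal involvement, while late responders may rather reflect consequences of damage; as noted above, such an ordering depends on the chosen threshold and on the available time points.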
Mode of Action Analysis
“Mode of action” is a term easily said, but it can conceptually mean many things (direct targets, downstream modulation, etc.), and on the other hand really understanding the mode of action of a compound (with all its polypharmacology etc.) in a given disease context is a severe challenge. Accordingly, also labeling data with “modes of action” is tricky, as further outlined in a recent review.11
The analysis of modes of action has been done with the original CMap publication,18 e.g., for Trichostatin A, which also has been revisited by other technologies, such as TempO-Seq.9 However, in the experience of the authors, there is still a lack of understanding where each readout contains a suitable signal, given a way to generate data and a particular data analysis pipeline, leading to a biased representation of case studies—and there is likely a reason that, e.g., tubulin inhibitors, or said Trichostatin A, feature so often in such case studies, namely that this is a signal that is more easily observable than others. The question remains which on-target and pathway activities are observable in a given readout type in practice—and if we are unable to describe that, then we are unable to understand whether activity has not been observed because the compound does not cause it or whether it fundamentally cannot be captured by a certain type of readout.
In order to combine transcriptomics data with existing prior information, recent work53 employed causal reasoning and benchmarked existing algorithms (SigNet, CausalR, CausalR ScanR, and CARNIVAL) and networks (OmniPath vs MetaBase) on a data set of 269 mode-of-action-annotated compounds. An ANOVA showed that the combination of algorithm and network most significantly dictated the performance of causal reasoning algorithms, with the effect of the data source being less strong. All causal reasoning algorithms also outperformed pathway recovery based on input Differentially Expressed Genes (DEGs), with performance being somewhat correlated with connectivity and biological role of the targets.
Cell Painting Data
Technology
Compared to gene expression profiling, cell morphology profiling grew to maturity for practical purposes probably about a decade later (say, the early 2010s, compared to the early 2000s, although those numbers depend heavily on use case and technology). One significant practical development was the standardization of cell morphology readouts in the form of the Cell Painting10 assay at the Broad Institute, which aimed to enable large-scale data generation, and which is hence as essential as, say, the development of a microarray provider on the transcriptomics level. An obvious difference is the establishment of one readout type (microarrays) in the form of a standardized physical tool provided by commercial vendors, and the other readout type (Cell Painting) in the form of a publication, with no single provider for such services available. The Cell Painting assay has been shown to be robust also with slight variations of protocol and reagent vendors being used,54 which in addition to low running cost per data point (once the experimental environment has been established) adds to its advantages. In analogy to transcriptomics data, public data sets have been generated on a large scale and have recently been made available by the Broad Institute (https://jump-cellpainting.broadinstitute.org/).
Given its relative youth, the exploration of where cell morphology, and in particular Cell Painting data, can be used for drug discovery is still not very advanced. Some publications exist which determine best practices for data analysis from Cell Painting screens (reviewed recently55). One such study56 explored curve fitting at several levels of data aggregation and on computed metrics, and hit identification strategies based on single-concentration analysis included measurement of total effect magnitude and correlation of profiles among biological replicates. Overall, most of the methods achieved a 100% hit rate for the reference chemical and high concordance for 82% of test chemicals, indicating that hit calls obtained for this data set are robust across different analysis approaches. As with gene expression data, batch correction needs to be performed also with Cell Painting readouts. Given that Cell Painting is about 15 years more recent compared to microarray technologies, agreement on which methods to use is still in its infancy; a data set to this end has now been made public.57 A recent review58 summarizes the current state of the art when it comes to data analysis, with the authors describing it as moving from unsupervised and often clustering-based methods currently toward incorporation of more sophisticated machine-learning algorithms in the future. Still, there is some way to go when it comes to establishing best practices from data generation to analysis and decision-making.
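The two single-concentration hit identification criteria mentioned above, total effect magnitude and correlation of profiles among biological replicates, can be sketched as follows. The thresholds, the use of the Euclidean norm, and the median of pairwise Pearson correlations are assumptions made for this illustration; the profiles are hypothetical normalized feature vectors.

```python
from itertools import combinations
from statistics import median

def pearson(x, y):
    """Pearson correlation of two equally long feature vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def effect_magnitude(profile):
    """Euclidean norm of a control-normalized morphological profile."""
    return sum(v * v for v in profile) ** 0.5

def replicate_correlation(replicates):
    """Median pairwise Pearson correlation among replicate profiles."""
    return median(pearson(a, b) for a, b in combinations(replicates, 2))

def is_hit(replicates, min_magnitude=1.0, min_correlation=0.8):
    """Hit call: strong mean profile that is also consistent across replicates."""
    mean_profile = [sum(col) / len(col) for col in zip(*replicates)]
    return (effect_magnitude(mean_profile) >= min_magnitude
            and replicate_correlation(replicates) >= min_correlation)

# Hypothetical replicate profiles over four morphological features
active = [[2.0, -1.5, 0.5, 3.0], [1.8, -1.2, 0.7, 2.6], [2.2, -1.6, 0.4, 3.1]]
inactive = [[0.1, -0.2, 0.1, 0.0], [-0.1, 0.1, 0.2, -0.1], [0.0, 0.2, -0.1, 0.1]]
print(is_hit(active), is_hit(inactive))
```

Combining both criteria guards against calling hits that are strong but irreproducible, or reproducible but negligible in size.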
Efficacy/Repurposing
While gene expression data has been used in a large number of repurposing studies so far, Cell Painting data has generally not reached this stage yet. However, this will likely change in the future, given that data for a large number of existing drugs is now also in the public domain,59 so that corresponding workflows will likely be set up and experimentally validated.
Given that cellular systems can be modulated by compounds, but also via, e.g., overexpression of genes (as well as in other ways), of course data from multiple dimensions can be used for, e.g., virtual screening. One recent study60 employed cell morphology images from gene overexpression to identify matching images from the small-molecule side, thereby identifying modulators of three target genes which were experimentally validated. While the imaging data obtained can hence be reused for novel targets, it also needs to be said that only for a minority of genes in the currently available data could compound matches be found, and that also not in all cases were gene overexpression and compound modulation phenotypes matched, and that hence a positive selection of cases that successfully worked has been performed.
Cell Painting data has also been used for compound selection against dihydroorotate dehydrogenase (DHODH)61 as well as tubulin,62 where in particular in the latter case it is understandable that a suitable, visually apparent cellular phenotype for such compound selection exists.
For esophageal adenocarcinoma, Cell Painting data has been used for compound selection, leading to 51 validated and selective hits out of a total library size of 19,555 compounds.63 For the most potent and selective hits, namely elesclomol, disulfiram, and ammonium pyrrolidinedithiocarbamate, mode-of-action elucidation led to copper-dependent cell death, which was then proposed as a novel way of targeting esophageal adenocarcinoma in the future.
As in the case of transcriptomics, generative models have also been applied to cell morphology data.64 Models were conditioned on the Cell Painting morphological profiles of 30,000 compounds. The model generated plausible chemistry, which in some cases resembled compounds with the desired bioactivities in the ExCAPE database. However, truly prospective testing of the model with novel compounds has not been performed.
Safety
As described for transcriptomics data, also cell morphology, and Cell Painting data in particular, has been under intensive evaluation for safety endpoints, such as by the EPA.46
The predictive value of a readout, given a particular assay setup and data processing pipeline, needs to be validated for every endpoint individually. This has been done for mitochondrial toxicity,65 using Cell Painting, gene expression, and compound structural information, for 382 chemical perturbants tested in the Tox21 mitochondrial membrane depolarization assay. Mitochondrial toxicants were found to differ from nontoxic compounds in morphological space, and including this combination of features in predictive models significantly improved model performance on an external test set (n = 244), thereby improving extrapolation to new chemical space.
For new modalities, such as PROteolysis TArgeting Chimeras (PROTACs), it is not clear how to profile compounds for safety, and the Cell Painting assay has been evaluated for this purpose recently as well.66 It was found that the signal contained in Cell Painting readouts with respect to mitochondrial toxicity of PROTACs was concentration-dependent, with 10 μM and 1 μM offering better signals for model generation than a concentration of 0.1 μM, and with the model based on 1 μM concentration data offering virtually perfect classification on a prospective test set.
On a wider scale, a recent study used Cell Painting readouts to predict 70 cell health phenotypes across a wide range of biology,67 and it was generally judged to be successful. The practical relevance is that such endpoints are potentially predictive of safety in the clinic, although this needs to be evaluated further.
Cell Painting has also been employed68 to assess combination effects of cetyltrimethylammonium bromide, bisphenol A, and dibutyltin dilaurate on four human cell lines, and it was found that bisphenol A exacerbates morphological effects of the other two compounds. Importantly, in this work effects have been found to be cell line-dependent, hence requiring the assay setup to be chosen carefully to be relevant for the in vivo situation one aims to predict.
Mode-of-Action Analysis
Attempts to use cell morphology data to understand a compound’s mode of action date back a number of years, and early attempts included an integrated view of morphology-based readouts with computational target prediction, showing the complementarity of both readout types.69
Cell Painting data has been used in multiple studies so far for mode-of-action analysis. Recent work70 found, for a set of 10 well-represented MoA classes, that the macro-averaged F1 score was 0.58 when training on only the structural data, but increased to 0.81 when training on only the image data and to 0.92 when training on both input data types together. Note that not all bioactivity classes might be as well-populated or as easily annotated with a “mode of action”, and hence some degree of selection has been performed in this study as well. The general finding of improved classification performance is nonetheless consistent with a study cited above,65 which underlined the value of including Cell Painting readouts in predicting biological endpoints, in particular for novel scaffolds.
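For reference, the macro-averaged F1 score is the unweighted mean of the per-class F1 scores, so every MoA class counts equally regardless of how many compounds it contains. A minimal sketch in plain Python follows; the class labels and predictions are made-up toy values, not data from the cited study.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical toy labels for two MoA classes
y_true = ["HDAC", "HDAC", "HDAC", "HDAC", "tubulin", "tubulin"]
y_pred = ["HDAC", "HDAC", "HDAC", "tubulin", "tubulin", "tubulin"]

score = macro_f1(y_true, y_pred)
print(round(score, 3))  # 0.829
```

Because each class contributes equally, a model that does well only on the largest classes is penalized more strongly than it would be under accuracy or micro-averaging.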
One of the earlier studies utilizing Cell Painting data for target prediction/virtual screening71 for glucocorticoid receptor translocation was able to predict assay-specific biological activity in two ongoing drug discovery projects based on predefined image features. Here, out of 535 assays, 5.8% and 8.0% could be predicted with an AUC-ROC of 0.9 or higher using Macau and deep neural networks, respectively, as classification methods, while for an AUC-ROC threshold of 0.7 these numbers were 40.7% and 45.8%, respectively. More recent work72 has been able to utilize convolutional neural networks (CNNs) on the images directly, and this study (albeit on a different data set) improved upon the above numbers, predicting 32% of the 209 biological assays at high predictive performance (AUC > 0.9). This suggests that Cell Painting data has the potential to be predictive for those targets in a prospective discovery situation as well and, if this proves to be true, to replace assays that are currently run on individual targets.
Information from multiple domains, such as chemical structure and imaging, has also been combined using Bayesian matrix factorization (BMF Macau) methods and compared to Random Forest.66 Both methods performed similarly when ECFP fingerprints were used as compound descriptors. However, BMF Macau outperformed Random Forest in 69.20% of cases when image data was used as compound descriptors. As also demonstrated in some of the above studies, cell morphology information added considerable value for predictions across relatively diverse chemistry, where cell morphology endpoints were similar despite different chemistry, such as when targeting β-catenin.73
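The figures above take the form "fraction of assays whose AUC-ROC reaches a threshold". The following is a minimal sketch of how such a summary can be computed, assuming per-assay activity labels and model scores are available. The assay names, labels, and scores are hypothetical toy values, and the rank-based AUC used here is a generic formulation rather than the evaluation code of the cited studies.

```python
def roc_auc(labels, scores):
    """AUC-ROC as the probability that an active outranks an inactive (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fraction_at_or_above(aucs, threshold):
    """Share of assays whose AUC-ROC is at or above the threshold."""
    return sum(1 for a in aucs if a >= threshold) / len(aucs)


# Hypothetical assays: (activity labels, predicted scores) per compound
assays = {
    "assay_A": ([1, 1, 0, 0, 0], [0.9, 0.8, 0.3, 0.2, 0.1]),
    "assay_B": ([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.2, 0.1]),
    "assay_C": ([1, 0, 0, 1, 0], [0.2, 0.9, 0.8, 0.3, 0.1]),
}
aucs = {name: roc_auc(y, s) for name, (y, s) in assays.items()}

for threshold in (0.9, 0.7):
    share = fraction_at_or_above(list(aucs.values()), threshold)
    print(f"AUC-ROC >= {threshold}: {share:.1%} of assays")
```

Reporting the share at more than one threshold, as the studies above do, separates assays that are predicted almost perfectly from those that are only moderately predictable.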
To conclude, a wide variety of methods to generate omics data (here focused on gene expression and Cell Painting data) exists, but there is no general consensus on (a) for which chemical structures, (b) which type of biological data (generated using which particular assay setup), and (c) which analysis method is relevant for decision making with respect to (d) which type of clinically relevant endpoint. This is the subject of much ongoing research, some of which has been summarized above. It also makes the concept of an “Applicability Domain” multidimensional (extending significantly beyond the chemical domain, where it is conventionally applied) and hence much more complex than in a single domain alone, since applicability in one domain is conditional on the setup across all other domains.
So where do transcriptomics and Cell Painting data currently stand? An attempt at a summary, which reflects the admittedly subjective opinion of the authors and is necessarily broad given that the answer is domain-specific, is shown in Table 2. It can be seen that we are not quite there yet, but we are moving forward, albeit slowly.
Table 2. Assessment of Transcriptomics and Cell Morphology/Cell Painting Data to Make Practical Impact in Drug Discovery Projects.
| Endpoint property | Gene expression | Cell Painting |
|---|---|---|
| Reproducibility | Depends on platform; technical reproducibility often sufficient, while biological reproducibility can represent problems | Reproducibility of assay setup across different types of experimental platforms can be non-trivial; once set up, reproducibility is generally fit for the purpose (although, based on personal communication, person-to-person variation between technicians can influence results) |
| Predictivity of signal to in vivo endpoints | Needs to be established on a case-by-case basis; some work has been done over the past 10–15 years, but much remains to be done | Translation to in vivo endpoints still largely needs to be established (which is a currently intense area of research) |
| Causality | Data itself is not causal, e.g., time-course analysis needs to be performed to work toward causality | Data itself is not causal, e.g., time-course analysis needs to be performed to work toward causality |
| Wide applicability/necessity of tailoring | Potentially widely applicable readout, but applicability domain still needs to be established further | Potentially widely applicable readout, but applicability domain still needs to be established further |
| Standardized data processing pipeline | Data processing somewhat standardized but depends heavily on type of input data available (microarray data processing is more standardized than bulk RNA-Seq processing, which is more standardized than processing of single-cell data currently) | Given that a standard protocol for Cell Painting assays exists, this has in principle fewer degrees of freedom than in the case of gene expression data, but no definite standards exist currently w.r.t. normalization, removing highly correlated features, etc. |
| Interpretability/allows inclusion of prior information | Gene expression data is inherently interpretable on the gene and pathways level; allows for inclusion of pathway information | Cell morphology data is inherently (by itself) not interpretable on the gene level; does not per se allow for the inclusion of pathway information (but readouts can be linked to landmark compounds etc.) |
| Speed, cost | Depends on platform; newer platforms have low cost and are fully automatable | Generally low running cost but high upfront investment (financially and related to setting up machinery) |
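Table 2 notes that no definite standards currently exist for normalization and for removing highly correlated features in Cell Painting data. As an illustration only, one frequently used pattern is to scale each feature against the negative-control wells and then greedily drop features that correlate strongly with a feature already kept. The sketch below assumes hypothetical feature names, toy values, and an arbitrary correlation cutoff of 0.9; it is not a recommended standard.

```python
from statistics import mean, stdev


def normalize_to_controls(columns, control_idx):
    """Z-score every feature against the negative-control wells."""
    normalized = {}
    for name, values in columns.items():
        ctrl = [values[i] for i in control_idx]
        mu, sd = mean(ctrl), stdev(ctrl)
        if sd == 0:
            continue  # feature is constant in controls; dropped in this sketch
        normalized[name] = [(v - mu) / sd for v in values]
    return normalized


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| to every already-kept feature is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical per-well features; wells 0-2 are negative controls
columns = {
    "area": [10.0, 12.0, 11.0, 20.0, 25.0, 30.0],
    "perimeter": [20.0, 24.0, 22.0, 40.0, 50.0, 61.0],
    "intensity": [1.0, 3.0, 2.0, 2.5, 0.5, 3.5],
}
normalized = normalize_to_controls(columns, control_idx=[0, 1, 2])
kept = drop_correlated(normalized)
print(kept)  # ['area', 'intensity']
```

The greedy filter depends on feature order, which is one of the degrees of freedom that a shared standard would need to fix.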
Of course, biology does not exist only in the distinct layers of gene expression and cell morphology (and beyond). Links between these domains also exist and have been analyzed before74,75 with the aim of better understanding the relative information content related to particular areas of chemical and response space. One of these studies75 found that Cell Painting measures fewer distinct groups of features than L1000 gene expression readouts, and L1000 transcriptomics data was hence also able to distinguish a larger number of modes of action. In general, both assays provided complementary information in this work.
In biological assays, concentration/dose, time point, and cell line are often key parameters that can significantly change results. For the Cell Painting assay, which by default uses the U2OS osteosarcoma cell line, recent work76 has shown relative insensitivity to cell line choice for 14 phenotypic reference chemicals across six biologically diverse human-derived cell lines (U2OS, MCF7, HepG2, A549, HTB-9, and ARPE-19 cells), in contrast to the work cited earlier, which studied combination effects of a smaller set of compounds in four cell lines.68 Image acquisition settings and cell segmentation parameters needed to be adjusted for each cell type, but not the cytochemistry protocol. The more biological responses generalize across assay parameters such as cell line, the easier the above “Applicability Domain” question becomes; however, there will be natural limits to this, given that different cells are meant to respond differently to different stimuli once certain thresholds have been surpassed.
To achieve in vivo relevant predictions, extrapolation to the whole organism at therapeutically relevant concentrations is required. To this end, models have recently been published that predict in vivo PK directly from chemical structure,77−79 but the question of how to integrate (a) the dose and (b) a detected signal in proxy space, and how to link these to (c) a decision-making point, deserves further attention.
To really advance the field, consortia will be needed to further explore biological readout space and its utilization for predicting in vivo relevant endpoints. Some such consortia have operated in the transcriptomics area in the past, such as the QSTAR Consortium supported by Janssen,44 while in the Cell Painting area the JUMP consortium has made data available to the public domain on a large scale. Such consortia will achieve the best results if they include aspects of prospective experimental design across the chemical space to be tested, agreement on the biological setup of data generation (cell line, dose, time point, etc.) and data handling, as well as clear links to in vivo relevant toxicity endpoints. This is in contrast to consortia based only on “data sharing” of data that has been generated for a completely different purpose in the past, and where the chemistry used, assay setup, and annotations employed do not usually match a given new purpose. Likewise, the sharing of results needs to be performed on a large scale, to understand better which type of readout is predictive for which type of in vivo endpoint.
Due to the size of biological hypothesis space, this will in most cases likely necessitate the inclusion of prior information in addition to data, as well as PK information to assess any type of efficacy- and safety-related endpoint, leading to two possible model architectures described in Figure 1. (1) In the first, safety endpoints such as DILI require a certain amount of mechanistic understanding, leading, for example, to the insight that BSEP inhibition is one of the mechanistic factors that can lead to DILI; relevant chemical space is then tested with respect to this endpoint, and models are combined with PK models to make in vivo relevant predictions. However, here experimental data is required for every endpoint (or toxicity pathway). (2) Alternatively, and this is the great promise of systems-based assays such as the ones presented here, several or even many modes of toxicity can be captured in a single assay, where the toxicity-relevant signal is combined with PK/exposure information to arrive at predictions relevant for the in vivo situation. Understanding better which endpoints should be measured independently in an assay (in the first case), or which variables should be considered predictive of the in vivo situation (in the second case), is the subject of intense ongoing research.
Figure 1.
Architecture of in vivo-relevant endpoint models, combining either systems-based assay readouts (top), such as transcriptomics or Cell Painting-based readouts, with PK/exposure models to arrive at in vivo relevant predictions, or endpoint-based assays with PK models for this purpose (bottom). The great advantage of using systems-based assays would be that no individual assays are required for each individual toxicity (or efficacy) endpoint; rather, a single systems-based readout would be predictive for a variety of in vivo effects in a quantitative manner.
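One simple way to picture the top architecture in Figure 1 is as a ratio between the concentration at which a systems-based assay first shows a relevant signal and a predicted in vivo exposure. The sketch below makes this concrete under strong assumptions: the field names, the example values, and the margin cutoff of 10 are all hypothetical and are not taken from this article or the cited work.

```python
from dataclasses import dataclass


@dataclass
class CompoundEvidence:
    name: str
    in_vitro_pod_um: float   # lowest concentration with an assay signal (µM); hypothetical
    cmax_unbound_um: float   # predicted unbound plasma Cmax from a PK model (µM); hypothetical


def exposure_margin(c: CompoundEvidence) -> float:
    """Ratio of the in vitro signal concentration to the predicted in vivo exposure."""
    if c.cmax_unbound_um <= 0:
        raise ValueError("predicted exposure must be positive")
    return c.in_vitro_pod_um / c.cmax_unbound_um


def triage(c: CompoundEvidence, min_margin: float = 10.0) -> str:
    """Flag compounds whose assay signal appears close to the predicted exposure."""
    return "follow up" if exposure_margin(c) < min_margin else "lower priority"


compounds = [
    CompoundEvidence("cmpd_A", in_vitro_pod_um=1.0, cmax_unbound_um=0.5),
    CompoundEvidence("cmpd_B", in_vitro_pod_um=25.0, cmax_unbound_um=0.25),
]
results = {c.name: (exposure_margin(c), triage(c)) for c in compounds}
for name, (margin, decision) in results.items():
    print(f"{name}: margin {margin:.0f}x -> {decision}")
```

In practice, which readout defines the signal concentration and which cutoff is appropriate are exactly the open questions about predictive validity discussed above.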
Acknowledgments
Emmanuel Gustin and Jens Peter von Kries are thanked for their input on the topics discussed in this work.
The authors declare the following competing financial interest(s): A.B. is a co-founder and shareholder of Healx Ltd. and Pharmenable Ltd., an employee and shareholder of Pangea Botanica Ltd., and has been and is a consultant to various pharmaceutical companies.
Special Issue
Published as part of the ACS Medicinal Chemistry Letters virtual special issue “New Enabling Drug Discovery Technologies - Recent Progress”.
References
- Haghighi M.; Caicedo J. C.; Cimini B. A.; Carpenter A. E.; Singh S. High-dimensional gene expression and morphology profiles of cells across 28,000 genetic and chemical perturbations. Nat. Methods 2022, 19, 1550–1557. 10.1038/s41592-022-01667-0.
- Chow Y. L.; Singh S.; Carpenter A. E.; Way G. P. Predicting drug polypharmacology from cell morphology readouts using variational autoencoder latent space arithmetic. PLoS Comput. Biol. 2022, 18, e1009888 10.1371/journal.pcbi.1009888.
- Walton R. T.; Singh A.; Blainey P. C. Pooled genetic screens with image-based profiling. Mol. Syst. Biol. 2022, 18, e10768 10.15252/msb.202110768.
- Scannell J. W.; Bosley J. When Quality Beats Quantity: Decision Theory, Drug Discovery, and the Reproducibility Crisis. PLoS One 2016, 11, e0147215 10.1371/journal.pone.0147215.
- Scannell J. W.; Bosley J.; Hickman J. A.; Dawson G. R.; Truebel H.; Ferreira G. S.; Richards D.; Treherne J. M. Predictive validity in drug discovery: what it is, why it matters and how to improve it. Nat. Rev. Drug Discovery 2022, 21, 915–931. 10.1038/s41573-022-00552-x.
- Bender A.; Cortés-Ciriano I. Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: Ways to make an impact, and why we are not there yet. Drug Discovery Today. 2021, 26, 511–524. 10.1016/j.drudis.2020.12.009.
- Bender A.; Cortes-Ciriano I. Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 2: a discussion of chemical and biological data. Drug Discovery Today 2021, 26, 1040–1052. 10.1016/j.drudis.2020.11.037.
- Ye C.; Ho D. J.; Neri M.; Yang C.; Kulkarni T.; Randhawa R.; Henault M.; Mostacci N.; Farmer P.; Renner S.; Ihry R.; Mansur L.; Keller C. G.; McAllister G.; Hild M.; Jenkins J.; Kaykas A. DRUG-seq for miniaturized high-throughput transcriptome profiling in drug discovery. Nat. Commun. 2018, 9, 4307. 10.1038/s41467-018-06500-x.
- Yeakley J. M.; Shepard P. J.; Goyena D. E.; VanSteenhouse H. C.; McComb J. D.; Seligmann B. E. A trichostatin A expression signature identified by TempO-Seq targeted whole transcriptome profiling. PLoS One 2017, 12, e0178302 10.1371/journal.pone.0178302.
- Bray M. A.; Singh S.; Han H.; Davis C. T.; Borgeson B.; Hartland C.; Kost-Alimova M.; Gustafsdottir S. M.; Gibson C. C.; Carpenter A. E. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nat. Protoc. 2016, 11, 1757–1774. 10.1038/nprot.2016.105.
- Trapotsi M.-A.; Hosseini-Gerami L.; Bender A. Computational analyses of mechanism of action (MoA): data, methods and integration. RSC Chem. Biol. 2022, 3, 170–200. 10.1039/D1CB00069A.
- Lockhart D. J.; Dong H.; Byrne M. C.; Follettie M. T.; Gallo M. V.; Chee M. S.; Mittmann M.; Wang C.; Kobayashi M.; Norton H.; Brown E. L. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 1996, 14, 1675–1680. 10.1038/nbt1296-1675.
- Pollack J. R.; Perou C. M.; Alizadeh A. A.; Eisen M. B.; Pergamenschikov A.; Williams C. F.; Jeffrey S. S.; Botstein D.; Brown P. O. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat. Genet. 1999, 23, 41–46. 10.1038/12640.
- Subramanian A.; Narayan R.; Corsello S. M.; Peck D. D.; Natoli T. E.; Lu X.; Gould J.; Davis J. F.; Tubelli A. A.; Asiedu J. K.; Lahr D. L.; Hirschman J. E.; Liu Z.; Donahue M.; Julian B.; Khan M.; Wadden D.; Smith I. C.; Lam D.; Liberzon A.; Toder C.; Bagul M.; Orzechowski M.; Enache O. M.; Piccioni F.; Johnson S. A.; Lyons N. J.; Berger A. H.; Shamji A. F.; Brooks A. N.; Vrcic A.; Flynn C.; Rosains J.; Takeda D. Y.; Hu R.; Davison D.; Lamb J.; Ardlie K.; Hogstrom L.; Greenside P.; Gray N. S.; Clemons P. A.; Silver S.; Wu X.; Zhao W.-N.; Read-Button W.; Wu X.; Haggarty S. J.; Ronco L. V.; Boehm J. S.; Schreiber S. L.; Doench J. G.; Bittker J. A.; Root D. E.; Wong B.; Golub T. R. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 2017, 171, 1437–1452.e17. 10.1016/j.cell.2017.10.049.
- Li H.; Qiu J.; Fu X.-D. RASL-seq for massively parallel and quantitative analysis of gene expression. Curr. Protoc. Mol. Biol. 2012, 98, 4.13.1–4.13.9. 10.1002/0471142727.mb0413s98.
- Bushel P. R.; Paules R. S.; Auerbach S. S. A Comparison of the TempO-Seq S1500+ Platform to RNA-Seq and Microarray Using Rat Liver Mode of Action Samples. Front. Genet. 2018, 9, 485. 10.3389/fgene.2018.00485.
- Bush E. C.; Ray F.; Alvarez M. J.; Realubit R.; Li H.; Karan C.; Califano A.; Sims P. A. PLATE-Seq for genome-wide regulatory network analysis of high-throughput screens. Nat. Commun. 2017, 8, 105. 10.1038/s41467-017-00136-z.
- Lamb J.; Crawford E. D.; Peck D.; Modell J. W.; Blat I. C.; Wrobel M. J.; Lerner J.; Brunet J. P.; Subramanian A.; Ross K. N.; Reich M.; Hieronymus H.; Wei G.; Armstrong S. A.; Haggarty S. J.; Clemons P. A.; Wei R.; Carr S. A.; Lander E. S.; Golub T. R. The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 2006, 313, 1929. 10.1126/science.1132939.
- Li J.; Ho D. J.; Henault M.; Yang C.; Neri M.; Ge R.; Renner S.; Mansur L.; Lindeman A.; Kelly B.; Tumkaya T.; Ke X.; Soler-Llavina G.; Shanker G.; Russ C.; Hild M.; Gubser Keller C.; Jenkins J. L.; Worringer K. A.; Sigoillot F. D.; Ihry R. J. DRUG-seq Provides Unbiased Biological Activity Readouts for Neuroscience Drug Discovery. ACS Chem. Biol. 2022, 17, 1401–1414. 10.1021/acschembio.1c00920.
- Harrill J. A.; Everett L. J.; Haggard D. E.; Sheffield T.; Bundy J. L.; Willis C. M.; Thomas R. S.; Shah I.; Judson R. S. High-Throughput Transcriptomics Platform for Screening Environmental Chemicals. Toxicol. Sci. 2021, 181, 68–89. 10.1093/toxsci/kfab009.
- Zhou W.; Koudijs K. K. M.; Böhringer S. Influence of batch effect correction methods on drug induced differential gene expression profiles. BMC Bioinformatics 2019, 20, 437. 10.1186/s12859-019-3028-6.
- Yu L.; Wang X.; Mu Q.; Tam S. S. T.; Loi D. S. C.; Chan A. K. Y.; Poon W. S.; Ng H. K.; Chan D. T. M.; Wang J.; Wu A. R. scONE-seq: A single-cell multi-omics method enables simultaneous dissection of phenotype and genotype heterogeneity from frozen tumors. Sci. Adv. 2023, 9, eabp8901 10.1126/sciadv.abp8901.
- Salmen F.; De Jonghe J.; Kaminski T. S.; Alemany A.; Parada G. E.; Verity-Legg J.; Yanagida A.; Kohler T. N.; Battich N.; van den Brekel F.; Ellermann A. L.; Arias A. M.; Nichols J.; Hemberg M.; Hollfelder F.; van Oudenaarden A. High-throughput total RNA sequencing in single cells using VASA-seq. Nat. Biotechnol. 2022, 40, 1780–1793. 10.1038/s41587-022-01361-8.
- Watson E. R.; Taherian Fard A.; Mar J. C. Computational Methods for Single-Cell Imaging and Omics Data Integration. Front. Mol. Biosci. 2022, 8, 768106. 10.3389/fmolb.2021.768106.
- Park J.; Kim J.; Lewy T.; Rice C. M.; Elemento O.; Rendeiro A. F.; Mason C. E. Spatial omics technologies at multimodal and single cell/subcellular level. Genome Biol. 2022, 23, 256. 10.1186/s13059-022-02824-6.
- He B.; Xiao Y.; Liang H.; Huang Q.; Du Y.; Li Y.; Garmire D.; Sun D.; Garmire L. X. ASGARD: A Single-cell Guided pipeline to Aid Repurposing of Drugs. arXiv Preprint 2021, 10.48550/arXiv.2109.06377.
- Wu F.; Fan J.; He Y.; Xiong A.; Yu J.; Li Y.; Zhang Y.; Zhao W.; Zhou F.; Li W.; Zhang J.; Zhang X.; Qiao M.; Gao G.; Chen S.; Chen X.; Li X.; Hou L.; Wu C.; Su C.; Ren S.; Odenthal M.; Buettner R.; Fang N.; Zhou C. Single-cell profiling of tumor heterogeneity and the microenvironment in advanced non-small cell lung cancer. Nat. Commun. 2021, 12, 2540. 10.1038/s41467-021-22801-0.
- Xu Y.; Zhang T.; Zhou Q.; Hu M.; Qi Y.; Xue Y.; Wang L.; Nie Y.; Bao Z.; Shi W. A single-cell transcriptome atlas of human early embryogenesis. bioRxiv Preprint 2021, 10.1101/2021.11.30.470583.
- Keenan A. B.; Wojciechowicz M. L.; Wang Z.; Jagodnik K. M.; Jenkins S. L.; Lachmann A.; Ma’ayan A. Connectivity Mapping: Methods and Applications. Annu. Rev. Biomed. Data Sci. 2019, 2, 69–92. 10.1146/annurev-biodatasci-072018-021211.
- Stumpel D. J.; Schneider P.; Seslija L.; Osaki H.; Williams O.; Pieters R.; Stam R. W. Connectivity mapping identifies HDAC inhibitors for the treatment of t(4;11)-positive infant acute lymphoblastic leukemia. Leukemia 2012, 26, 682–692. 10.1038/leu.2011.278.
- Garrido Castro P.; van Roon E. H. J.; Pinhanços S. S.; Trentin L.; Schneider P.; Kerstjens M.; Te Kronnie G.; Heidenreich O.; Pieters R.; Stam R. W. The HDAC inhibitor panobinostat (LBH589) exerts in vivo anti-leukaemic activity against MLL-rearranged acute lymphoblastic leukaemia and involves the RNF20/RNF40/WAC-H2B ubiquitination axis. Leukemia 2018, 32, 323–331. 10.1038/leu.2017.216.
- Wu P.; Feng Q.; Kerchberger V. E.; Nelson S. D.; Chen Q.; Li B.; Edwards T. L.; Cox N. J.; Phillips E. J.; Stein C. M.; Roden D. M.; Denny J. C.; Wei W. Q. Integrating gene expression and clinical data to identify drug repurposing candidates for hyperlipidemia and hypertension. Nat. Commun. 2022, 13, 46. 10.1038/s41467-021-27751-1.
- Pham T.-H.; Qiu Y.; Liu J.; Zimmer S.; O’Neill E.; Xie L.; Zhang P. Chemical-induced gene expression ranking and its application to pancreatic cancer drug repurposing. Patterns (N Y). 2022, 3, 100441. 10.1016/j.patter.2022.100441.
- Yang C.; Zhang H.; Chen M.; Wang S.; Qian R.; Zhang L.; Huang X.; Wang J.; Liu Z.; Qin W.; Wang C.; Hang H.; Wang H. A survey of optimal strategy for signature-based drug repositioning and an application to liver cancer. eLife 2022, 11, e71880 10.7554/eLife.71880.
- De Wolf H.; Cougnaud L.; Van Hoorde K.; De Bondt A.; Wegner J. K.; Ceulemans H.; Goehlmann H. High-Throughput Gene Expression Profiles to Define Drug Similarity and Predict Compound Activity. Assay Drug Dev Technol. 2018, 16, 162–176. 10.1089/adt.2018.845.
- KalantarMotamedi Y.; Ejeian F.; Sabouhi F.; Bahmani L.; Nejati A. S.; Bhagwat A. M.; Ahadi A. M.; Tafreshi A. P.; Nasr-Esfahani M. H.; Bender A. Transcriptional drug repositioning and cheminformatics approach for differentiation therapy. Sci. Rep. 2021, 11, 12537. 10.1038/s41598-021-91629-x.
- KalantarMotamedi Y.; Peymani M.; Baharvand H.; Nasr-Esfahani M. H.; Bender A. Systematic selection of small molecules to promote differentiation of embryonic stem cells and experimental validation for generating cardiomyocytes. Cell Death Disc. 2016, 2, 16007. 10.1038/cddiscovery.2016.7.
- Diaz J. E.; Ahsen M. E.; Schaffter T.; Chen X.; Realubit R. B.; Karan C.; Califano A.; Losic B.; Stolovitzky G. The transcriptomic response of cells to a drug combination is more than the sum of the responses to the monotherapies. eLife 2020, 9, e52707 10.7554/eLife.52707.
- Panossian A.; Hamm R.; Kadioglu O.; Wikman G.; Efferth T. Synergy and Antagonism of Active Constituents of ADAPT-232 on Transcriptional Level of Metabolic Regulation of Isolated Neuroglial Cells. Front. Neurosci. 2013, 7, 16. 10.3389/fnins.2013.00016.
- KalantarMotamedi Y.; Choi R. J.; Koh S. B.; Bramhall J. L.; Fan T. P.; Bender A. Prediction and identification of synergistic compound combinations against pancreatic cancer cells. iScience 2021, 24, 103080. 10.1016/j.isci.2021.103080.
- Pruteanu L. L.; Kopanitsa L.; Módos D.; Kletnieks E.; Samarova E.; Bender A.; Gomez L. D.; Bailey D. S. Transcriptomics predicts compound synergy in drug and natural product treated glioblastoma cells. PLoS One 2020, 15, e0239551 10.1371/journal.pone.0239551.
- Wu Y.; Liu Q.; Qiu Y.; Xie L. Deep learning prediction of chemical-induced dose-dependent and context-specific multiplex phenotype responses and its application to personalized alzheimer’s disease drug repurposing. PLOS Comp. Biol. 2022, 18, e1010367 10.1371/journal.pcbi.1010367.
- Méndez-Lucio O.; Baillif B.; Clevert D.-A.; Rouquie D.; Wichard J. De novo generation of hit-like molecules from gene expression signatures using artificial intelligence. Nat. Commun. 2020, 11, 10. 10.1038/s41467-019-13807-w.
- Verbist B.; Klambauer G.; Vervoort L.; Talloen W.; Shkedy Z.; Thas O.; Bender A.; Gohlmann H. W.H.; Hochreiter S. Using transcriptomics to guide lead optimization in drug discovery projects: Lessons learned from the QSTAR project. Drug Discov. Today 2015, 20, 505–513. 10.1016/j.drudis.2014.12.014.
- Alexander-Dann B.; Pruteanu L. L.; Oerton E.; Sharma N.; Berindan-Neagoe I.; Módos D.; Bender A. Developments in toxicogenomics: understanding and predicting compound-induced toxicity from gene expression data. Mol. Omics 2018, 14, 218–236. 10.1039/C8MO00042E.
- Thomas R. S; Bahadori T.; Buckley T. J; Cowden J.; Deisenroth C.; Dionisio K. L; Frithsen J. B; Grulke C. M; Gwinn M. R; Harrill J. A; Higuchi M.; Houck K. A; Hughes M. F; Hunter E S.; Isaacs K. K; Judson R. S; Knudsen T. B; Lambert J. C; Linnenbrink M.; Martin T. M; Newton S. R; Padilla S.; Patlewicz G.; Paul-Friedman K.; Phillips K. A; Richard A. M; Sams R.; Shafer T. J; Setzer R W.; Shah I.; Simmons J. E; Simmons S. O; Singh A.; Sobus J. R; Strynar M.; Swank A.; Tornero-Valez R.; Ulrich E. M; Villeneuve D. L; Wambaugh J. F; Wetmore B. A; Williams A. J The Next Generation Blueprint of Computational Toxicology at the U.S. Environmental Protection Agency. Toxicol. Sci. 2019, 169, 317–332. 10.1093/toxsci/kfz058.
- Harrill J.; Shah I.; Setzer R. W.; Haggard D.; Auerbach S.; Judson R.; Thomas R. S. Considerations for Strategic Use of High-Throughput Transcriptomics Chemical Screening Data in Regulatory Decisions. Curr. Opin. Toxicol. 2019, 15, 64–75. 10.1016/j.cotox.2019.05.004.
- Baltazar M. T.; Cable S.; Carmichael P. L.; Cubberley R.; Cull T.; Delagrange M.; Dent M. P.; Hatherell S.; Houghton J.; Kukic P.; Li H.; Lee M. Y.; Malcomber S.; Middleton A. M.; Moxon T. E.; Nathanail A. V.; Nicol B.; Pendlington R.; Reynolds G.; Reynolds J.; White A.; Westmoreland C. A Next-Generation Risk Assessment Case Study for Coumarin in Cosmetic Products. Toxicol. Sci. 2020, 176, 236–252. 10.1093/toxsci/kfaa048.
- Basili D.; Reynolds J.; Houghton J.; Malcomber S.; Chambers B.; Liddell M.; Muller I.; White A.; Shah I.; Everett L. J.; Middleton A.; Bender A. Latent Variables Capture Pathway-Level Points of Departure in High-Throughput Toxicogenomic Data. Chem. Res. Toxicol. 2022, 35, 670–683. 10.1021/acs.chemrestox.1c00444.
- Villeneuve D. L.; Le M.; Hazemi M.; Biales A.; Bencic D. C.; Bush K.; Flick R.; Martinson J.; Morshead M.; Rodriguez K. S.; Vitense K.; Flynn K. Pilot testing and optimization of a larval fathead minnow high throughput transcriptomics assay. Curr. Res. Tox. 2023, 4, 100099. 10.1016/j.crtox.2022.100099.
- Aguayo-Orozco A.; Bois F. Y.; Brunak S.; Taboureau O. Analysis of Time-Series Gene Expression Data to Explore Mechanisms of Chemical-Induced Hepatic Steatosis Toxicity. Front. Genet. 2018, 9, 396. 10.3389/fgene.2018.00396.
- Liu A.; Han S.; Munoz-Muriedas J.; Bender A. Deriving time-concordant event cascades from gene expression data: A case study for Drug-Induced Liver Injury (DILI). PLOS Comp. Biol. 2022, 18, e1010148 10.1371/journal.pcbi.1010148.
- Hosseini-Gerami L.; Collier D. A.; Laing E.; Evans D.; Broughton H.; Bender A. Benchmarking causal reasoning algorithms for gene expression-based compound mechanism of action analysis. Research Square Preprint 2022, 10.21203/rs.3.rs-1239049/v1.
- Cimini B. A.; Chandrasekaran S. N.; Kost-Alimova M.; Miller L.; Goodale A.; Fritchman B.; Byrne P.; Garg S.; Jamali N.; Logan D. J.; Concannon J. B.; Lardeau C.-H.; Mouchet E.; Singh S.; Abbasi H. S.; Aspesi P. Jr; Boyd J. D.; Gilbert T.; Gnutt D.; Hariharan S.; Hernandez D.; Hormel G.; Juhani K.; Melanson M.; Mervin L.; Monteverde T.; Pilling J. E.; Skepner A.; Swalley S. E.; Vrcic A.; Weisbart E.; Williams G.; Yu S.; Zapiec B.; Carpenter A. E. Optimizing the Cell Painting assay for image-based profiling. bioRxiv Preprint 2022, 10.1101/2022.07.13.499171.
- Caicedo J.; Cooper S.; Heigwer F.; Warchal S.; Qiu P.; Molnar C.; Vasilevich A. S.; Barry J. D.; Bansal H. S.; Kraus O.; Wawer M.; Paavolainen L.; Herrmann M. D.; Rohban M.; Hung J.; Hennig H.; Concannon J.; Smith I.; Clemons P. A.; Singh S.; Rees P.; Horvath P.; Linington R. G.; Carpenter A. E. Data-analysis strategies for image-based cell profiling. Nat. Methods 2017, 14, 849–863. 10.1038/nmeth.4397. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nyffeler J.; Haggard D. E.; Willis C.; Setzer R. W.; Judson R.; Paul-Friedman K.; Everett L. J.; Harrill J. A. Comparison of Approaches for Determining Bioactivity Hits from High-Dimensional Profiling Data. SLAS Discovery 2021, 26, 292–308. 10.1177/2472555220950245. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sypetkowski M.; Rezanejad M.; Saberian S.; Kraus O.; Urbanik J.; Taylor J.; Mabey B.; Victors M.; Yosinski J.; Sereshkeh A. R.; Haque I.; Earnshaw B. RxRx1: A Dataset for Evaluating Experimental Batch Correction Methods. arXiv Preprint 2023, 10.48550/arXiv.2301.05768. [DOI] [Google Scholar]
- Chandrasekaran S. N.; Ceulemans H.; Boyd J. D.; Carpenter A. E. Image-based profiling for drug discovery: due for a machine-learning upgrade?. Nature Rev. Drug Disc. 2021, 20, 145–159. 10.1038/s41573-020-00117-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bray M. A.; Gustafsdottir S. M.; Rohban M. H.; Singh S.; Ljosa V.; Sokolnicki K. L.; Bittker J. A.; Bodycombe N. E.; Dancík V.; Hasaka T. P.; Hon C. S.; Kemp M. M.; Li K.; Walpita D.; Wawer M. J.; Golub T. R.; Schreiber S. L.; Clemons P. A.; Shamji A. F.; Carpenter A. E. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay. Gigascience 2017, 6, 1–5. 10.1093/gigascience/giw014.
- Rohban M. H.; Fuller A. M.; Tan C.; Goldstein J. T.; Syangtan D.; Gutnick A.; DeVine A.; Nijsure M. P.; Rigby M.; Sacher J. R.; Corsello S. M.; Peppler G. B.; Bogaczynska M.; Boghossian A.; Ciotti G. E.; Hands A. T.; Mekareeya A.; Doan M.; Gale J. P.; Derynck R.; Turbyville T.; Boerckel J. D.; Singh S.; Kiessling L. L.; Schwarz T. L.; Varelas X.; Wagner F. F.; Kafri R.; Eisinger-Mathason T. S.; Carpenter A. E. Virtual screening for small-molecule pathway regulators by image-profile matching. Cell Syst. 2022, 13, 724–736. 10.1016/j.cels.2022.08.003.
- Schölermann B.; Bonowski J.; Grigalunas M.; Burhop A.; Xie Y.; Hoock J. G. F.; Liu J.; Dow M.; Nelson A.; Nowak C.; Pahl A.; Sievers S.; Ziegler S. Identification of Dihydroorotate Dehydrogenase Inhibitors Using the Cell Painting Assay. ChemBioChem 2022, 23, e202200475. 10.1002/cbic.202200475.
- Akbarzadeh M.; Deipenwisch I.; Schoelermann B.; Pahl A.; Sievers S.; Ziegler S.; Waldmann H. Morphological profiling by means of the Cell Painting assay enables identification of tubulin-targeting compounds. Cell Chem. Biol. 2022, 29, 1053–1064. 10.1016/j.chembiol.2021.12.009.
- Hughes R. E.; Elliott R. J. R.; Li X.; Munro A. F.; Makda A.; Carter R. N.; Morton N. M.; Fujihara K.; Clemons N. J.; Fitzgerald R.; O’Neill J. R.; Hupp T.; Carragher N. O. Multiparametric High-Content Cell Painting Identifies Copper Ionophores as Selective Modulators of Esophageal Cancer Phenotypes. ACS Chem. Biol. 2022, 17, 1876–1889. 10.1021/acschembio.2c00301.
- Marin Zapata P. A.; Mendez-Lucio O.; Le T.; Beese C. J.; Wichard J.; Rouquie D.; Clevert D.-A. Cell morphology-guided de novo hit design by conditioning GANs on phenotypic image features. Digital Discovery 2023, 2, 91. 10.1039/D2DD00081D.
- Seal S.; Carreras-Puigvert J.; Trapotsi M.-A.; Yang H.; Spjuth O.; Bender A. Integrating cell morphology with gene expression and chemical structure to aid mitochondrial toxicity detection. Commun. Biol. 2022, 5, 858. 10.1038/s42003-022-03763-5.
- Trapotsi M.-A.; Mouchet E.; Williams G.; Monteverde T.; Juhani K.; Turkki R.; Miljković F.; Martinsson A.; Mervin L.; Pryde K. R.; Müllers E.; Barrett I.; Engkvist O.; Bender A.; Moreau K. Cell Morphological Profiling Enables High-Throughput Screening for Proteolysis TArgeting Chimera (PROTAC) Phenotypic Signature. ACS Chem. Biol. 2022, 17, 1733–1744. 10.1021/acschembio.2c00076.
- Way G. P.; Kost-Alimova M.; Shibue T.; Harrington W. F.; Gill S.; Piccioni F.; Becker T.; Shafqat-Abbasi H.; Hahn W. C.; Carpenter A. E.; Vazquez F.; Singh S. Predicting cell health phenotypes using image-based morphology profiling. Mol. Biol. Cell 2021, 32, 995–1005. 10.1091/mbc.E20-12-0784.
- Rietdijk J.; Aggarwal T.; Georgieva P.; Lapins M.; Carreras-Puigvert J.; Spjuth O. Morphological profiling of environmental chemicals enables efficient and untargeted exploration of combination effects. Sci. Total Environ. 2022, 832, 155058. 10.1016/j.scitotenv.2022.155058.
- Young D. W.; Bender A.; Hoyt J.; McWhinnie E.; Chirn G. W.; Tao C. Y.; Tallarico J. A.; Labow M.; Jenkins J. L.; Mitchison T. J.; Feng Y. Integrating high-content screening and ligand-target prediction to identify mechanism of action. Nat. Chem. Biol. 2008, 4, 59–68. 10.1038/nchembio.2007.53.
- Tian G.; Harrison P. J.; Sreenivasan A. P.; Carreras Puigvert J.; Spjuth O. Combining molecular and Cell Painting image data for mechanism of action prediction. bioRxiv Preprint 2022, 10.1101/2022.10.04.510834.
- Simm J.; Klambauer G.; Arany A.; Steijaert M.; Wegner J. K.; Gustin E.; Chupakhin V.; Chong Y. T.; Vialard J.; Buijnsters P.; Velter I.; Vapirev A.; Singh S.; Carpenter A. E.; Wuyts R.; Hochreiter S.; Moreau Y.; Ceulemans H. Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery. Cell Chem. Biol. 2018, 25, 611–618. 10.1016/j.chembiol.2018.01.015.
- Hofmarcher M.; Rumetshofer E.; Clevert D.-A.; Hochreiter S.; Klambauer G. Accurate Prediction of Biological Assays with High-Throughput Microscopy Images and Convolutional Networks. J. Chem. Inf. Model. 2019, 59, 1163–1171. 10.1021/acs.jcim.8b00670.
- Trapotsi M.-A.; Mervin L. H.; Afzal A. M.; Sturm N.; Engkvist O.; Barrett I. P.; Bender A. Comparison of Chemical Structure and Cell Morphology Information for Multitask Bioactivity Predictions. J. Chem. Inf. Model. 2021, 61, 1444–1456. 10.1021/acs.jcim.0c00864.
- Nassiri I.; McCall M. N. Systematic exploration of cell morphological phenotypes associated with a transcriptomic query. Nucleic Acids Res. 2018, 46, e116. 10.1093/nar/gky626.
- Way G. P.; Natoli T.; Adeboye A.; Litichevskiy L.; Yang A.; Lu X.; Caicedo J. C.; Cimini B. A.; Karhohs K.; Logan D. J.; Rohban M. H.; Kost-Alimova M.; Hartland K.; Bornholdt M.; Chandrasekaran S. N.; Haghighi M.; Weisbart E.; Singh S.; Subramanian A.; Carpenter A. E. Morphology and gene expression profiling provide complementary information for mapping cell state. Cell Syst. 2022, 13, 911–923. 10.1016/j.cels.2022.10.001.
- Willis C.; Nyffeler J.; Harrill J. Phenotypic Profiling of Reference Chemicals across Biologically Diverse Cell Types Using the Cell Painting Assay. SLAS Discovery 2020, 25, 755–769. 10.1177/2472555220928004.
- Schneckener S.; Grimbs S.; Hey J.; Menz S.; Osmers M.; Schaper S.; Hillisch A.; Goller A. H. Prediction of Oral Bioavailability in Rats: Transferring Insights from in Vitro Correlations to (Deep) Machine Learning Models Using in Silico Model Outputs and Chemical Structure Parameters. J. Chem. Inf. Model. 2019, 59, 4893–4905. 10.1021/acs.jcim.9b00460.
- Obrezanova O.; Martinsson A.; Whitehead T.; Mahmoud S.; Bender A.; Miljković F.; Grabowski P.; Irwin B.; Oprisiu I.; Conduit G.; Segall M.; Smith G. F.; Williamson B.; Winiwarter S.; Greene N. Prediction of In Vivo Pharmacokinetic Parameters and Time-Exposure Curves in Rats Using Machine Learning from the Chemical Structure. Mol. Pharmaceutics 2022, 19, 1488–1504. 10.1021/acs.molpharmaceut.2c00027.
- Miljković F.; Martinsson A.; Obrezanova O.; Williamson B.; Johnson M.; Sykes A.; Bender A.; Greene N. Machine Learning Models for Human In Vivo Pharmacokinetic Parameters with In-House Validation. Mol. Pharmaceutics 2021, 18, 4520–4530. 10.1021/acs.molpharmaceut.1c00718.


