Abstract
DNA encoded libraries (DELs) allow identifying starting chemical matter in drug discovery. The volume of experimental data generated also makes DELs an attractive resource for machine learning (ML). ML allows modeling complex relationships between compounds and numerical endpoints, like the binding to a target measured by DELs. DELs could also empower other areas of drug discovery. We propose that DELs and ML could be combined to model binding to off-targets, enabling better predictive toxicology. With enough data, ML models can make accurate predictions across a vast chemical space, and they can be reused and expanded across projects. Although there are limitations, more general toxicology models could be applied earlier in drug discovery, illuminating safety liabilities at a lower cost.
Keywords: cheminformatics, DNA encoded libraries, machine learning, deep learning, toxicology, safety pharmacology
Teaser.
Augmenting DNA encoded libraries with machine learning could enable predictive models with unprecedented accuracy and generality. These models need not be limited to virtual screening; they could also help address other challenges in drug discovery.
Introduction
Drug discovery is an expensive and lengthy enterprise which has been estimated to cost billions of dollars per drug 1. Nonclinical toxicology liabilities identified in vitro or in vivo are recognized as the top reason for compound attrition 2, especially promiscuous binding to off-targets 3. Considering relevant safety endpoints earlier in the drug discovery process and measuring or predicting novel endpoints that recapitulate biological complexity better could enable significant advances in the field 4,5. In this Feature article, we highlight some advances in the field of drug discovery and predictive toxicology and speculate about potential new experiments and directions for the field, with the goal of stimulating debate and innovation at the intersection of machine learning and DNA encoded libraries.
For decades, computational models have been used in drug discovery for target discovery, lead optimization and beyond 6. One area that has attracted particular attention is structure-based virtual screening (SBVS), that is, predicting which compounds from large libraries bind to protein targets of interest. Although the performance of traditional molecular docking was not comparable with experimental high-throughput screening (HTS) 7,8, substantial improvements are being made with the help of machine learning (ML) 9,10.
Computational toxicology is challenged by the small datasets available 11,12 and the general lack of in vivo models 13. Although more data for toxicity models is available in the pharmaceutical industry than in the public domain 3,14,15, it is limited to scaffolds of interest to the company in question and is rarely made public. Many algorithms have been applied to various drug toxicology datasets, with different molecular encodings or representations. Often, several algorithms perform equivalently, and the best predictive models are not newer algorithms such as deep learning approaches 15. Traditional expert systems, used to filter out molecules with unstable, toxic, or promiscuous substructures, are potentially useful 16,17. Scarcity of data and diversity of molecule structures are major bottlenecks in predictive toxicology today 11,12, and models for novel endpoints and with broader applicability are needed.
Recently, DNA-encoded library (DEL) technology has gained popularity as an experimental screening tool. DEL technology consists of binding assays to a target in the presence of a pooled library of test compounds. Each compound is tagged with a DNA sequence that can be sequenced and confidently mapped to the identity of the compound 18,19. Strategies for functional or even phenotypic DEL assays have also been developed to identify not just binding but biological activity 20. In a common format, tagged compounds are incubated with the immobilized target, and non-binders are carefully washed off 19, as in other affinity-selection techniques 21. Crucially, compounds binding to the target are identified en masse, cost-effectively, and with very high sensitivity thanks to the combination of PCR amplification and next-generation sequencing of the binders’ tags. Key experimental considerations for carrying out DEL selections are discussed elsewhere 19. One point worth highlighting is the value of carrying out a pre-selection sequencing of the library, which allows correcting for any potential tag imbalance 22.
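The correction for tag imbalance can be illustrated with a minimal sketch. Here, each compound's read frequency after selection is divided by its frequency in the pre-selection sequencing run, so that compounds that were simply over-represented in the library are not mistaken for binders. The function name, the pseudocount, and the toy read counts are illustrative assumptions and are not taken from the cited works, which describe more elaborate statistical treatments.

```python
def enrichment(pre_counts, post_counts, pseudocount=1.0):
    """Ratio of post-selection to pre-selection read frequency per compound.

    pre_counts / post_counts: dicts mapping compound ID -> sequencing reads.
    The pseudocount (an assumption) avoids division by zero for unseen tags.
    """
    compounds = set(pre_counts) | set(post_counts)
    n = len(compounds)
    pre_total = sum(pre_counts.values()) + pseudocount * n
    post_total = sum(post_counts.values()) + pseudocount * n
    scores = {}
    for cpd in compounds:
        pre_freq = (pre_counts.get(cpd, 0) + pseudocount) / pre_total
        post_freq = (post_counts.get(cpd, 0) + pseudocount) / post_total
        scores[cpd] = post_freq / pre_freq
    return scores


# Hypothetical read counts: cpd_B is rare in the library but well recovered.
pre = {"cpd_A": 100, "cpd_B": 10, "cpd_C": 100}
post = {"cpd_A": 50, "cpd_B": 50, "cpd_C": 5}
scores = enrichment(pre, post)
for cpd in sorted(scores, key=scores.get, reverse=True):
    print(cpd, round(scores[cpd], 2))
```

Without the pre-selection counts, cpd_A and cpd_B would look equally enriched; after normalization, cpd_B ranks clearly higher.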
To date, much of the focus on DEL chemical space has been on the creation of libraries using tools such as eDesigner, focusing on the reactions, physicochemical properties of the building blocks, and chemical diversity 23. Although the DEL selection signal is very noisy 22,24,25, the large volume of data and the power of ML for signal processing could enable ranking individual compounds in the library by binding affinity and predicting the binding ability of compounds far from the library space 26,27. However, this may run into difficulties related to the distance of the DEL molecules from the model training sets and to the general applicability of the models, which depends on the chemical property space they cover.
We propose that the combination of DEL and ML could also lead to major advances elsewhere in the drug discovery pipeline that are not currently appreciated. As discussed above, toxicology often illuminates roadblocks in the study of medicinal compounds. Here, we propose some strategies to generate additional data and models useful in predictive toxicology. Abundant data on small-molecule promiscuity and other types of endpoints could be generated using DEL technology. In turn, such data could be used to train predictive models in a supervised or semi-supervised manner. This approach could overcome some of the limitations of current toxicology models. The goal would be to complement existing models and assays that focus on different compound properties and toxicity endpoints, so that they become applicable at a larger scale and earlier in the drug discovery process.
DELs and ML for traditional toxicology endpoints
In the human body, a drug may interact with many organs, cell types, and target proteins (transporters, enzymes, ion channels, GPCRs, etc.). The interactions with these non-targets can reduce the effective drug concentration and cause side effects. In vitro safety pharmacology profiling is frequently performed using commercial products like the SafetyScreen44 28. This panel includes 38 binding assays and 6 functional assays of well-established targets and pathways leading to off-target adverse drug reactions. It would be useful to be able to virtually screen compounds for potential binding to such targets.
As a proof of concept, we curated datasets from ChEMBL 29 on 42 of the 44 targets used in the SafetyScreen44 in vitro screen (Table S1) and built a multi-task neural network (the ConvLSTM model) that predicts compound -log(Molar) binding affinity to them (see Supporting Information for details). The embedding layer in the model takes tokenized SMILES strings directly as inputs, which allows parallelizing training and inference on GPUs in an end-to-end manner. This leads to a significant speedup when working with large DELs compared to the sequential encoding of the input SMILES as hashed fingerprints (Figure 1A). We used a stratified random splitting strategy to ensure all targets had representation in the training and test sets for downstream analysis. The model performed very well on a hold-out test set of compounds (Figure 1B, S1). We also trained a similar LSTM model with a fingerprint-based encoding of the molecules. This model showed nearly identical performance (Table S2), suggesting the increase in inference speed did not come at a cost in model performance. The size of the training dataset for each target had no correlation with the test root-mean-square error (RMSE, Figure 1C), indicating the model was not overfitting on targets with the most training points and that it was leveraging the multi-task architecture to learn to generalize across endpoints. For a more stringent analysis of performance, we also retrained the ConvLSTM model on datasets split by Murcko scaffold (Figure S1, Table S4). The figures of merit are more conservative as they reflect performance on chemical space further away from that used to train the model. Still, the model performs reasonably well in general (albeit with a higher variability in individual target performance).
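To illustrate what feeding tokenized SMILES to an embedding layer involves, the sketch below splits SMILES strings into tokens with a regular expression and maps them to padded integer sequences. The token pattern, vocabulary scheme, and function names are assumptions chosen for illustration; the tokenization actually used in the ConvLSTM model is described in the Supporting Information and may differ.

```python
import re

# Multi-character tokens (bracket atoms, Br, Cl, two-digit ring closures)
# are listed first so they are matched before single characters.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().@:~*$])"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES: {smiles!r}")
    return tokens


def build_vocab(smiles_list):
    """Assign an integer ID to every token seen; ID 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for smiles in smiles_list:
        for token in tokenize(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(smiles, vocab, max_len):
    """Convert a SMILES string to a fixed-length list of token IDs."""
    ids = [vocab[token] for token in tokenize(smiles)][:max_len]
    return ids + [0] * (max_len - len(ids))


library = ["CCO", "CC(=O)Nc1ccc(Cl)cc1", "[NH3+]CC"]
vocab = build_vocab(library)
batch = [encode(s, vocab, max_len=24) for s in library]
print(tokenize("CC(=O)Nc1ccc(Cl)cc1"))
print(batch[0])
```

Because every molecule becomes an integer sequence of the same length, a whole batch can be stacked into one array and passed to an embedding layer on a GPU, which is the property the text relies on for fast inference over large libraries.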
Figure 1.
Multi-task ML is particularly well-suited to model multiple toxicology endpoints. A ConvLSTM model was built to predict 42 toxicity endpoints (IC50 values) based on in vitro data. (A) Estimated time for fingerprint calculation for one billion molecules using RDKit’s GetMorganFingerprintAsBitVect function compared with the custom SMILES tokenization pre-processing used in our ConvLSTM model. (B) Parity plot for 42 mixed targets, showing predicted vs. actual -log(Molar) values for a test set of compounds, with at least 10 datapoints for each of the 42 toxicity targets (see Table S1). RMSE and R² shown for the combined predictions. (C) RMSE vs. size of the training data for each target. The highest RMSE (red, CCK1) and lowest RMSE (Serotonin 5HT1B, blue) are highlighted, along with the toxicity target with the largest training set (hERG, pink). (D) t-SNE plot of chemistry space (input: Morgan Fingerprints of radius 3, 1024 bits) showing overlap of the DOS-DEL-1 and combined training sets of the 42 tox targets used to build the model.
We then used the ConvLSTM model to predict these endpoints across the DOS-DEL-1 library, a DEL consisting of 108,526 molecules 30. We make these predictions available as an example of virtual toxicity screening results (file TableS3.csv). A t-SNE plot (Figure 1D) shows very little overlap between the DOS-DEL-1 and ChEMBL training datasets. This indicates that generating experimental toxicology data for a DEL like this would greatly expand the chemical space covered by the model, as these molecules are not in the model training set, potentially improving its accuracy and generalizability.
Although only relatively small datasets are publicly available for predictive toxicology (a few hundred datapoints for most endpoints), this example shows that ML can deliver reasonable performance across much larger datasets, especially when multi-task learning is leveraged. Combining multiple related predictive tasks under a single model architecture can lead to superior predictions and generalizability 31. Moreover, the predictions in Table S3 indicate that a sizeable proportion of compounds in the DEL may have potential safety liabilities, highlighting the value of this type of model. We also expect that the predictive performance and generalizability of the model across all endpoints would increase substantially with much larger datasets.
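In a multi-task setting built from public data, most compounds have measurements for only a few targets, so the label matrix is sparse. One common way to handle this, sketched below under that assumption, is to mask unmeasured compound-target pairs when computing the error for each task. The function and the toy values are hypothetical and do not reproduce the model evaluated above.

```python
import math


def masked_rmse_per_task(y_true, y_pred):
    """Per-target RMSE that skips unmeasured (None) compound-target pairs.

    y_true, y_pred: lists of rows (one row per compound, one column per target).
    Returns one RMSE per target, or None if a target has no measurements.
    """
    n_tasks = len(y_true[0])
    sq_err = [0.0] * n_tasks
    counts = [0] * n_tasks
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t is None:  # no experimental value for this pair
                continue
            sq_err[j] += (t - p) ** 2
            counts[j] += 1
    return [math.sqrt(s / c) if c else None for s, c in zip(sq_err, counts)]


# Toy -log(Molar) values for three compounds and two targets.
y_true = [[6.0, None], [7.0, 5.0], [None, 8.0]]
y_pred = [[6.5, 4.0], [7.5, 5.0], [9.0, 7.0]]
rmse = masked_rmse_per_task(y_true, y_pred)
print(rmse)
```

The same mask can be applied to the training loss, so that every compound contributes to the shared layers even when it has been measured against a single target.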
DEL technology could be applied to generate abundant data (hundreds of thousands of datapoints per target), enabling more general and accurate predictive models of toxicity endpoints that could be used in many projects over time. The modeling of DEL data is receiving increasing interest, both to denoise the signal and to make predictions for compounds beyond the DEL 26,27. ML models, and particularly neural networks, are flexible enough to model arbitrarily complex functions, such as the binding of a small molecule to a protein target or collection of targets. While a DEL is traditionally selected against a single target, parallel target screening is possible on upwards of 100 targets 32. It is feasible, then, for DELs to generate datasets for modeling key targets like those in the SafetyScreen44 and other preclinical safety pharmacology studies 33.
The input to the model can be an encoding of the compound whose properties (e.g., promiscuity) one aims to predict. Conventional molecular descriptors and fingerprints have been used for decades for molecular encoding and perform very well in the small data regime 34. More recent approaches, such as string-based (e.g., SMILES, as in our model above) and graph-based end-to-end representations, are well-suited for very large datasets 35. In between those extremes, embeddings pre-trained from large and diverse sets of unlabeled compounds, including unsupervised 36 and self-supervised 37 representations, offer a powerful starting point for modeling DEL data.
Importantly, some machine learning models can provide not just a prediction, but also a measure of their uncertainty or confidence in the predictions 38,39. The confidence and applicability domain of the model are affected by many factors, including the complexity of the relationships being modeled, the flexibility of the model family chosen, the training datapoints available to constrain it across chemical space, and the noise in the data. With an estimate of uncertainty, one can make an informed decision on how much to trust a model’s predictions. The “promiscuity models” proposed could help identify toxicity liabilities early in the drug discovery process, virtually screening billions of molecules. The same models could also be used in lead optimization to reduce the chances that chemical modifications increase promiscuity. As with other types of models, they should be validated by in vitro or in vivo testing before being deployed to confirm their reliability.
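One simple way to obtain an uncertainty estimate, shown here only as an illustrative assumption and not as the method of the cited works, is to train several models independently and use the spread of their predictions. In the sketch below, the "models" are hypothetical stand-in functions, and the threshold for trusting a prediction is arbitrary.

```python
import statistics


def ensemble_predict(models, x):
    """Return the mean and sample standard deviation of the ensemble output."""
    preds = [model(x) for model in models]
    return statistics.mean(preds), statistics.stdev(preds)


def is_confident(std, max_std=0.5):
    """Flag a prediction as trustworthy if the ensemble members agree."""
    return std <= max_std


# Stand-ins for independently trained regressors (hypothetical).
toy_models = [
    lambda x: 2.0 * x,
    lambda x: 2.0 * x + 0.2,
    lambda x: 2.0 * x - 0.2,
]

mean, std = ensemble_predict(toy_models, 3.0)
print(f"prediction = {mean:.2f} +/- {std:.2f}, confident = {is_confident(std)}")
```

Compounds for which the ensemble disagrees strongly would be natural candidates for experimental follow-up rather than automatic filtering.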
DELs and ML for novel endpoints
DEL technology is generally applied to a single protein target at a time. As proposed in the previous section, these targets could be toxicology endpoints already in use. We further propose that DELs might be used to interrogate compounds for their binding to collections of proteins (Figure 2). For instance, protein extracts from a cell type, tissue or organ could be used. It might also be possible to capture multiple targets from adverse outcome pathways, which may be relevant for carcinogenicity and neurotoxicity prediction in particular 40. This approach might enable a more cost-efficient assessment of binding across a variety of targets and provide novel endpoints of relevance.
Figure 2.
Combining DELs and ML may provide novel endpoints for predictive toxicology. For instance, the compression of multiple targets in the same DEL could enable a cost-efficient screening for promiscuity. Native proteins can be extracted from human organs, tissues, or cells. These proteins can be used in-solution (A) or be immobilized (B) for DEL selections. (A) The protein extract may be incubated with the DEL. After incubation, a chemically reactive DNA probe (capture probe; a photo-crosslinker diazirine is shown as an example), which is complementary to the common primer-binding site of the library, is added. UV irradiation then triggers the covalent capture of the target and a primer extension step copies the DNA code 47. The protein-DNA conjugates may be purified by protein extraction or using a built-in biotin group in the capture probe, followed by PCR amplification and DNA sequencing. (B) The proteins are immobilized on beads, and the protein-coated beads are incubated with the DEL. After careful washing of non-binders, bound library members are eluted, PCR-amplified, and sequenced. In both formats, the DNA sequences confidently map to the chemical identity of the compounds. Given the diversity of proteins, the bound compounds identified from the DEL likely represent promiscuous binders for that specific protein mix (see Figure 4). The large amounts of data generated from DELs can be modeled using ML. Different endpoints (e.g., promiscuity across different targets, cell types or tissues) can be modeled simultaneously in a multi-task ML model. The model can then be used to inspect large chemical libraries and remove or filter compounds with potential safety liabilities early in the drug discovery pipeline.
Methods for tissue homogenization and gentle protein extraction in the presence of enzyme inhibitors are well established in molecular biology 41. It is also possible to acquire lyophilized protein extracts from a variety of human organs and tissues (such as kidney, lung, heart, liver, etc.), which preserve many proteins in their functional state. If desired, one could deplete common proteins in such extracts, like albumin and fibrinogen, using enzymatic or separation methods 42,43.
Two main DEL selection formats are available, with the target proteins being immobilized or remaining in solution (Figure 2). Recombinantly expressed proteins often have an affinity tag (e.g., His-tag) that can be conveniently used for immobilization. Moreover, different chemistries are available to immobilize proteins more generally. Magnetic beads functionalized with N-hydroxysuccinimide (NHS) esters or tosyl groups mainly react with the terminal amino groups of proteins; epoxy groups react with thiol or histidine groups on the proteins’ surface 44, while glutaraldehyde can react with a variety of functional groups 45. Immobilization allows the facile separation of the targets with an external magnet. It also separates proteins from each other, reducing aggregation and protein-protein interactions. Moreover, some methods may be compatible with proteolysis after DEL selection 46, allowing a proteomics assessment of the proteins that were captured.
Alternatively, in-solution DEL approaches may be better suited to assess complex biological mixtures as they eliminate potential biases introduced by the immobilization. These types of approaches might also enable selections with actual cells, which, for example, could better recapitulate membrane-associated targets. The challenge then becomes separating binders from non-binders without washing. This may be achieved by i) labeling the protein targets with a tag to preferentially prime the amplification of binders’ labels, ii) selectively degrading the tags of unbound library members, or iii) separating bound complexes (covalently stabilized or not) from unbound library members 20. Approach iii) seems most promising for dealing with multiple enzymes simultaneously. Since covalent crosslinking-based techniques do not require modification of the proteins 47-49, cell lysates, tissue samples and organ extracts may be directly subjected to the assay.
In the examples above, we suggested the possibility of compressing a complex mixture of targets in a DEL assay, which could provide a novel promiscuity endpoint for that mix of targets. Although few examples have been reported in the open literature in which multiple targets are pooled in a DEL 50,51, this may change as the technology becomes more widespread. This approach could also enable new opportunities in the discovery of bispecific therapeutics. Furthermore, DELs may enable the generation of data for other relevant ADMET endpoints. For example, cell permeability is notoriously difficult to assess and predict 5,52. It may be possible to prepare artificial liposomes or “ghost cells” 53, incubate them with DELs, and separate the liposomes from the supernatant by centrifugation or sieving (Figure 3). By sequencing the library members in the supernatant and in the liposomes, one could evaluate and model the partitioning of the library members between the liposomes and the external medium. The DNA tag is a significant cargo and may affect the partitioning 48, which nonetheless could offer opportunities for the discovery of cell-penetrating compounds. It may be possible to mitigate the effect of the DNA tag on the permeability of the small-molecule head, for example, by introducing a “membrane-friendly” linker between the DNA tag and the small-molecule head. This may allow library members to partition in a “head in, tail out” disposition (Figure 3), avoiding the requirement to translocate the DNA tag. One might even prepare vesicles with compositions representing different cell types. The ability to model and predict these novel endpoints might improve the ADMET profile of candidate compounds in drug discovery and empower the design of more specific drugs.
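As a rough illustration of how such partitioning data might be turned into a modeling endpoint, the sketch below computes, for each library member, the log ratio of its read frequency in the liposome fraction to that in the supernatant. This is a hypothetical analysis, not an established protocol: the function name, pseudocount, and read counts are assumptions, and frequencies normalized by sequencing depth only give relative partitioning unless the two fractions are calibrated against each other (for example, with spike-in controls).

```python
import math


def log_partition(liposome_counts, supernatant_counts, pseudocount=0.5):
    """log10 ratio of read frequencies (liposome / supernatant) per compound."""
    total_lip = sum(liposome_counts.values())
    total_sup = sum(supernatant_counts.values())
    compounds = set(liposome_counts) | set(supernatant_counts)
    result = {}
    for cpd in compounds:
        f_lip = (liposome_counts.get(cpd, 0) + pseudocount) / total_lip
        f_sup = (supernatant_counts.get(cpd, 0) + pseudocount) / total_sup
        result[cpd] = math.log10(f_lip / f_sup)
    return result


# Hypothetical read counts for two library members.
lip = {"a": 90, "b": 10}
sup = {"a": 10, "b": 90}
log_p = log_partition(lip, sup)
print({cpd: round(value, 2) for cpd, value in sorted(log_p.items())})
```

Positive values indicate members enriched in the liposome fraction; such values could serve as regression labels for one task of a multi-task model.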
Figure 3.
DELs might provide novel opportunities for generating data and modeling key ADMET endpoints, such as compound promiscuity and cell permeability. DELs could be incubated with liposomes. The DNA tag may be a significant cargo, and this may allow identifying cell-penetrating compounds from the liposomes (left). Alternatively, a hydrophobic linker might be used to reduce the impact of the DNA tag on the permeability of the small-molecule head (right). This could allow a better assessment of permeability by measuring what fraction of each library member is retained inside the liposomes or on the membrane.
Outstanding challenges
A variety of toxicity prediction models have been proposed over the years, most notably based on expert systems and traditional ML 11,12,15,17. Most are useful on drug-like molecules because that is the stage at which training data was generated. DELs could provide larger datasets to increase model applicability and provide novel toxicity endpoints that can be predicted. However, there are limitations.
Firstly, toxicology is not limited to promiscuity, as there are idiosyncratic toxicities which are very difficult to predict 54. Still, even if promiscuity does not always imply toxicity, it often does 3. Also, higher promiscuity may require higher doses of compound to achieve a therapeutic effect, thus increasing the chances of adverse effects 55.
Secondly, while the proposed strategy of pooling several targets in the same DEL assay could reduce costs (associated with purifying individual proteins and performing multiple DEL experiments), this format would also reduce the effective concentration of each protein. The effective concentration of protein in a DEL selection experiment sets a cutoff on the binding affinity of the ligands that will be positively detected (Figure 4). Thus, pooling several targets would reduce the ability to detect specific binders to individual proteins in favor of promiscuous binders to a range of proteins (Figure 4). The covalent crosslinking-based selection method mentioned above may increase the sensitivity of detecting promiscuous binders by irreversibly trapping the protein-ligand conjugates. On the other hand, there are methods to make the relative concentrations of proteins in protein extracts more uniform 42,43, but their relative abundance may not reflect their relative toxicological relevance. Also, membrane proteins may not be well represented, depending on the format of the DEL.
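The effect of pooling on the affinity cutoff can be sketched numerically. The code below assumes a simple 1:1 binding equilibrium, a fixed total amount of protein divided equally among the pooled targets, and a promiscuous compound that binds every protein in the pool with the same Kd at independent sites. These are simplifying assumptions made for illustration, and all concentrations are hypothetical.

```python
import math


def fraction_bound(p_total, l_total, kd):
    """Fraction of ligand bound at equilibrium for 1:1 binding.

    All three arguments must be in the same concentration units.
    Uses the numerically stable root of the quadratic
    [PL]^2 - (P + L + Kd)[PL] + P*L = 0.
    """
    b = p_total + l_total + kd
    disc = b * b - 4.0 * p_total * l_total
    complex_conc = 2.0 * p_total * l_total / (b + math.sqrt(disc))
    return complex_conc / l_total


POOL = 1e-6     # total protein in the selection (M), hypothetical
LIGAND = 1e-12  # concentration of one library member (M), hypothetical
KD = 1e-8       # dissociation constant of the compound (M), hypothetical

for n_targets in (1, 10, 100):
    per_target = POOL / n_targets
    # A selective compound only "sees" its own target.
    selective = fraction_bound(per_target, LIGAND, KD)
    # A promiscuous compound "sees" the whole pool (assumption stated above).
    promiscuous = fraction_bound(POOL, LIGAND, KD)
    print(f"{n_targets:>3} targets: selective {selective:.2f}, "
          f"promiscuous {promiscuous:.2f}")
```

Under these assumptions, the recovery of the selective binder falls as more targets share the same total protein, while the promiscuous binder is recovered equally well regardless of pool size.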
Thirdly, quantitative modeling of DEL results is challenging. The raw signal from DEL selections is noisy and may correlate poorly with binding affinity 25. Despite this challenge, significant advances are being made in denoising and extracting quantitative relationships from DEL results 22,26,27. Moreover, although DELs are deep, they are not necessarily broad. DELs are generally built from limited sets of scaffolds and building blocks (10³–10⁴) which are combinatorially reacted in a few cycles, typically 3. Moreover, the chemistry used for DEL synthesis needs to be compatible with DNA 56. Thus, DELs may cover a narrow region of the drug-like chemical space, which might affect the applicability domain of models learned from them. Still, recent works are showing a remarkable ability to generalize from DEL data 27. The ability of models to generalize could also benefit from robust chemical representations that leverage large amounts of unlabeled chemical data, as well as from multi-task learning. It is also becoming increasingly common to run multiple DELs against a target of interest, which would directly increase coverage. New data could also be incorporated into such ML models over time, as is currently done for other predictive toxicology models 3,14,15.
Conclusions and outlook
Compound promiscuity is a cause of toxicity, which in turn leads to attrition in small-molecule drug discovery. This is normally assessed once lead compounds are optimized, using experimental binding assays on tens of selected off-target proteins 33. This approach is not exhaustive, as toxicity associated with binding to other off-targets is often encountered in later preclinical cellular and animal models. Here, we propose a strategy to generate models for small-molecule promiscuity prediction. These models could be deployed rapidly and inexpensively at the early stages of drug discovery to screen large libraries of compounds and complement existing experimental tests.
To produce more general predictive toxicology models, we propose to combine two recent technologies: DEL and ML. This approach could be applied to important off-targets, such as those considered in safety pharmacology profiling. We also discuss some possibilities for DEL and ML to predict novel potential ADMET endpoints, such as membrane permeability and undesired binding to collections of proteins. When not just the library compounds but also the protein targets are pooled, the requirements on binding affinity and effective protein concentration become more stringent. This means that the DEL will naturally enrich for promiscuous binders to a variety of protein targets. This could be a cost-effective strategy to span many potential off-targets in one shot, while models for individual protein targets are built in the longer term. Once the models are trained, they can become part of the company’s assets, and be reused and expanded over time. ML has proved capable of generalizing binding affinity patterns from high-throughput screening 6,57 and is now making progress in denoising and generalizing quantitative trends from DELs 26,27,58. Although the models will not be perfect, they may contribute to reducing attrition rates in drug discovery.
Supplementary Material
Figure 4.
Relationships between compound recovery, target concentration ([P]total), individual ligand concentration ([L]total), and binding affinity (Kd) in DEL selections. [P]total, [L]total, and Kd have the same units and are displayed in logarithmic concentration units. The simulation considers the recovery achieved after a single ideal equilibration step, with a simple association-dissociation equilibrium (see Supporting Information for details). Since a DEL assay involves washing steps, we only expect compounds with high recoveries (>90%) to be identified as positives. The simulation indicates that the total protein concentration should be set considerably higher than that of the individual ligands to achieve a high recovery of tight binders (A, C). The results also indicate that, if multiple ligands can compete for the same active site, the total target concentration should be higher than the sum of all of them to enable a high recovery. In other words, the protein concentration affects the stringency of the recovery, such that the lower the protein concentration, the higher the binding affinity of the compound will have to be for it to be positively observed in the DEL (B). As a first approximation, only compounds with Kd ≪ [P]total are expected to be recovered.
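For reference, a simple association-dissociation equilibrium of the kind mentioned in the caption admits a closed-form recovery. The derivation below is the textbook single-site result, written with the caption's symbols. It is given as a sketch under the assumption of 1:1 binding with no competition; the simulation in the Supporting Information may include details not captured here.

```latex
% Single-site equilibrium P + L <=> PL, with mass balances
K_d = \frac{[P][L]}{[PL]}, \qquad
[P]_{total} = [P] + [PL], \qquad
[L]_{total} = [L] + [PL]

% Substituting the mass balances gives a quadratic in [PL]
[PL]^2 - \left([P]_{total} + [L]_{total} + K_d\right)[PL]
  + [P]_{total}[L]_{total} = 0

% The physically meaningful root is the smaller one
[PL] = \frac{\left([P]_{total} + [L]_{total} + K_d\right)
  - \sqrt{\left([P]_{total} + [L]_{total} + K_d\right)^2
  - 4[P]_{total}[L]_{total}}}{2}

% Recovery, and its limit when [L]_{total} \ll [P]_{total} + K_d
\text{recovery} = \frac{[PL]}{[L]_{total}}
  \approx \frac{[P]_{total}}{[P]_{total} + K_d}
```

In the dilute-ligand limit, a recovery above 90% requires [P]total ≥ 9 Kd, which is consistent with the rule of thumb that only compounds with Kd ≪ [P]total are expected to be recovered.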
Acknowledgements
The authors thank Gerald Kolodny (Harvard Medical School, USA), Alex Satz (WuXi AppTec, Germany), and Jose Manuel Guisan (Institute of Catalysis and Petrochemistry, Spain) for helpful discussions. The authors also thank Piyanut Pinyou (Suranaree University of Technology, Thailand) for help with the figures and Kimberley Zorn for the earlier Assay Central support. Figures 2 and 3 were created with Biorender.com.
This work was supported in part by NIH funding to S.E.: R44GM122196-02A1 from NIGMS, 3R43AT010585-01S1 from NCCAM, R43DA055419-01 from NIDA, 1R43ES031038-01 from NIEHS. Research reported in this publication was supported by the National Institute of Environmental Health Sciences of the National Institutes of Health under Award Number R43ES031038. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
Conflict of Interest
S.E. is owner, and J.G. and F.U. are employees of Collaborations Pharmaceuticals, Inc.
References
- 1. Wouters OJ, McKee M, Luyten J. Estimated Research and Development Investment Needed to Bring a New Medicine to Market, 2009-2018. JAMA. 2020;323(9):844–853. doi: 10.1001/jama.2020.1166
- 2. Waring MJ, Arrowsmith J, Leach AR, et al. An analysis of the attrition of drug candidates from four major pharmaceutical companies. Nat Rev Drug Discov. 2015;14(7):475–486. doi: 10.1038/nrd4609
- 3. Rao MS, Gupta R, Liguori MJ, et al. Novel Computational Approach to Predict Off-Target Interactions for Small Molecules. Frontiers in Big Data. 2019;2. Accessed April 4, 2022. https://www.frontiersin.org/article/10.3389/fdata.2019.00025
- 4. Avila AM, Bebenek I, Bonzo JA, et al. An FDA/CDER perspective on nonclinical testing strategies: Classical toxicology approaches and new approach methodologies (NAMs). Regul Toxicol Pharmacol. 2020;114:104662. doi: 10.1016/j.yrtph.2020.104662
- 5. Bender A, Cortés-Ciriano I. Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: Ways to make an impact, and why we are not there yet. Drug Discovery Today. 2021;26(2):511–524. doi: 10.1016/j.drudis.2020.12.009
- 6. Ekins S, Puhl AC, Zorn KM, et al. Exploiting machine learning for end-to-end drug discovery and development. Nat Mater. 2019;18(5):435–441. doi: 10.1038/s41563-019-0338-z
- 7. Blay V, Tolani B, Ho SP, Auer M, Arkin MR. High-Throughput Screening: today’s biochemical and cell-based approaches. Drug Discov Today. 2020;(Journal Article).
- 8. Warren GL, Andrews CW, Capelli AM, et al. A Critical Assessment of Docking Programs and Scoring Functions. J Med Chem. 2006;49(20):5912–5931. doi: 10.1021/jm050362n
- 9. Li H, Sze KH, Lu G, Ballester PJ. Machine-learning scoring functions for structure-based virtual screening. Wires Comput Mol Sci. 2020;(Journal Article):e1478.
- 10. McGibbon M, Money-Kyrle M, Blay V, Houston DR. SCORCH: Improving virtual screening with a consensus of machine learning classifiers, data augmentation, and uncertainty estimation. Journal of Advanced Research. Published online Under review.
- 11. Wang MWH, Goodman JM, Allen TEH. Machine Learning in Predictive Toxicology: Recent Applications and Future Directions for Classification Models. Chem Res Toxicol. 2021;34(2):217–239. doi: 10.1021/acs.chemrestox.0c00316
- 12. Basile AO, Yahi A, Tatonetti NP. Artificial Intelligence for Drug Toxicity and Safety. Trends in Pharmacological Sciences. 2019;40(9):624–635. doi: 10.1016/j.tips.2019.07.005
- 13. Bhhatarai B, Walters WP, Hop CECA, Lanza G, Ekins S. Opportunities and challenges using artificial intelligence in ADME/Tox. Nat Mater. 2019;18(5):418–422. doi: 10.1038/s41563-019-0332-5
- 14. Göller AH, Kuhnke L, Montanari F, et al. Bayer’s in silico ADMET platform: a journey of machine learning over the past two decades. Drug Discovery Today. 2020;25(9):1702–1709. doi: 10.1016/j.drudis.2020.07.001
- 15. Lane TR, Foil DH, Minerali E, Urbina F, Zorn KM, Ekins S. Bioactivity Comparison across Multiple Machine Learning Algorithms Using over 5000 Datasets for Drug Discovery. Mol Pharmaceutics. 2021;18(1):403–415. doi: 10.1021/acs.molpharmaceut.0c01013
- 16. Baell JB, Holloway GA. New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays. J Med Chem. 2010;53(7):2719–2740. doi: 10.1021/jm901137j
- 17. Foster RS, Fowkes A, Cayley A, et al. The importance of expert review to clarify ambiguous situations for (Q)SAR predictions under ICH M7. Genes and Environment. 2020;42(1):27. doi: 10.1186/s41021-020-00166-y
- 18. Zimmermann G, Neri D. DNA-encoded chemical libraries: foundations and applications in lead discovery. Drug Discovery Today. 2016;21(11):1828–1834. doi: 10.1016/j.drudis.2016.07.013
- 19. Satz AL, Brunschweiger A, Flanagan ME, et al. DNA-encoded chemical libraries. Nat Rev Methods Primers. 2022;2(1):1–17. doi: 10.1038/s43586-021-00084-5
- 20. Huang Y, Li Y, Li X. Strategies for developing DNA-encoded libraries beyond binding assays. Nat Chem. 2022;14(2):129–140. doi: 10.1038/s41557-021-00877-x
- 21. Blay V, Otero-Muras I, Annis DA. Solving the Competitive Binding Equilibria between Many Ligands: Application to High-Throughput Screening and Affinity Optimization. Anal Chem. 2020;(Journal Article): 10.1021/acs.analchem.0c02715.
- 22. Kómár P, Kalinić M. Denoising DNA Encoded Library Screens with Sparse Learning. ACS Comb Sci. 2020;22(8):410–421. doi: 10.1021/acscombsci.0c00007
- 23. Martín A, Nicolaou CA, Toledo MA. Navigating the DNA encoded libraries chemical space. Commun Chem. 2020;3(1):1–9. doi: 10.1038/s42004-020-00374-1
- 24. Satz AL. DNA Encoded Library Selections and Insights Provided by Computational Simulations. ACS Chem Biol. 2015;10(10):2237–2245. doi: 10.1021/acschembio.5b00378
- 25. Satz AL. Simulated Screens of DNA Encoded Libraries: The Potential Influence of Chemical Synthesis Fidelity on Interpretation of Structure–Activity Relationships. ACS Comb Sci. 2016;18(7):415–424. doi: 10.1021/acscombsci.6b00001
- 26.Ma R, Dreiman GHS, Ruggiu F, et al. Regression modeling on DNA encoded libraries. In: ; 2021. Accessed April 2, 2022. https://openreview.net/forum?id=rrcoPmV1XgN [Google Scholar]
- 27.McCloskey K, Sigel EA, Kearnes S, et al. Machine Learning on DNA-Encoded Libraries: A New Paradigm for Hit Finding. J Med Chem. 2020;63(16):8857–8866. doi: 10.1021/acs.jmedchem.0c00452 [DOI] [PubMed] [Google Scholar]
- 28.Bowes J, Brown AJ, Hamon J, et al. Reducing safety-related drug attrition: the use of in vitro pharmacological profiling. Nat Rev Drug Discov. 2012;11(12):909–922. doi: 10.1038/nrd3845 [DOI] [PubMed] [Google Scholar]
- 29.Gaulton A, Hersey A, Nowotka M, et al. The ChEMBL database in 2017. Nucleic Acids Res. 2017;45(D1):D945–D954. doi: 10.1093/nar/gkw1074 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Gerry CJ, Wawer MJ, Clemons PA, Schreiber SL. DNA Barcoding a Complete Matrix of Stereoisomeric Small Molecules. J Am Chem Soc. 2019;141(26):10225–10235. doi: 10.1021/jacs.9b01203 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Sosnin S, Vashurina M, Withnall M, Karpov P, Fedorov M, Tetko IV. A Survey of Multi-task Learning Methods in Chemoinformatics. Molecular Informatics. 2019;38(4):1800108. doi: 10.1002/minf.201800108 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Machutta CA, Kollmann CS, Lind KE, et al. Prioritizing multiple therapeutic targets in parallel using automated DNA-encoded library screening. Nat Commun. 2017;8(1):16081. doi: 10.1038/ncomms16081 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Azzaoui K, Hamon J, Faller B, et al. Modeling promiscuity based on in vitro safety pharmacology profiling data. ChemMedChem. 2007;2(6):874–880. doi: 10.1002/cmdc.200700036 [DOI] [PubMed] [Google Scholar]
- 34.Blay V, Radivojevic T, Allen JE, Hudson CM, Garcia Martin H. MACAW: An Accessible Tool for Molecular Embedding and Inverse Molecular Design. J Chem Inf Model. Published online July 20, 2022. doi: 10.1021/acs.jcim.2c00229 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Chithrananda S, Grand G, Ramsundar B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. Published online October 19, 2020. doi: 10.48550/arXiv.2010.09885 [DOI] [Google Scholar]
- 36.Jaeger S, Fulle S, Turk S. Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. J Chem Inf Model. 2018;58(1):27–35. doi: 10.1021/acs.jcim.7b00616 [DOI] [PubMed] [Google Scholar]
- 37.Wang Y, Wang J, Cao Z, Barati Farimani A. Molecular contrastive learning of representations via graph neural networks. Nat Mach Intell. 2022;4(3):279–287. doi: 10.1038/s42256-022-00447-x [DOI] [Google Scholar]
- 38.Wang D, Yu J, Chen L, et al. A hybrid framework for improving uncertainty quantification in deep learning-based QSAR regression modeling. Journal of Cheminformatics. 2021;13(1):69. doi: 10.1186/s13321-021-00551-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Abdar M, Pourpanah F, Hussain S, et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion. 2021;76:243–297. doi: 10.1016/j.inffus.2021.05.008 [DOI] [Google Scholar]
- 40.Leist M, Ghallab A, Graepel R, et al. Adverse outcome pathways: opportunities, limitations and open questions. Arch Toxicol. 2017;91(11):3477–3505. doi: 10.1007/s00204-017-2045-3 [DOI] [PubMed] [Google Scholar]
- 41.Feist P, Hummon AB. Proteomic Challenges: Sample Preparation Techniques for Microgram-Quantity Protein Analysis from Biological Samples. Int J Mol Sci. 2015;16(2):3537–3563. doi: 10.3390/ijms16023537 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Fonslow BR, Stein BD, Webb KJ, et al. Digestion and depletion of abundant proteins improves proteomic coverage. Nat Methods. 2013;10(1):54–56. doi: 10.1038/nmeth.2250 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Tu C, Rudnick PA, Martinez MY, et al. Depletion of Abundant Plasma Proteins and Limitations of Plasma Proteomics. J Proteome Res. 2010;9(10):4982–4991. doi: 10.1021/pr100646w [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Grazú V, Abian O, Mateo C, Batista-Viera F, Fernández-Lafuente R, Guisán JM. Novel Bifunctional Epoxy/Thiol-Reactive Support to Immobilize Thiol Containing Proteins by the Epoxy Chemistry. Biomacromolecules. 2003;4(6):1495–1501. doi: 10.1021/bm034262f [DOI] [PubMed] [Google Scholar]
- 45.Migneault I, Dartiguenave C, Bertrand MJ, Waldron KC. Glutaraldehyde: behavior in aqueous solution, reaction with proteins, and application to enzyme crosslinking. BioTechniques. 2004;37(5):790–802. doi: 10.2144/04375RV01 [DOI] [PubMed] [Google Scholar]
- 46.Shah P, Zhang B, Choi C, et al. Tissue proteomics using chemical immobilization and mass spectrometry. Analytical Biochemistry. 2015;469:27–33. doi: 10.1016/j.ab.2014.09.017 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Shi B, Deng Y, Li X. Polymerase-Extension-Based Selection Method for DNA-Encoded Chemical Libraries against Nonimmobilized Protein Targets. ACS Comb Sci. 2019;21(5):345–349. doi: 10.1021/acscombsci.9b00011 [DOI] [PubMed] [Google Scholar]
- 48.Cai B, Kim D, Akhand S, et al. Selection of DNA-Encoded Libraries to Protein Targets within and on Living Cells. J Am Chem Soc. 2019;141(43):17057–17061. doi: 10.1021/jacs.9b08085 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Huang Y, Meng L, Nie Q, et al. Selection of DNA-encoded chemical libraries against endogenous membrane proteins on live cells. Nat Chem. 2021;13(1):77–88. doi: 10.1038/s41557-020-00605-x [DOI] [PubMed] [Google Scholar]
- 50.Mendes KR, Malone ML, Ndungu JM, et al. High-throughput Identification of DNA-Encoded IgG Ligands that Distinguish Active and Latent Mycobacterium Tuberculosis Infections. ACS Chem Biol. 2017;12(1):234–243. doi: 10.1021/acschembio.6b00855 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Chan AI, McGregor LM, Jain T, Liu DR. Discovery of a Covalent Kinase Inhibitor from a DNA-Encoded Small-Molecule Library × Protein Library Selection. J Am Chem Soc. 2017;139(30):10192–10195. doi: 10.1021/jacs.7b04880 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Berben P, Bauer-Brandl A, Brandl M, et al. Drug permeability profiling using cell-free permeation tools: Overview and applications. European Journal of Pharmaceutical Sciences. 2018;119:219–233. doi: 10.1016/j.ejps.2018.04.016 [DOI] [PubMed] [Google Scholar]
- 53.Le QV, Lee J, Lee H, Shim G, Oh YK. Cell membrane-derived vesicles for delivery of therapeutic agents. Acta Pharmaceutica Sinica B. 2021;11(8):2096–2113. doi: 10.1016/j.apsb.2021.01.020 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Mosedale M, Watkins PB. Understanding Idiosyncratic Toxicity: Lessons Learned from Drug-Induced Liver Injury. J Med Chem. 2020;63(12):6436–6461. doi: 10.1021/acs.jmedchem.9b01297 [DOI] [PubMed] [Google Scholar]
- 55.Sameshima T, Yukawa T, Hirozane Y, et al. Small-Scale Panel Comprising Diverse Gene Family Targets To Evaluate Compound Promiscuity. Chem Res Toxicol. 2020;33(1):154–161. doi: 10.1021/acs.chemrestox.9b00128 [DOI] [PubMed] [Google Scholar]
- 56.Fitzgerald PR, Paegel BM. DNA-Encoded Chemistry: Drug Discovery from a Few Good Reactions. Chem Rev. 2021;121(12):7155–7177. doi: 10.1021/acs.chemrev.0c00789 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Dreiman GHS, Bictash M, Fish PV, Griffin L, Svensson F. Changing the HTS Paradigm: AI-Driven Iterative Screening for Hit Finding. SLAS DISCOVERY: Advancing the Science of Drug Discovery. 2021;26(2):257–262. doi: 10.1177/2472555220949495 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Urbina F, Ekins S. The commoditization of AI for molecule design. Artificial Intelligence in the Life Sciences. 2022;2:100031. doi: 10.1016/j.ailsci.2022.100031 [DOI] [PMC free article] [PubMed] [Google Scholar]