Author manuscript; available in PMC: 2022 Apr 1.
Published in final edited form as: Drug Discov Today. 2021 Jan 14:S1359-6446(21)00005-2. doi: 10.1016/j.drudis.2021.01.005

Evaluating and evolving a screening library in academia: the St. Jude approach

Gisele Nishiguchi 1, Sourav Das 1, Jason Ochoada 1, Heather Long 1, Richard E Lee 1, Zoran Rankovic 1, Anang A Shelat 1
PMCID: PMC8131249  NIHMSID: NIHMS1699166  PMID: 33453364

Abstract

The quality of lead compounds is a key factor for determining the success of chemical probe and drug discovery programs. Since high-throughput screening (HTS) continues to be a dominant lead generation paradigm, access to high quality screening libraries is critical for such efforts in both industry and academia. Here, we discuss the strategy implemented a decade ago to build from scratch one of the largest compound collections in academia containing approximately 575,000 carefully annotated small molecules, and a recent multidisciplinary effort designed to further enhance the collection to meet our research demands for the next decade.

Keywords: Library design, chemical space, compound collection

Teaser

A high-quality screening library is the foundation of successful chemical probe and drug discovery programs. Here, we discuss our strategy to build a next generation screening library in academia.

Introduction

High-quality chemical probes are vital to biomedical research because they enable the interrogation of biological pathways and the validation of new targets [1–3]. Among the many screening paradigms available for initiating a chemical probe discovery project, conventional high-throughput screening (HTS) remains an attractive and effective strategy. Fundamental to an HTS campaign is a diverse collection of compounds that can be efficiently screened phenotypically or biochemically against a target of interest. Screening libraries represent a significant investment and major asset for research institutions and companies engaged in drug discovery. Ideally, such a library should be representative of biologically relevant chemical space, composed of chemically attractive compounds with tractable synthetic accessibility, and free of undesirable chemical functionalities.

Recognizing the value of a high-quality and well-maintained compound collection, we assembled a team of scientists responsible for the management, curation, and continuous enhancement of the screening library at St. Jude Children’s Research Hospital (SJCRH). Crucial to this effort was a comprehensive analysis and understanding of our existing library. Herein, we describe a detailed interrogation of the SJCRH compound collection from multiple perspectives. Using the knowledge obtained from this analysis, we developed a strategy to renew and expand our existing library. We document the approach and outcomes from recent compound acquisitions. Finally, we compare our molecules to the chemical space of commercially available compound libraries, the CHEMBL database, Dark Matter, and patents to identify gaps and opportunities for enhancement that will guide the development of our chemical library for the next decade.

The SJCRH Compound Collection

The mission of SJCRH is to advance cures, and means of prevention, for pediatric catastrophic diseases through research and treatment. The development of chemical probes amenable to both in vitro and in vivo proof-of-concept studies is the primary aim of our programs and the identification of hits suitable for optimization is the main goal of our screening campaigns. Since 2005, we have acquired approximately 575,000 unique small molecules, 90% of which were purchased from multiple vendors in the first 3 years. Our initial design strategy emphasized adherence to Lipinski’s Rule of Five (RO5) [4], elimination of PAINS [5] and other compounds with suspect chemical moieties, and maximization of diversity at the scaffold level while sampling multiple analogs per scaffold. Since 2009, the collection has been augmented with compounds from in-house parallel synthesis, analogs of internal medicinal chemistry projects, focused libraries designed for specific targets or target classes, fragments, natural products, approved drugs, and chemical probes from the literature.

Quality Control (QC).

The compound collection is managed using an automated storage system (Brooks Life Sciences) that currently holds over 4 million sample tubes as DMSO solutions stored at −20 °C (Fig. 1A). The 384-way tubes (10 μL volume, single use) are primarily used to cherry-pick HTS screening hits for dose-response confirmation. The 96-way tubes (≥100 μL) serve as long-term reservoirs and are typically used to replenish the 384-way tubes once they are depleted. Historically, compounds were not internally checked for purity or identity at the time of purchase and we relied primarily on QC data provided by the vendor. To confirm the integrity of our compounds after several years of storage, we experimentally determined the compound purity and identity of a representative subset of the collection from both storage formats. To select from the 96-way tubes, compounds were randomly picked from a structurally diverse set of scaffolds with drug-like properties (molecular weight [“mw”] 200–500 Da and calculated logP [“clogP”] < 5), yielding 523 compounds. For the 384-way tubes, compounds with >15 cherry-picks remaining were randomly selected, providing 256 compounds covering a wider range of physicochemical property space (mw 150–600 Da and clogP < 6). These 779 compounds were then submitted for quality control assessment. Liquid chromatography–mass spectrometry (LCMS) data was obtained using an ultra-performance liquid chromatography system equipped with ultraviolet and evaporative light scattering detectors and purity was calculated as the average of the two detection methods [6]. Identity was confirmed by mass spectrometry. Despite several years in storage, we found that 77.8% (N = 606) of the compounds in our test set had >90% purity and an additional 9.6% (N = 75) were within 80–90% purity (Fig. 1B).
We found little difference in quality between compounds stored in either tube format, and no significant correlation between purity and mw, clogP, or the time since acquisition (Supplementary Fig. S1). Overall, 87.4% (N = 681) of our compounds passed our QC criteria of >80% purity, indicating that our current collection was still useable for screening campaigns. These results were encouraging and comparable to those reported by GSK, where 89% of compounds showed >80% purity after 6 years of storage at −20 °C in sealed 384 deep-well blocks [7]. This study provided the impetus to update our reformatting processes so that we can establish the integrity of compounds at the time of purchase. Currently, we randomly check 12.5% of the compounds from a vendor plate by LCMS to confirm identity and purity. The QC procedure was applied to the Lead-like library described later in this manuscript.
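The pass rates quoted above follow directly from the reported counts. The short Python sketch below is only an illustration of that arithmetic; the function name, the strict ">" comparisons at the 80% and 90% boundaries, and the synthetic purity values are assumptions made for the example and are not part of the original QC workflow.

```python
def purity_summary(purities, pass_threshold=80.0, high_threshold=90.0):
    """Summarize LCMS purities (in percent) using the two cutoffs quoted in the text."""
    total = len(purities)
    n_high = sum(1 for p in purities if p > high_threshold)   # >90% purity
    n_pass = sum(1 for p in purities if p > pass_threshold)   # >80% purity (QC pass)
    return {
        "n": total,
        "n_high": n_high,
        "n_pass": n_pass,
        "pct_high": round(100.0 * n_high / total, 1),
        "pct_pass": round(100.0 * n_pass / total, 1),
    }


# Synthetic purities chosen only to reproduce the reported counts:
# 606 compounds above 90%, 75 between 80 and 90%, and the remaining 98 below 80%.
example = [95.0] * 606 + [85.0] * 75 + [60.0] * 98
summary = purity_summary(example)
print(summary)
```

With these counts the sketch returns 77.8% for the >90% bin and 87.4% for the overall pass rate, matching the values reported for the 779-compound test set.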

Figure 1. Quality control and physicochemical property assessment of the existing SJCRH Chemical Library.

(A) Overview of the procedure used to select compounds from the library to be assessed for identity and purity. (B) Distribution of observed compound purities from (A) as a function of 96- and 384-way storage formats. (C) Radar plot (top, contoured at the 25th, 50th, 75th, and 90th percentiles) and table of mean values (bottom) for the nine physicochemical properties and structural alerts calculated from the library. The values on each spoke of the radar plot represent the 2.5 – 97.5 percentile range for that parameter across all compounds in the SJCRH and Vendor (described later) libraries. The table is color coded from blue to red, indicating the lowest and highest observed value, respectively, for that parameter. The differences between the 4 sub-library means were statistically significant for all 10 properties (P < 0.001, One-way ANOVA). (D) Density plot (top) and table reporting model weights (bottom) for the first 2 dimensions obtained from applying linear discriminant analysis (LDA) using the 10 calculated values from (C) to classify the 4 SJCRH sub-libraries (Bioactives, Diversity, Focused, and Fragments). The table is color coded from blue to red, indicating the lowest and highest observed values, respectively. Densities are estimated at the 50th (solid) and 95th (dotted) percentiles. The coordinates for LDADim1 and LDADim2 are reported for each sub-library. (E) Structures of the compounds labelled #1–5 in (D). Abbreviations: molecular weight (mw), calculated logP (clogP), calculated logD at pH 7.4 (clogD), topological polar surface area (psa), hydrogen-bond acceptors (hacc), hydrogen-bond donors (hdon), rotatable bonds (rbds), aromatic rings (aring), fraction sp3 atoms (fsp3), and structural alerts (alerts).

Physicochemical Properties.

To explore the distribution of physicochemical properties in our library, we used Pipeline Pilot (Biovia, 20.0.1) to calculate nine commonly utilized molecular descriptors – mw, clogP, calculated logD at pH 7.4 (“clogD”), topological polar surface area (“psa”), fraction of sp3 centers (“fsp3”), rotatable bonds (“rbds”), number of aromatic rings (“aring”), hydrogen-bond donors (“hdon”), and hydrogen-bond acceptors (“hacc”). To broadly evaluate chemical attractiveness based on the presence of reactive or assay-incompatible structural features, we used a combination of PAINS filters [5] and a modified version of the Pfizer filters [8] to compute structural alerts (“alerts”) (Supplementary Table S1). For our purposes, it was less critical to filter out compounds containing certain toxicophores (e.g., aryl-NO2) or halogenated aromatics (e.g., aryl-Br) that were present in the full Pfizer filter list, since these functionalities modulate the electronics of aromatic rings and can be useful for uncovering structure-activity relationships in biochemical assays. The distribution of these molecular properties for the entire library at the 25th, 50th, 75th, and 90th percentiles is reported in the radar plot in Fig. 1C. We observed a balanced distribution of the molecular descriptors within drug-like chemical space [4], and structural alerts were infrequent.

To further interrogate this distribution, we classified our screening collection into 4 sub-libraries: (1) Bioactives – molecules with known or reported biological function, including FDA approved drugs, clinical candidates, or chemical tools; (2) Diversity – compounds obtained from commercial screening libraries that generally followed RO5 criteria; (3) Focused – molecules designed for a specific biological target or target class, such as analogs based around the scaffold of a known active or compounds identified through virtual screening techniques such as docking; and (4) Fragments – low mw compounds used for fragment-based screening that generally follow the Rule of 3 [9]. While Bioactives is the primary set used for drug repurposing screens and Fragments are not widely tested in cell-based assays, all sub-libraries are often screened in HTS projects involving biochemical targets. The mean value for each physicochemical property in the full library and 4 sub-libraries is reported at the bottom of Fig. 1C. As expected, Fragments displayed the lowest values for properties related to size and lipophilicity but were intermediate with respect to fsp3. Compared to Diversity, Bioactives tended to be lower in mw, lipophilicity, and aring, but higher in psa, hacc, hdon, fsp3, and alerts. Unexpectedly, our Focused compounds tended to be higher in mw, lipophilicity, rbds, and aring – properties which generally reflect potency optimization, but limit bioavailability.

Next, we used linear discriminant analysis (LDA) to better explore the physicochemical differences between the 4 sub-libraries. Whereas principal component analysis seeks to identify the dimensions that best capture the variance observed in a population, LDA maximizes discrimination between groups within the data – in this case, the 4 sub-libraries. Fig. 1D shows the chemical space defined by the projection of the first two LDA dimensions. The accuracy of the model for predicting sub-library membership was 0.59, compared to the no information rate of 0.25 (P < 0.001) (Supplemental Data). LDA dimension 1 (‘LDADim1’, abscissa) was strongly weighted by negative contributions from mw, hacc, and aring, meaning molecules with relatively higher values for these parameters were shifted west. Interestingly, LDA dimension 2 (‘LDADim2’, ordinate) was positively weighted by clogP, but negatively weighted by clogD. LogP is equivalent to logD for non-ionizable compounds, so these two features would nearly cancel each other out in this dimension unless ionizable groups were present. Likewise, hacc and hdon had strong positive weights, but were countered by a negative contribution from psa. Relatively lower values for rbds, and higher values for fsp3 and alerts, shifted molecules north. Examples of compounds in our library populating different regions of this chemical space are shown in Fig. 1E.

At the 50th percentile (solid lines), all libraries occupied a narrow portion of chemical space (−2.0 < LDADim1 < 2.0; −1.5 < LDADim2 < 1.5). Thus, despite differences in their etiology, the median compound from each of the 4 sub-libraries displayed a similar distribution of physicochemical property values. Diversity and Focused were distinct from Fragments, while Bioactives showed the broadest distribution of physicochemical properties and overlapped with the other sub-libraries. Bioactives also occupied the largest area of chemical space at the 95th percentile (dotted lines). Most of our library compounds – except those in Bioactives – were located at LDADim2 < 2.5. This analysis revealed under-represented regions of chemical space that could be targeted with library expansion.

Molecular scaffold analysis.

Scaffold analysis examines the distribution of molecular topologies and is another tool to interrogate a chemical library. We applied the Bemis-Murcko algorithm [10] in Pipeline Pilot to reduce our library molecules to contiguous ring systems plus chains that link two or more rings (Fig. 2A). This procedure generated 60,133 unique scaffolds. The distribution of molecules per scaffold for the entire library is reported in Fig. 2B. Nearly half of our scaffolds (N=29,665) were singletons, whereas 41% (N=24,708) were represented by 2–20 molecules. About 8% (N=4,973) of the total number of scaffolds had 20–100 representatives and only a small fraction of scaffolds (1.3%) had >100 analogs. This data is indicative of a screening collection that possesses high scaffold diversity but limited analog density.
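The analog-density tally described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the Pipeline Pilot protocol used in the study: it assumes each molecule has already been reduced to a scaffold key (for example, a canonical scaffold SMILES), and because the reported ranges 2–20 and 20–100 share an endpoint, the bin edges below (2–20, 21–100) are an assumption.

```python
from collections import Counter


def analog_density_bins(scaffold_of):
    """Bin scaffolds by the number of molecules that share them.

    scaffold_of maps a molecule identifier to its scaffold key.
    """
    sizes = Counter(scaffold_of.values())  # molecules per scaffold
    bins = Counter()
    for n in sizes.values():
        if n == 1:
            bins["singleton"] += 1
        elif n <= 20:
            bins["2-20"] += 1
        elif n <= 100:
            bins["21-100"] += 1
        else:
            bins[">100"] += 1
    return dict(bins)


# Toy input: two molecules share a benzene scaffold, one has a pyridine scaffold.
toy = {"m1": "c1ccccc1", "m2": "c1ccccc1", "m3": "c1ccncc1"}
toy_bins = analog_density_bins(toy)
print(toy_bins)
```

Dividing each bin count by the total number of scaffolds gives the percentages quoted in the text, for example 29,665 / 60,133 ≈ 49% singletons.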

Figure 2. Molecular scaffold analysis of the existing SJCRH chemical library.

(A) Example of the application of Bemis-Murcko fragmentation to obtain a scaffold from a molecule. (B) Distribution of analog densities across the entire SJCRH library. (C) Total number of scaffolds and molecules per scaffold for the 4 SJCRH sub-libraries. (D) Analog density for the top 10 scaffolds in the library as a function of molecular weight. (E) Distribution of the nearest neighbor Tanimoto similarity (ECFP_6) value calculated for each scaffold in the library. The average similarity value is highlighted.

We next examined the distribution of scaffolds in our 4 library sub-types (Fig. 2C). As expected, Diversity had the largest number of scaffolds (34,356) and the highest number of molecules per scaffold (14.1). This portion of our compound library was designed to heavily sample analogs in order to enable rapid assessment of structure-activity relationships from screening hits. Focused and Fragments had about 3 compounds per scaffold. Not surprisingly, Bioactives had the lowest analog coverage as this sub-library was mainly designed to retain the most biologically active species per scaffold.

While performing this analysis, we were curious about the scaffolds containing the highest number of analogs. The 10 most represented scaffolds are shown in Fig. 2D. Phenyl had the greatest number of analogs (N=3,658), followed by N-phenylbenzamide (N=2,635), 2-phenoxy-N-phenylacetamide (N=2,165), and phenyl-benzenesulfonamide (N=2,077). The phenyl scaffold was enriched in molecules with mw < 300 and was present in 15.7% and 9.6% of Fragments and Bioactives, respectively, but only 1.4% of Focused and 0.2% of Diversity molecules. Most of the compounds associated with these scaffolds were purchased from vendors and only scaffold 9 was the result of an internal parallel-synthesis effort.

To evaluate diversity based on molecular similarities among scaffolds, we calculated the distribution of the nearest-neighbor Tanimoto similarity for each scaffold in our library using the ECFP6 fingerprint in Pipeline Pilot (Fig. 2E). The analysis of the topologies in our current library underscores a pre-existing high level of scaffold diversity and provides a reference from which to target novel scaffolds for library enrichment.

Historical performance.

When considering what strategy to implement in order to grow and enhance our compound collection, it was helpful to understand how well the existing library served the needs of our research community. We evaluated the productivity of the SJCRH compound collection from 2006–2019. The library was screened in at least 88 biochemical or cell-based assays, yielding several novel chemical probes including an allosteric activator of pantothenate kinase for the treatment of neurodegeneration [11], a potent and selective hPXR antagonist [12], the first reported inhibitor of MAGE-A11-mediated ubiquitination [13], and a first-in-class clinical candidate for malaria [14]. The screening library contributed to over 60 publications in this time period (Supplementary Table 2), and campaigns were distributed between full deck, repurposing, and smaller scale focused screens.

To further explore historical performance, we investigated 16 biochemical and 12 cell-based screening campaigns that ranged in size from approximately 2,500 to 500,000 compounds tested. The average active rate across all screens was 4% with a range of 0.012 – 14%. Biochemical screens had an overall active rate of 2.5% compared to 5.9% for cell-based screens. The Bioactives sub-library averaged an active rate of 2.1% and 9.3% in biochemical and cell-based screens, respectively, compared to 2.9% and 5.1% for Diversity/Focused compounds. Approximately 550,000 compounds were screened in at least 10 different campaigns and were not active in any assay. Of these, 106,886 compounds were derived from a scaffold present in Dark Matter, a set of compounds defined by the absence of biological activity in at least 100 assays [15]. On the other hand, we detected few promiscuous compounds: 60,700 compounds were active in only 1 campaign, 11,038 compounds in 2–4 campaigns, 567 compounds in 5–10 campaigns, and only 8 compounds, all from the Bioactives, were active in more than 10 campaigns.

The library design strategy

Our retrospective analysis suggested that a topologically diverse, drug-like chemical library acquired from several commercial sources was successful at producing high quality starting chemical matter for lead generation. However, we also learned that the academic nature of our projects, and the heterogeneity of screening modalities employed, would benefit from greater flexibility to screen small sub-sets of the library that enable efficient sampling of biologically relevant chemical space. Moreover, we noted that the future target portfolio at our institution will increasingly include non-traditional targets, such as protein complexes as opposed to catalytic sites, and highly polar binding sites such as those found on proteins binding to nucleic acids.

After careful consideration, our team outlined the following strategic objectives: (1) expansion and focus in the lead-like physicochemical property space to enable efficient identification of starting points attractive to medicinal chemistry and amenable to validation via biophysical methods; (2) leveraging our existing library to create stand-alone structurally diverse subsets, based on specific physicochemical characteristics, that can be screened independently; (3) strategic acquisition or in-house synthesis of compound classes known to be enriched for biological activity (e.g., cyclic peptides, macrocycles, and peptidomimetics) [16] or reported to engage novel modalities such as targeted protein degradation [17]; and (4) continuous acquisition of the latest approved drugs, clinical candidates, and chemical probes, since a significant proportion of our projects involve drug repositioning with the goal of rapid translation into clinical trials. This project will be implemented in multiple stages over several years and adapted based on research needs and capabilities.

Implementation of the library design strategy

The Lead-like set (LL).

According to our physicochemical property analysis, most of the compounds in our chemical library were compliant with RO5. However, due to the well-established tendency to increase mw and logP during lead optimization [18], we recommended the acquisition of compounds occupying a lead-like space (mw < 350; clogP < 3) [19] and composed of scaffolds dissimilar to those present in our current library. This set would also be internally diverse, and of sufficient size (~20K) so that it could be screened alone and still provide a good chance of yielding tractable hits. We did not explicitly filter based on fsp3. Although higher fsp3 is reported to favorably impact solubility and selectivity [20], it is also associated with an increase in synthetic complexity and lower hit rates in HTS campaigns [21].

We began the design of LL by assembling compounds from five major vendors totaling 4,871,704 molecules (Fig. 3A). After removing duplicates, we filtered the set according to the following criteria: 220 ≤ mw ≤ 300, clogP < 3, Tanimoto similarity (ECFP6) < 80% to our existing scaffolds, at least 5 analogs per scaffold, and free of chemical liabilities as assessed by the structural alerts described earlier. We chose a lower range of mw for this acquisition with the expectation of obtaining compounds between 300 and 350 Da at a later time. Relative to the number of scaffolds with 5 analogs, the percent of scaffolds with 10, 15, 20, and 25 analogs was 41%, 24%, 17%, and 13%, respectively. Therefore, we chose to acquire 5 analogs per scaffold to balance the goal of augmenting coverage of available scaffolds while reducing the chance of missing an active scaffold. Examples of compounds that were removed include a dialkylated aniline (PAINS filter) and a BOC protected pyrrolidine (modified Pfizer filter) (Fig. 3B).
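The selection criteria listed above can be expressed as a simple filter. The Python sketch below is a minimal illustration under stated assumptions: the record fields (`id`, `scaffold`, `mw`, `clogp`, `alerts`, `scaffold_max_sim`) are hypothetical names, the descriptors and the maximum similarity of each scaffold to the existing library are assumed to be precomputed, and the actual workflow was run in Pipeline Pilot rather than with this code.

```python
from collections import defaultdict


def lead_like_candidates(compounds, min_analogs=5):
    """Return {scaffold: [compound ids]} for scaffolds that satisfy the LL criteria.

    Each compound is a dict with the hypothetical keys:
    id, scaffold, mw, clogp, alerts, scaffold_max_sim.
    """
    by_scaffold = defaultdict(list)
    for c in compounds:
        if (220 <= c["mw"] <= 300                 # molecular weight window
                and c["clogp"] < 3                # lipophilicity cap
                and c["alerts"] == 0              # no structural alerts
                and c["scaffold_max_sim"] < 0.80):  # scaffold dissimilar to existing library
            by_scaffold[c["scaffold"]].append(c["id"])
    # Keep only scaffolds with enough purchasable analogs.
    return {s: ids for s, ids in by_scaffold.items() if len(ids) >= min_analogs}
```

Applying the analog-count requirement after the per-compound filters means a scaffold is kept only if at least five of its analogs individually pass every property and alert criterion.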

Figure 3. Design of the Lead-like set.

(A) Schematic of the workflow to identify compounds for acquisition. (B) Examples of molecules excluded because of structural alert liabilities. (C) Example of matched pair analogs representing the same scaffold.

From 353,329 compounds, application of the Bemis-Murcko method resulted in 5,868 scaffolds that were topologically distinct from the scaffolds in our existing library and contained ≥5 analogs available for purchase. We clustered these scaffolds using FCFP4 fingerprints with a maximum distance of 0.3 and picked the cluster head to reduce the number of scaffolds to 4,578. FCFP fingerprints abstract atom classes, and in our experience, do a better job than ECFP fingerprints at clustering similar scaffolds together. Five matched pairs (Fig. 3C) were randomly selected for the 4,578 scaffolds and the resulting 22,890 compounds were presented for visual inspection and selection by experienced medicinal chemists in our department using a shopping cart web application developed in-house. This application enabled chemists to deselect unattractive analogs and replace them with more attractive representatives or reject the whole scaffold cluster (Supplementary Fig. 2). Using this process, we identified 18,459 molecules from 3,819 novel scaffolds for purchase.
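The clustering and analog-picking steps can be sketched as follows. The specific clustering algorithm is not stated in the text, so the greedy leader scheme below is a stand-in chosen for illustration, not the method actually used; fingerprints are represented as plain Python sets of on-bits, and all function names are hypothetical.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leader_cluster(fps, max_distance=0.3):
    """Greedy leader clustering (an assumed stand-in for the clustering step).

    A fingerprint becomes a new cluster head only if its Tanimoto distance
    (1 - similarity) to every existing head exceeds max_distance.
    Returns the indices of the cluster heads.
    """
    heads = []
    for i, fp in enumerate(fps):
        if all(1.0 - tanimoto(fp, fps[h]) > max_distance for h in heads):
            heads.append(i)
    return heads


def pick_analogs(analogs, k=5, seed=0):
    """Randomly pick k analogs for a scaffold (all of them if k or fewer are available)."""
    rng = random.Random(seed)
    return rng.sample(analogs, k) if len(analogs) > k else list(analogs)


toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {9, 10}]
print(leader_cluster(toy_fps))             # the second scaffold joins the first head
print(pick_analogs(list(range(12))))       # five randomly chosen analogs
```

Picking five analogs for each of the 4,578 cluster heads gives 4,578 × 5 = 22,890 compounds, the number presented to the medicinal chemists for review.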

We then interrogated LL and the pooled vendor library from which it was selected (“Vendor”) using the same analytical framework described earlier. From the perspective of physicochemical properties and alerts, LL represented a sub-set of Vendor that was significantly restricted with respect to all parameters, except for fsp3 where it was enriched (Fig. 4A). Because we did not explicitly select for compounds with high fsp3, this enrichment must have been a side-effect of selecting scaffolds that were dissimilar to the ones in our existing library. We concluded that high fsp3 in the core of a molecule as opposed to the periphery was not only acceptable but even desirable. The scaffolds of such compounds project bond vectors in a different geometric orientation compared to more two-dimensional ones – and this characteristic is an important goal of scaffold diversity. Consistent with our design principles, we observed significant reductions in mw and lipophilicity. The aring parameter was less affected compared to others. It is interesting to note that the plot of physicochemical properties for Vendor was nearly identical to the one for our existing library (Fig. 1C), even though most of the commercial compounds in the SJCRH collection were purchased over a decade ago. We confirmed that the scaffolds in LL showed a low degree of similarity to each other and to the scaffolds in our existing library, with mean Tanimoto similarity of 0.54 and 0.52, respectively (Fig. 4B).

Figure 4. Analysis of the Lead-like and Gram-negative sets and comparison to biologically relevant chemical space.

(A) Radar plot (top, contoured at the 25th, 50th, 75th, and 90th percentiles) and table of mean values (bottom) for the nine physicochemical properties and structural alerts calculated from Vendor and Lead-like. The radar plot and table were defined according to Fig. 1C. (B) Distribution of the nearest neighbor Tanimoto similarity (ECFP_6) value calculated for each scaffold in Lead-like vs. itself (top) or vs. each scaffold in the existing SJCRH library (bottom). The average similarity value is highlighted in each plot. (C) Density plot for Vendor, Lead-like, and GramNeg* (Gram-negative set molecules with clogP – clogD > 2) compounds using our LDA model. Densities are estimated at the 50th (solid) and 95th (dotted) percentiles. (D) The percent of scaffolds in 4 reference chemical spaces (Patents, ≥ 2016; Patents, <2016; Dark Matter, and CHEMBL) that are within a Tanimoto similarity distance ≥0.8 (ECFP_6) to any scaffold in Lead-like, SJCRH existing plus Lead-like (‘All-SJ’), and Vendor.

We then projected Vendor and LL onto the LDA chemical space described earlier (Fig. 4C). Consistent with the radar plot analysis, the center of the Vendor distribution (−0.43, −0.33) was close to the center of Diversity (−0.49, −0.55), whereas LL (0.68, −0.30) occupied an intermediate position between Diversity and Fragments.

The Gram-negative set (GramNeg).

LL contained molecules that were slightly lower in mw and lipophilicity than the median drug-like molecule, and therefore, was suitable for screening most biological targets. However, the cellular localization or binding sites of some targets might restrict ligands to an atypical range of physicochemical properties. One example is the class of small molecules that accumulate in Gram-negative bacteria: these compounds have low globularity and are rigid, amphiphilic, and tend to possess an amine [22]. To build the Gram-negative focused set, GramNeg, we filtered our existing library for compounds that were free of structural alerts and had clogD < 2. We used scaffold-based clustering as before to identify 10,000 diverse molecules and added 5,500 molecules that possessed a single primary, secondary, or tertiary amine. The difference, clogP minus clogD, reflects the change in lipophilicity as a function of ionization, and is large for amphiphilic molecules. Interestingly, the larger this metric, the farther north compounds were projected on our LDA chemical space (Fig. 4C). The mean coordinate for GramNeg molecules with clogP minus clogD > 2 (GramNeg*) was (−0.29, 1.60), indicating access to a region of the LDADim2 dimension that was well sampled by only Bioactives in our original library. Globularity was low for this set, with a mean value equal to 0.03, while the average number of rotatable bonds was 5.1, which is near the reported cutoff for higher likelihood of accumulation. This work is an example of fulfilling our second library design objective: leveraging our existing collection to create a discrete, stand-alone compound set that was tailored to match certain physicochemical criteria according to a well-defined biological rationale.
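To give a sense of scale for the clogP minus clogD metric, the Python sketch below uses the textbook relationship for a single basic center, assuming that only the neutral species partitions into octanol. This simplified model is an assumption introduced for illustration; the descriptor calculations in the study were performed in Pipeline Pilot and may differ for multiprotic or zwitterionic compounds.

```python
import math


def logp_minus_logd_base(pka, ph=7.4):
    """logP - logD for a monoprotic base, assuming only the neutral form partitions.

    Under that assumption, logD = logP - log10(1 + 10**(pKa - pH)).
    """
    return math.log10(1.0 + 10.0 ** (pka - ph))


# A non-ionized compound (very low pKa) shows essentially no difference,
# whereas a typical aliphatic amine is mostly protonated at pH 7.4.
print(logp_minus_logd_base(0.0))
print(logp_minus_logd_base(9.4))
```

Under this model, a difference greater than 2 log units at pH 7.4 corresponds to a basic pKa above roughly 9.4, which is consistent with the amine-containing compounds targeted by this set.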

Targeting Novel Modalities.

Academic laboratories have been at the forefront of developing innovative strategies aimed to: (a) enable broader or more efficient exploration of chemical space (e.g., diversity- or biology- oriented synthesis [23,24] and DNA-encoded libraries [25]); and (b) target biomolecules or biological processes traditionally considered “undruggable” (e.g., small-molecule microarrays [26], protein-protein stabilizers [27,28], and targeted protein degradation). Pediatric cancers, a primary research focus at our institution, are often driven by transcription factors or fusion oncoproteins that are difficult to drug using conventional small molecule strategies. Therefore, in accord with our third library design objective, we made a strategic decision to explore targeted protein degradation strategies by building a dynamic molecular glue library. To date, we have synthesized more than 1,000 compounds designed to engage cereblon (CRBN), an adaptor protein for the cullin 4A RING E3 ligase complex. We leveraged the extensive literature knowledge around thalidomide, a known CRBN binder and the original molecular glue, and utilized a combination of modern medicinal chemistry principles and structure-based drug design to conserve the minimum pharmacophore features necessary for CRBN engagement while maximizing the chemical diversity displayed at the CRBN substrate binding surface. Encouraged by the initial screening efforts, we are now expanding the library to include molecules that engage other E3 ligases, such as MAGE A11 [13], DCAF15 [29], VHL1 [30], and MDM2 [31].

Scaffold novelty.

During this project, we were inspired to ask how the scaffolds in our existing library or new LL set compared to the scaffolds in other biologically relevant compound sets. We extracted Bemis-Murcko scaffolds from the following reference sets: (a) CHEMBL, a manually curated database of bioactive molecules [32], (b) Dark Matter, and (c) US patents granted or filed between 1991–2016 and 2016–2020 (more recent) that involve small molecule organic compounds with reported biological activity. We calculated the percentage of scaffolds in these 4 sets that were within the “neighborhood” (ECFP6 Tanimoto similarity ≥ 0.80) of scaffolds in LL, the updated SJCRH library (‘All-SJ’), and Vendor (representing the space of commercially available scaffolds) (Fig. 4D). Scaffold coverage should increase with the number of scaffolds present in the reference library, so we expected the coverage of LL to be low. However, All-SJ only covered 6–9% of scaffolds with reported biological activity (CHEMBL and Patents). Surprisingly, the scaffolds in our updated library covered nearly 28% of Dark Matter scaffolds. Moreover, even if we obtained the nearly 1.4 million scaffolds in Vendor, we would still access no better than 26% of the scaffolds with reported biological activity. While we found this limitation of commercial libraries surprising and disappointing, one could only speculate about the extent of potentially missed business and research opportunities.
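The coverage metric used in this comparison can be sketched in Python as below. This is a minimal illustration under assumptions: fingerprints are represented as sets of on-bits rather than the ECFP6 fingerprints computed in Pipeline Pilot, the function names are hypothetical, and the brute-force pairwise comparison would need a more efficient search for sets containing millions of scaffolds.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def coverage(reference_fps, library_fps, threshold=0.80):
    """Percent of reference scaffolds with at least one library scaffold
    at or above the similarity threshold (the "neighborhood")."""
    if not reference_fps:
        return 0.0
    covered = sum(
        1 for ref in reference_fps
        if any(tanimoto(ref, lib) >= threshold for lib in library_fps)
    )
    return 100.0 * covered / len(reference_fps)


# Toy example: one of the two reference scaffolds has a close library neighbor (5/6 ≈ 0.83).
reference = [{1, 2, 3, 4, 5}, {10, 11}]
library = [{1, 2, 3, 4, 5, 6}, {20}]
print(coverage(reference, library))
```

Because coverage is computed from the point of view of the reference set, it can only increase as scaffolds are added to the library, which is why a small set such as LL is expected to cover less than the full collection.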

Concluding Remarks

Recently, several pharmaceutical companies have disclosed strategies to enhance their compound collections [33–35]. Common themes between those efforts and the approach documented here include: (a) a desire to make the chemical library dynamic through continuous curation and evolution; (b) a shift to more lead-like versus drug-like bias in physicochemical properties; (c) the elimination of compounds with problematic chemical moieties; and (d) a push to cover less sampled chemical space such as macrocycles or natural-product derived compounds. However, several distinctions between academic and industrial screening collections are evident. Our library was mainly populated by commercial compounds with moderate analog density, whereas a pharmaceutical company can achieve greater balance between proprietary and non-proprietary compounds and higher analog density due to large internal medicinal chemistry campaigns and library acquisitions designed specifically to capture novelty. Whereas chemical novelty and patentability are essential for commercial purposes in industry, freedom to operate around a scaffold is generally not a requirement for the basic research and preclinical proof-of-concept studies carried out in academia. Finally, heterogeneity in screening modalities and limited resources often restrict the size of academic screening campaigns. This constraint incentivizes a “library of libraries” design whereby one or more discrete sub-sets of the library can be screened independently according to the specific needs of a project. We have illustrated three examples in which discrete sub-libraries were generated to serve our academic research interests. Based on our experience over the last decade, and with extensive input from our stakeholders, we have established a paradigm that will help guide the evolution of our screening library to meet the challenges and demands for the next decade of screening in academia.

Supplementary Material

Supp Material: Fig S1, Fig S2, Tbl 1, Tbl 2, Model Summary

Acknowledgements

We are grateful for the support of the American Lebanese Syrian Associated Charities (ALSAC), and we would like to thank the patients, their families, and the staff at our institution. We thank Dalton Sides, Shalandus Garrett, Lei Yang, Brandon Young, P. Jake Slavish, Jaeki Min, and Julianne Bryan for their contributions to the SJCRH next-generation chemical library project.

References

  • 1. Arrowsmith CH, et al. (2015). The promise and peril of chemical probes. Nat Chem Biol, 11(8), 536–541. doi: 10.1038/nchembio.1867
  • 2. Bunnage ME, Chekler EL, & Jones LH (2013). Target validation using chemical probes. Nat Chem Biol, 9(4), 195–199. doi: 10.1038/nchembio.1197
  • 3. Frye SV (2010). The art of the chemical probe. Nat Chem Biol, 6(3), 159–161. doi: 10.1038/nchembio.296
  • 4. Lipinski CA, Lombardo F, Dominy BW, & Feeney PJ (2001). Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev, 46(1–3), 3–26. doi: 10.1016/s0169-409x(00)00129-0
  • 5. Baell JB, & Holloway GA (2010). New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J Med Chem, 53(7), 2719–2740. doi: 10.1021/jm901137j
  • 6. Lemoff A, & Yan B (2008). Dual detection approach to a more accurate measure of relative purity in high-throughput characterization of compound collections. J Comb Chem, 10(5), 746–751. doi: 10.1021/cc800100g
  • 7. Blaxill Z, Holland-Crimmin S, & Lifely R (2009). Stability through the ages: the GSK experience. J Biomol Screen, 14(5), 547–556. doi: 10.1177/1087057109335327
  • 8. Blake JF (2005). Identification and evaluation of molecular properties related to preclinical optimization and clinical fate. Med Chem, 1(6), 649–655. doi: 10.2174/157340605774598081
  • 9. Congreve M, Carr R, Murray C, & Jhoti H (2003). A ‘rule of three’ for fragment-based lead discovery? Drug Discov Today, 8(19), 876–877. doi: 10.1016/s1359-6446(03)02831-9
  • 10. Bemis GW, & Murcko MA (1996). The properties of known drugs. 1. Molecular frameworks. J Med Chem, 39(15), 2887–2893. doi: 10.1021/jm9602928
  • 11. Sharma LK, et al. (2018). A therapeutic approach to pantothenate kinase associated neurodegeneration. Nat Commun, 9(1), 4399. doi: 10.1038/s41467-018-06703-2
  • 12. Lin W, et al. (2017). SPA70 is a potent antagonist of human pregnane X receptor. Nat Commun, 8(1), 741. doi: 10.1038/s41467-017-00780-5
  • 13. Yang SW, et al. (2020). Structural basis for substrate recognition and chemical inhibition of oncogenic MAGE ubiquitin ligases. Nat Commun, 11(1), 4931. doi: 10.1038/s41467-020-18708-x
  • 14. Jimenez-Diaz MB, et al. (2014). (+)-SJ733, a clinical candidate for malaria that acts through ATP4 to induce rapid host-mediated clearance of Plasmodium. Proc Natl Acad Sci U S A, 111(50), E5455–5462. doi: 10.1073/pnas.1414221111
  • 15. Wassermann AM, et al. (2015). Dark chemical matter as a promising starting point for drug lead discovery. Nat Chem Biol, 11(12), 958–966. doi: 10.1038/nchembio.1936
  • 16. Zhang MQ, & Wilkinson B (2007). Drug discovery beyond the ‘rule-of-five’. Curr Opin Biotechnol, 18(6), 478–488. doi: 10.1016/j.copbio.2007.10.005
  • 17. Schapira M, Calabrese MF, Bullock AN, & Crews CM (2019). Targeted protein degradation: expanding the toolbox. Nat Rev Drug Discov, 18(12), 949–963. doi: 10.1038/s41573-019-0047-y
  • 18. Perola E (2010). An analysis of the binding efficiencies of drugs and their leads in successful drug discovery programs. J Med Chem, 53(7), 2986–2997. doi: 10.1021/jm100118x
  • 19. Teague SJ, Davis AM, Leeson PD, & Oprea T (1999). The Design of Leadlike Combinatorial Libraries. Angew Chem Int Ed Engl, 38(24), 3743–3748.
  • 20. Lovering F, Bikker J, & Humblet C (2009). Escape from flatland: increasing saturation as an approach to improving clinical success. J Med Chem, 52(21), 6752–6756. doi: 10.1021/jm901241e
  • 21. Hansson M, et al. (2014). On the Relationship between Molecular Hit Rates in High-Throughput Screening and Molecular Descriptors. J Biomol Screen, 19(5), 727–737. doi: 10.1177/1087057113499631
  • 22. Richter MF, et al. (2017). Predictive compound accumulation rules yield a broad-spectrum antibiotic. Nature, 545(7654), 299–304. doi: 10.1038/nature22308
  • 23. Burke MD, & Schreiber SL (2004). A planning strategy for diversity-oriented synthesis. Angew Chem Int Ed Engl, 43(1), 46–58. doi: 10.1002/anie.200300626
  • 24. van Hattum H, & Waldmann H (2014). Biology-oriented synthesis: harnessing the power of evolution. J Am Chem Soc, 136(34), 11853–11859. doi: 10.1021/ja505861d
  • 25. Goodnow RA Jr., Dumelin CE, & Keefe AD (2017). DNA-encoded chemistry: enabling the deeper sampling of chemical space. Nat Rev Drug Discov, 16(2), 131–147. doi: 10.1038/nrd.2016.213
  • 26. Koehler AN, Shamji AF, & Schreiber SL (2003). Discovery of an inhibitor of a transcription factor using small molecule microarrays and diversity-oriented synthesis. J Am Chem Soc, 125(28), 8420–8421. doi: 10.1021/ja0352698
  • 27. Andrei SA, et al. (2017). Stabilization of protein-protein interactions in drug discovery. Expert Opin Drug Discov, 12(9), 925–940. doi: 10.1080/17460441.2017.1346608
  • 28. Stevers LM, et al. (2018). Modulators of 14-3-3 Protein-Protein Interactions. J Med Chem, 61(9), 3755–3778. doi: 10.1021/acs.jmedchem.7b00574
  • 29. Bussiere DE, et al. (2020). Structural basis of indisulam-mediated RBM39 recruitment to DCAF15 E3 ligase complex. Nat Chem Biol, 16(1), 15–23. doi: 10.1038/s41589-019-0411-6
  • 30. Buckley DL, et al. (2012). Targeting the von Hippel-Lindau E3 ubiquitin ligase using small molecules to disrupt the VHL/HIF-1alpha interaction. J Am Chem Soc, 134(10), 4465–4468. doi: 10.1021/ja209924v
  • 31. Schneekloth AR, Pucheault M, Tae HS, & Crews CM (2008). Targeted intracellular protein degradation induced by a small molecule: En route to chemical proteomics. Bioorg Med Chem Lett, 18(22), 5904–5908. doi: 10.1016/j.bmcl.2008.07.114
  • 32. Gaulton A, et al. (2017). The ChEMBL database in 2017. Nucleic Acids Res, 45(D1), D945–D954. doi: 10.1093/nar/gkw1074
  • 33. Follmann M, et al. (2019). An approach towards enhancement of a screening library: The Next Generation Library Initiative (NGLI) at Bayer - against all odds? Drug Discov Today, 24(3), 668–672. doi: 10.1016/j.drudis.2018.12.003
  • 34. Boss C, et al. (2017). The Screening Compound Collection: A Key Asset for Drug Discovery. Chimia (Aarau), 71(10), 667–677. doi: 10.2533/chimia.2017.667
  • 35. Saha A, et al. (2018). An Analysis of Different Components of a High-Throughput Screening Library. J Chem Inf Model, 58(10), 2057–2068. doi: 10.1021/acs.jcim.8b00258
