Despite greater complexity than biochemical assays and substantial data imbalance, AL automatically identifies subsets of compounds that maximize predictive performance on external cytotoxic readouts. Systematic queries reveal the underlying reasons and future perspectives.
Abstract
The NCI-60 cancer cell line screening panel has provided insights for the development of subtype-specific chemical therapies and for drug repurposing. By extracting patterns that link chemical structure to cytotoxicity, virtual screening can complement high-throughput assay platforms and improve bioactive compound discovery rates through computational prefiltering of candidate compound libraries. Many groups report high prediction performance for computational models of NCI-60 data when using cross-validation or similar techniques, yet prospective therapy development in novel cancers may have little or no such data, and may lack the resources to perform hit identification with large compound libraries. In contrast to bulk screening and analysis, the active learning methodology identifies compounds for screening in small batches and updates computational models iteratively, yielding predictive models from a minimum number of compounds and, importantly, clarifying the data volumes at which limits in predictive ability are reached. Here, in replicated per-cell line experiments using 50% of the data (∼20 000 compounds) as the external prediction target, predictive limits are reproducibly reached after systematic selection of 10–30% of the incorporable half. The pattern was consistent across all 60 cell lines. Limits of predictability correlate with the doubling times of cell lines and with the number of cellular response discontinuities (activity cliffs) present per cell line. Organization into chemical scaffolds delineated degrees of predictive challenge. These results provide key insights for strategies to develop new inhibitors for existing cell lines and for future automated therapy selection in personalized oncotherapy.
Introduction
Malignant neoplasms, among the leading causes of death, are being shown through advances in histopathology and genetics to have increasingly individual character.1 Most high-frequency tumor types have been reasonably well characterized in general, though many subtypes remain poorly characterized at the molecular level, including clear cell ovarian cancer2 and splenic tumors such as splenic marginal zone lymphoma.3
The NCI-60 human tumor cell lines4 are 60 patient-derived cell lines from nine organs and the blood that have been cultured and systematically tested against a large chemical library for inhibitory activity. The dataset includes multiple cell line samples per primary tumor site, and serves as a showcase for understanding inhibitor development on a per-line basis. The panel has been used to identify factors for survival prediction in non-small cell lung cancer (NSCLC)5 and to identify microphthalmia-associated transcription factor as a lineage oncogene amplified in metastatic melanoma, with correlation to the survival of melanoma patients.6
Patient response to chemotherapy and emerging onco-immunotherapy strategies are highly variable.7–14 High-throughput experiments by automated culture, incubation, titration, and response to modulation pose a future possibility to semi-automate the search for tumor-specific compounds. However, having the instrumentation for each step is by itself still not sufficient. To minimize the cost of tumor-specific inhibitor discovery while maximizing actionable knowledge, computational virtual screening is an additional cog needed in the automation engine.
Many efforts to model the NCI-60 dataset have been reported. Staunton et al. used weighted voting classification to classify cell lines as either sensitive or resistant to individual compounds by analyzing gene expression data of the lines, finding it to be amenable for a subset of compounds.15 Lee et al. applied a co-expression extrapolation approach to extrapolate predictions to cell lines not included in the original NCI-60 database, where they identified a novel agent with activity against bladder cancer.16 Riddick and colleagues studied cell line sensitivity prediction from gene expression17 using the random forest technique,18 in which each decision tree in a collection is given a subset of the available observations and builds a set of sequential rules mapping observations or features to outcomes (in this case, gene expression value thresholds leading to chemical sensitivity predictions). More recently, Singh et al. compared different machine learning algorithms for compound inhibitory bioactivity prediction, reporting the ability to classify a compound as anti-cancer or not based on the number of cell lines inhibited,19 where the maximum-margin support vector machine algorithm20 was successful. Finally, Xia et al. applied deeply-layered neural networks (deep learning) to predict the stratified classification of growth inhibition from compound descriptors and cell line gene expression, reporting that the compound descriptors provided the largest contribution to predictive ability.21 The finding of dominant weight from chemical description in modeling was concordant with two other chemistry–biology descriptor value investigations.22,23
In many virtual screening studies, model performance is estimated through cross-validation (CV), in which half or more of the observations are made available to train and tune a predictive model and the remaining data is used to evaluate the model's predictive performance. As the number of observations grows, the amount of data made available for model fitting under CV also grows; as a corollary, very large datasets often yield high reported predictive performances. Yet as discussed above, tumors display individual character, and if we are to advance to a level of efficient automated analysis and therapy discovery for individual tumors, systems that do not require massive datasets in order to be predictive are needed.
In this scope, recent efforts in ligand-target modeling have shown that highly predictive models can be obtained using only a small, strategically selected subset of all observations available for model fitting. Reker et al. demonstrated this approach for kinases and GPCRs,24 where 5–10% of the ligand-target data was sufficient to achieve close to maximum predictive performance on the full dataset. Rakers et al. then challenged this idea by withholding targets and their ligand bioactivity profiles, demonstrating that maximum predictive ability on the external targets was achieved after carefully selecting 10–30% of the data available for training.25 In contrast to batch selection of compounds, both of these reports achieved their results by incremental cycles of ligand-target selection followed by model calculation, known as the active learning methodology. It is unknown, however, how well such a strategy can perform on cellular-level readouts, where the perturbed system to be modeled contains many latent signals and events implicitly incorporated into the cytotoxic readout, in contrast to the biochemical assay readouts used in modeling single targets.
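The cycle the two reports describe (pick an example, query its readout, refit the model) can be sketched as a generic loop. All names here (`active_learning_cycle`, `oracle`, `fit`, `rank`) are illustrative stand-ins, not drawn from the cited implementations:

```python
import random

def active_learning_cycle(pool, oracle, fit, rank, budget, seed_size=2):
    """Generic active learning loop: seed, then iteratively pick, query, refit.

    pool   : ids of unlabeled examples (here, compounds)
    oracle : id -> label (the assay readout for a picked compound)
    fit    : {id: label} -> model
    rank   : (model, candidates) -> candidates ordered most-informative-first
    budget : total number of labels to acquire before stopping
    """
    unlabeled = list(pool)
    random.shuffle(unlabeled)
    labeled = {c: oracle(c) for c in unlabeled[:seed_size]}  # small random seed
    unlabeled = unlabeled[seed_size:]
    model = fit(labeled)
    while len(labeled) < budget and unlabeled:
        pick = rank(model, unlabeled)[0]   # the strategy decides the next query
        unlabeled.remove(pick)
        labeled[pick] = oracle(pick)       # "screen" the compound
        model = fit(labeled)               # the model is refit every cycle
    return model, labeled
```

Any selection strategy, from random picking to uncertainty-based querying, fits this loop by swapping the `rank` callable.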
Hence, medical needs and these novel iterative chemical selection strategies pose an important question – can cellular-level anticancer activities in each tumor type and cell line be extracted as patterns in a minimum-size model when an algorithm strategically selects the next chemical to incrementally include in the data to be modeled? In this study, we address this question by applying the active learning concept to each of the NCI-60 cell lines and assessing the amount of data necessary to converge on stable prediction performance for compound sets larger than the training data. In short, we observe that most cell lines achieve predictive convergence after strategic selection of 10–30% of the available compounds and their anticancer activities. We also identified quantitative evidence that correlated with limits of predictive ability. The results provide important data for rational expectations of machine learning to identify novel compounds for existing cancer cell lines, and should spur advances in culture and assay robotics that can be employed in personalized therapy discovery or recommendation.
Results
Compound library background and selectivity
We first investigated compound structure and selectivity to help contextualize the modeling results. Here, we incorporated additional cytotoxicity datasets available with the NCI-60 resource, for a total of 159 cell lines. Table 1 lists the ten compounds with the most pan-cancer inhibitory activity on this expanded set. Whereas the global dataset has a 90th-percentile molecular weight of 472 Da (Table S1†), the ten pan-cancer inhibitors listed are generally larger than this.
Table 1. The 10 compounds with the most cytotoxic activity annotations in the NCI-60 data. Degree is the number of cell lines for which a compound is classified as inhibitory by the threshold used in this report (GI50 < 1 μM). Here, the expanded NCI-60 dataset containing 159 cell lines is used; the additional cell lines have fewer compounds tested for cytotoxicity and therefore partially incomplete data. Molecular weights are calculated after removal of salts. Circular fingerprints were computed with a radius of 2 and hashed to 4096 bits.
| Compound ID | Compound name | Pan-cancer degree | Mol. weight | MACCS bits | ECFP bits |
| --- | --- | --- | --- | --- | --- |
| 3053 | Dactinomycin | 143 | 1255 | 64 | 92 |
| 82151 | Daunorubicin hydrochloride | 134 | 528 | 50 | 66 |
| 49842 | Vinblastine sulfate hydrate | 127 | 811 | 64 | 99 |
| 125973 | Paclitaxel | 108 | 854 | 53 | 83 |
| 123127 | Doxorubicin hydrochloride | 102 | 544 | 54 | 68 |
| 740 | Methotrexate | 97 | 454 | 54 | 59 |
| 249992 | Amsidine | 91 | 393 | 51 | 42 |
| 67574 | Vincristine sulfate | 90 | 825 | 66 | 102 |
| 325663 | Saframycin A | 83 | 563 | 51 | 53 |
| 376128 | Dolastatin 10 | 81 | 785 | 55 | 87 |
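The MACCS and ECFP columns in Table 1 count set bits in each fingerprint. The folding step that maps substructures into the 4096-bit vector can be sketched as below; real circular fingerprints derive the integer identifiers from iteratively hashed atom environments (not reproduced here), so `fold_to_bits` only illustrates why distinct substructures can share a bit:

```python
def fold_to_bits(substructure_ids, n_bits=4096):
    """Fold integer substructure identifiers into a fixed-width bit vector,
    as when hashing circular fingerprints to 4096 bits. Distinct identifiers
    can collide on the same bit, so counting set bits can undercount the
    number of distinct substructures present in a molecule."""
    bits = [0] * n_bits
    for ident in substructure_ids:
        bits[ident % n_bits] = 1
    return bits
```

Because of such collisions, a large complex molecule like dactinomycin can register fewer on-bits than its substructure diversity would suggest.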
We searched the PubChem database28 for biological backgrounds of several non-selective compounds. The most bioactive compound, dactinomycin, is a peptidomimetic macrocycle isolated from the bacterium Streptomyces parvulus whose therapeutic antineoplastic function comes from blocking the transcription of DNA by intercalating between adjacent guanine–cytosine base pairs.29 It is used in the treatment of solid tumors in children and choriocarcinoma in adult women. Doxorubicin is a similar agent which intercalates base pairs in DNA, but also inhibits topoisomerase II, subsequently preventing the ligation of a nucleotide strand after double-strand breaks and DSB repair. It is used in the therapy of lymphoma, leukemia, multiple types of sarcomas, and solid-organ neoplasms.30
We also asked whether five structurally-diverse compounds among the ten most cytotoxic had any common structure–toxicity relationship. As shown in Fig. 1, the compounds commonly inhibit 50 cell lines even though their structures are diverse; by contrast, the rubicin compounds in Table 1 differ only by a methyl versus an ethyl alcohol extending from the main macrocycle. We would therefore expect the rules continuously updated by active learning to expand as inhibitors are increasingly incorporated.
Fig. 1. The structures of five highly pan-cancer inhibiting compounds found in Table 1 and cell lines they inhibit. Data used for Venn diagram analysis has been extended to include cell lines which were not exhaustively profiled. Suggestions of selective inhibition of cell lines are potential artefacts of data incompleteness. When restricted to the 60 cell lines in which 20 000 or more compounds were all tested (upper right inset), more than 200 compounds are found to be universally active.
Based on dactinomycin's pan-cancer activity, we searched for other peptidomimetic macrocycles, finding 230 compounds in total with at least two peptide substructures in rings. Finally, extending the list to the 20 most cytotoxic compounds, we found that at least 7 (35%) of them including dactinomycin and doxorubicin were clinically-available anticancer agents.
Evaluation of compound representations and picking strategies
NCI-60 compounds with GI50 values of 1 μM or stronger (lower) were considered actives, and compounds with GI50 values of 10 μM or weaker (higher) were considered inactives. After data cleaning by quality controls (see Methods), datasets contained approximately 6–8% active annotations per cell line (see Fig. S1† for example distributions). For each line, half of the actives and half of the inactives were randomly selected and held out as an external prediction challenge; this process was repeated five times to minimize artifacts from any one randomized trial. The average normalized similarity between training and external prediction compounds was 0.33 (range 0–1, Fig. S2†). Compared to previous NCI-60 studies using cross-validation, the degree of external prediction challenge is substantially elevated, and compared to previous chemogenomic active learning studies, the substantial skew toward inactives is a more realistic setting for discovery contexts.
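In pGI50 units (−log10 of molar GI50), these thresholds become 6.0 and 5.0. A minimal sketch of the labeling and the per-class half split follows; the function names are ours, and dropping compounds in the intermediate 1–10 μM band is our assumption about how unlabeled cases are handled:

```python
import random

def label_by_pgi50(pgi50, active_at=6.0, inactive_at=5.0):
    """GI50 <= 1 uM (pGI50 >= 6) -> active; GI50 >= 10 uM (pGI50 <= 5) -> inactive.
    Compounds in the intermediate band get no label (assumed left out)."""
    if pgi50 >= active_at:
        return "active"
    if pgi50 <= inactive_at:
        return "inactive"
    return None

def half_split(ids_by_label, seed=0):
    """Hold out half of the actives and half of the inactives as the
    external prediction challenge, as done per cell line in this study."""
    rng = random.Random(seed)
    train, external = [], []
    for ids in ids_by_label.values():
        ids = sorted(ids)
        rng.shuffle(ids)
        half = len(ids) // 2
        train += ids[:half]
        external += ids[half:]
    return train, external
```

Repeating `half_split` with five different seeds reproduces the five randomized train/external trials described above.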
NCI-60 lines were evaluated through the active learning technique. The technique is considered active because models are continuously adapted to new patterns presented during the iterative addition of examples. In this context, the examples from which to extract patterns are compound substructures or properties (e.g., sulfonamides or log P) together with their inhibitory label with respect to a cancer cell line. We use the random forest technique mentioned in the Introduction for model fitting, and test three compound picking strategies: random selection, selection by maximum predictive uncertainty (below, “curiosity”), and selection by maximum inhibitory activity vote (below, “greedy”). Unless noted otherwise, the Matthews correlation coefficient (MCC) and F-measure (F1) were used as the principal evaluation metrics, respectively emphasizing balanced performance and the ability to successfully predict bioactive compounds.
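Assuming the fitted random forest exposes a predicted-active probability for each candidate (the fraction of trees voting "active"), the three picking strategies can be sketched as follows; `pick_batch` and its arguments are illustrative names, not the study's implementation:

```python
import random

def pick_batch(pool, proba, strategy, batch=1, rng=random.Random(0)):
    """pool: candidate compound ids; proba: id -> predicted P(active)."""
    if strategy == "random":
        return rng.sample(list(pool), batch)
    if strategy == "curiosity":
        # maximum predictive uncertainty: vote fraction closest to 0.5
        return sorted(pool, key=lambda c: abs(proba[c] - 0.5))[:batch]
    if strategy == "greedy":
        # maximum inhibitory-activity vote: highest P(active) first
        return sorted(pool, key=lambda c: proba[c], reverse=True)[:batch]
    raise ValueError(f"unknown strategy: {strategy}")
```

The curiosity picker queries compounds the current model is least sure about, while the greedy picker queries those most confidently predicted active.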
An example of the adaptive performance gain in actively learned models is shown in Fig. 2 for the SK-MEL-2 melanoma cell line (Fig. 2a for MCC and 2b for F1). In a first retrospective experiment, predictive ability on the full training set is evaluated; that is, from the pool of compound-bioactivity measurements that can be selected for inclusion in a model, instances are selected and the resulting model is used to predict bioactivity on the full set. This result explains the amount of data needed to cover all compound-bioactivity patterns that are present in the training data, in a high-resolution manner that is not clarified by standard machine learning evaluation approaches. Further, it is clear that systematically picking compounds (greedy or curiosity strategies) is more efficient than random selection of compounds.
Fig. 2. Iterative prediction improvement by active learning. [Top] Models with active picking methods achieve near-maximum performance on the melanoma SK-MEL-2 cell line external prediction data (approximately 20 000 compounds) using as little as 10% (2000) of available training data points strategically selected. Prediction model evaluation shown using MCC (a) and F-measure (b) metrics. For each picking strategy, its performance in predicting the full training dataset (retrospective) and the full external dataset (semi-prospective) are shown in paired colors. “n” refers to the number of experiment replicates for a picking method and prediction target, and “D” refers to the data count forming a curve plot. [Bottom] Active projection tracing of model performance dynamics using curiosity (c) and greedy (d) pickers. One execution of active learning using each picker is shown. Values in trajectories indicate the amount (%) of data used when a particular TNR/TPR performance is achieved. In the face of strong data imbalance, actively learned models demonstrate early inactive detection ability, whereas detection of actives (true positive rate) gradually improves as informative bioactive examples whose descriptors extrapolate to external compounds are incorporated.
In parallel to evaluation on the training data, we secondly and more critically ask how well the model predicts on the external dataset, which can never be incorporated into the model and which is always at least twice as large as the training data. This challenge asks how well the patterns in the anticancer compound-response training data extrapolate to compounds whose activities are treated as pseudo-unknown (semi-prospective active learning). Fig. 2 thus summarily shows that the patterns extracted using 10% (∼2000) of the available compounds generalize well to the remainder of the training set, and that this data volume roughly corresponds to the maximum MCC and F1 scores on the external data. Adding more compounds to the models does not improve semi-prospective prediction.
We further determined the underlying mechanisms of MCC/F1 performance improvement by visualizing the dynamics of the true positive rate and true negative rate (inhibitor and non-inhibitor detection rates, respectively), known as the active projection method.26 Fig. 2c and d respectively show the trajectories of SK-MEL-2 external dataset TPR/TNR dynamics for the curiosity and greedy compound picking strategies. In early iterations of compound selection and model calculation, all models first achieved high TNR but with poor detection of actives (TPR), indicating that models simply predicted most of the compounds as inactive. As the active learning cycle continues, active compounds and their patterns are increasingly uncovered, and because those patterns are transferable to the compounds in the external data, TPR improves. The active projections complement the time-series analyses by showing that any small gains in MCC made by large increases in data volume are attributable to gains in bioactive compound detection (TPR), though such gains are minor at best beyond the 10% training data volume. An active projection deconstruction of behavior under random compound selection is provided as Fig. S3,† and projections comparing different runs of the greedy and curiosity pickers are provided in Fig. S4.†
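For reference, the quantities traced in an active projection, together with the MCC used throughout, reduce to confusion-matrix counts; a minimal sketch (`projection_point` is our naming):

```python
import math

def projection_point(y_true, y_pred):
    """Return (TNR, TPR, MCC) for one active learning iteration; plotting
    (TNR, TPR) across iterations gives an active projection trajectory."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tpr = tp / (tp + fn) if tp + fn else 0.0   # inhibitor detection rate
    tnr = tn / (tn + fp) if tn + fp else 0.0   # non-inhibitor detection rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return tnr, tpr, mcc
```

The all-inactive prediction seen in early iterations corresponds to the point (TNR = 1, TPR = 0) with MCC = 0, which is why the trajectories start in the lower right of the projections.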
Changing from a knowledge-based organic chemistry substructure descriptor to physicochemical or pharmacophore descriptors (see Methods) showed similar performance on the training data but somewhat reduced performance on the external data (Fig. S5†). However, given the difference in pan-cancer inhibitor structures (Fig. 1), we wanted to understand which substructural, physicochemical, or pharmacophoric features were contributing to predictive performance. We assessed the dynamic change in feature weights during the active learning cycle, where visualization demonstrated that specific groups of CATS2D descriptors are highly weighted (e.g., acceptor–lipophilic and lipophilic–lipophilic atom pairs), whereas some descriptors were never discriminative enough to be considered in decision rules (Fig. S6†). The frequent use of lipophilic descriptors is reasonable, as small molecules typically must pass through cell membranes to achieve their effect.
One additional test of compound descriptor influence was performed. Here, we used an expanded atom-centric radial neighborhood description of the compounds (see Methods), which has proven very effective in computational modeling but suffers from the drawback that substructures captured in the resulting fingerprints might not be synthetically accessible. For comparisons of organic chemistry descriptor versus dynamic substructure performance, we once again employed the SK-MEL-2 melanoma line, as well as the HCT-15 colon cancer cell line, which had low external MCC performance. The resulting learning curves, similar to Fig. 2, are given in Fig. S7 (SK-MEL-2) and S8† (HCT-15). While there is some gain in MCC and F1 due to the expansion of fingerprints captured in training, it is not a substantial improvement, implying that limits on model performance are more inherent to the compound-response dataset itself and less a function of chemical description. Considering the machine's rate of correct prediction when it predicts a compound to be active (termed the positive predictive value; see Methods), both the knowledge-based and dynamic description representations achieve PPV values between 0.85 and 0.90 (Fig. S7 and S8†).
In addition to MCC and F1, development teams in the pharmaceutical industry wish to know the rate at which actives are detected among compounds selected for screening relative to the baseline active rate, often termed the enrichment factor (EF). Active learning was evaluated with this metric, as well as the recently proposed power metric (PM), which stresses active compound discriminative ability.27 Similar to the MCC and F1 scores, limits on these metrics were reached no later than the 10% data volume level (Fig. S9,† left), for both training and external data. The EF and PM were also evaluated for individual cell lines at the 5% data volume, which demonstrated that compounds chosen by active learning could yield EF values of 10 or higher on the external datasets of many cell lines (Fig. S9,† right), suggesting the efficiency of active learning for picking examples that lead to prediction of novel bioactive molecules.
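Taking the common definition of EF (the active rate within the selected set divided by the active rate in the full pool), the metric can be computed as below; this sketch assumes that definition and does not reproduce the power metric of ref. 27:

```python
def enrichment_factor(picked_active, pool_active):
    """EF = (fraction of actives among picked compounds)
         / (fraction of actives in the full pool).
    picked_active / pool_active: iterables of booleans (True = active)."""
    picked = list(picked_active)
    pool = list(pool_active)
    hit_rate = sum(picked) / len(picked)
    base_rate = sum(pool) / len(pool)
    return hit_rate / base_rate
```

With a pool containing 10% actives, a selection in which half the compounds are active gives EF = 5; an EF of 10 on a 6–8% active pool, as reported here, means the selected compounds are active at close to the maximum achievable rate.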
Impact of cell doubling time
The histopathological and molecular characteristics of tumors vary extensively. For example, those with mesenchymal or proliferative subtypes in a primary tumor site often lead to worse prognoses than those with differentiated or immunoreactive subtypes. Subtypes and prognoses are impacted by the speed at which tumors replicate. Since the NCI-60 data incorporates this aspect of tumors by including multiple samples for each primary site but uniformly subjects all cell lines to the same fixed time period of incubation with inhibitors and measurement of growth inhibition, we investigated the distribution of predictive ability within a primary tumor site and with respect to tumor growth speed.
Fig. 3 shows the MCC value for each combination of cell line, chemical descriptor, and compound picking strategy tested, where evaluation uses the external dataset splits for each line, and where active learning was again permitted to pick up to 50% of the training data. First, intra-site variation was explored. As shown in Fig. 3a, predictive ability can fluctuate between samples of the same primary origin. For example, NSCLC prediction performance could differ by more than 0.1 in MCC, which is substantial given the size of the external prediction data (∼20 000 compounds per experiment). Similar results were observed for ovarian and renal tumors and for melanoma cell lines. Tables of MCC and F1 values (mean and standard deviation) for each cell line, chemical descriptor, and picking strategy at data volumes of 5/10/15/20% are provided as ESI.†
Fig. 3. Intra-site variability and doubling time impact. (a) The ability to predict inhibition in the external datasets for each of the 60 cell lines, shown for the random and curiosity pickers applied to each of three compound representations (columns). The predictability of cell lines from the same tumor site shows variability, consistent even when interchanging descriptors. Upper case “C/R” indicate picking by curiosity and random strategies. Lower case “m/c/p” indicate use of the MACCS/CATS2D/physicochemical descriptors. (b) The same prediction performances, reordered in terms of cell line doubling time. (c) The correlation between cell line doubling time and average MCC performance on the external prediction set, where 10% of the available training data has been picked by each strategy, based on convergence results in Fig. 2. (d) Correlations when 50% of the available training data is picked, as in panels (a) and (b). Two-sided p-values were computed for the regression slope hypothesis test whose null hypothesis is that the slope is zero.
Next, we reconsidered the results of Fig. 3a with cell lines ordered by their reported doubling time. As shown in the heatmap of Fig. 3b, there is a clear trend between doubling time and a reduction in predictive ability. The trend was statistically significant, even for the random compound picking method, as shown in Fig. 3d. Stopping the active learning process after picking only 10% of the available compound-bioactivity data points and repeating the analysis, MCC values were slightly reduced, but the trend between doubling time and predictive ability remained statistically significant for the curiosity and random picking strategies, as shown in Fig. 3c. While the trend between predictive ability and doubling time was not statistically significant at the 0.05 level for the greedy picker, the direction of the trend is nonetheless present.
The trends in Fig. 3 are intuitive to rationalize. Cell lines that replicate faster within the allocated evaluation time generally lend themselves better to sigmoidal dose–response curve fitting and GI50 calculation. For slower-replicating lines, this fitting process may encounter problems at lower inhibitor concentrations, and inhibition is not evident until high concentrations are evaluated (e.g., pGI50 distributions in Fig. S1†). This potentially impacts the ability to correctly measure the inhibition ability of an inhibitor at a fixed concentration against an identical number of tumor cells, and results in increased reports of non-activity. Given the natural imbalance away from active compounds in the data, additional inactives arising as an artefact of a fixed incubation time skew models even further toward broad prediction of inactives (Fig. 2c and d). This makes the prediction of actives even more challenging, and the penalty for mistakes in predicting actives becomes even higher under the MCC metric. Framed in the context of previous discussion about the use of multiple metrics,31 such policy is key to forecasting machine learning's ability to automate personalized oncology.
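To illustrate the censoring effect described above, consider an idealized sigmoidal response and a crude GI50 estimate by log-linear interpolation at the 50% inhibition crossing. The actual NCI-60 pipeline fits full dose–response curves, so this is only a sketch with hypothetical function names:

```python
import math

def hill_inhibition(conc, gi50, slope=1.0):
    """Fractional growth inhibition from an idealized sigmoidal dose-response."""
    return 1.0 / (1.0 + (gi50 / conc) ** slope)

def interpolate_gi50(concs, inhibitions):
    """Log-linear interpolation at the 50% crossing. Returns None when
    inhibition never reaches 50% over the tested range, i.e., the censored
    outcome that inflates 'inactive' labels for slow-growing lines."""
    points = sorted(zip(concs, inhibitions))
    for (c0, i0), (c1, i1) in zip(points, points[1:]):
        if i0 < 0.5 <= i1:
            f = (0.5 - i0) / (i1 - i0)
            return 10 ** (math.log10(c0) + f * (math.log10(c1) - math.log10(c0)))
    return None
```

When a slow-growing line never reaches 50% inhibition within the fixed incubation window, the estimator returns `None` and the compound is reported as inactive, feeding the imbalance discussed above.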
Impact of active compound ratio
Machine learning can result in the construction of biased estimators (models) when the endpoint ratio becomes exceedingly disproportionate, as is the case in most quantitative high-throughput screening (qHTS) studies. To investigate the effect of endpoint ratio on our results, ratios of active compounds were computed for each cell line, and we considered how picking algorithms would be impacted by such ratios. As shown in a previous study in a chemogenomic context,24 curiosity-based picking tends to select balanced numbers of active and inactive compounds regardless of how imbalanced the training datasets are. However, the temporal dynamics of the bioactive compound pick ratio are different for the NCI-60 lines because of the extreme (>90%) dominance of inactives, and here the impact on ratio dynamics when using the greedy strategy is potentially interesting.
First, we examined the relationship between the fraction of available bioactives picked at the 10% volume and the resulting external MCC performance. Fig. 4a shows the relationship for each of the three pickers. When compounds were picked randomly, an increase in actives picked correlated directly with improved performance. For cell lines in which the greedy picker increasingly selected up to 50% of data as actives, performance improved, yet there were also cell lines in which the greedy method picked more inactives than actives at this stage, and in these cases performance worsened. Active learning with the curiosity picker chose fewer actives than the greedy method but more than random picking; performance was stable across lines at MCC ∼0.5 and was not strongly impacted by the underlying active compound ratio. This latter result matches our previous findings in both chemogenomic and single-target studies.22,24,32
Fig. 4. Ratios of active compounds picked and model performances. Amidst strong dataset imbalance across cell lines, strategic pickers choose compounds that yield better performances (panels (a) and (c) at 10% and 50% of available data picked). Where external performance has almost peaked at 10% picked (e.g., Fig. 2), the compounds picked by the curiosity strategy produce a stable prediction performance across all cell lines (panel (a)). The detailed temporal dynamics of pick ratio are shown for the SK-MEL-2 melanoma line in panel (b).
We considered the temporal dynamics of the pick ratio for the SK-MEL-2 line, whose underlying ratio of actives is 6%. The greedy and curiosity pickers sustain respective active pick ratios of 50% and 40% up through the first 5% of the dataset (Fig. 4b), after which the pick ratios continuously decline toward the underlying ratio. This data volume is also where the slope of the data-MCC curve (Fig. 2a and b) shifts from a high value to a low value, meaning that asymptotic limits of prediction are being approached, a fact reconfirmed by the active projections (Fig. 2c and d).
Increasing the volume of compounds picked to 50%, all picking methods demonstrated a statistically significant correlation between predictive performance and the fraction of available actives picked (Fig. 4c). Note, however, that not all actives in the training data were picked (see Fig. S10† for absolute pick counts using the SK-MEL-2 and A549/ATCC lung cell lines). While the performance of the random picker at the 50% data volume drew closer to the performances of the curiosity and greedy pickers, it is clear that the non-random pickers systematically pick more of the informative actives at either data volume, contributing to improved model performance. The raw data relating metrics to active compound ratio are shown in Fig. S11a and b† for MCC, with analogous analysis using F1 in Fig. S11c–f.†
Structure–response discontinuities causing inherent predictive difficulty
To further investigate what prevents the models from achieving higher evaluation scores, we considered the issue of “discontinuity” in each cell line. Discontinuity can be defined as the fraction of compound pairs that are similar in terms of their structures and numerical representations but carry opposite bioactivity endpoint labels. As shown in Table 2, systematic estimation reveals thousands of training-external discontinuities for the SK-MEL-2 and MDA-N melanoma, A549/ATCC lung, and HCT-15 colon line datasets, whether the fixed-substructure MACCS or dynamic ECFP representation is used. Two such discontinuities (or activity cliffs) are shown in Fig. 5a; in the first pair, two compounds sharing a 5-ring conjugated scaffold are decorated with differing aromatic rings, with one compound bearing a methyl group in the para position relative to where the aromatic ring bonds to the larger ring system. With respect to the SK-MEL-2 cell line, their difference in inhibitory activity is 1.2 log units, yet their descriptor representations may be nearly if not fully identical. In this case, the inactive external compound of the pair could be predicted correctly; in the other discontinuity pair of Fig. 5a, the active external compound could not be predicted correctly.
Table 2. The frequency of training-external activity discontinuities in a selection of cell lines. Discontinuities, also termed activity cliffs, are compounds with similar structure but opposite activity labels. Here, chemical similarity was defined by 166 bit MACCS representations having Tanimoto similarity (Tc) values of 0.8 or higher, or by 1024 bit ECFP (radius 2) representations having Tc values ≥ 0.5. A number of ECFP-based train-external discontinuities have Tc ≥ 0.8. Mean counts and standard deviations generated by repeated train-external splits and subsampling are used as estimates of the unknown global population of discontinuities.
| Cell line | MACCS-based (Tc ≥ 0.8) | ECFP-based (Tc ≥ 0.8) | ECFP-based (Tc ≥ 0.5) |
| --- | --- | --- | --- |
| SK-MEL-2 | 3502.4 ± 112.15 | 43.4 ± 4.84 | 1244.8 ± 18.15 |
| A549/ATCC | 3847.8 ± 200.82 | 52.4 ± 4.96 | 1322.4 ± 55.00 |
| MDA-N | 4623.0 ± 79.22 | 63.2 ± 6.05 | 1201.0 ± 24.45 |
| HCT-15 | 4620.6 ± 349.2 | 61.8 ± 12.06 | 1619.2 ± 83.33 |
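As a concrete illustration, discontinuity counting reduces to enumerating compound pairs above a similarity threshold with opposite activity labels. The following minimal Python sketch uses hypothetical fingerprints represented as sets of on-bit indices (the study used 166 bit MACCS and hashed ECFP representations); it counts discontinuities within one compound set, whereas the train-external variant would simply restrict pairs to one training and one external member:

```python
from itertools import combinations

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def count_discontinuities(compounds, threshold=0.8):
    """Count pairs that are similar (Tc >= threshold) but carry opposite labels.

    `compounds` maps a compound ID to (fingerprint_on_bits, is_active)."""
    count = 0
    for (_, (fp_a, act_a)), (_, (fp_b, act_b)) in combinations(compounds.items(), 2):
        if act_a != act_b and tanimoto(fp_a, fp_b) >= threshold:
            count += 1
    return count

# Toy example with hypothetical fingerprints (on-bit index sets):
data = {
    "cpd1": ({1, 2, 3, 4, 5}, True),
    "cpd2": ({1, 2, 3, 4, 6}, False),     # Tc = 4/6 ≈ 0.67 vs cpd1: not counted
    "cpd3": ({1, 2, 3, 4, 5, 6}, False),  # Tc = 5/6 ≈ 0.83 vs cpd1: discontinuity
}
print(count_discontinuities(data, threshold=0.8))  # → 1
```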
Fig. 5. The relationship between discontinuity and model performances. (a, top) Included in Table 2, compounds 686554 and 698244 have similar structures, but for SK-MEL-2 the pGI50 of training compound 686554 is 6.058 (874 nM), thus classified as active, whereas the pGI50 of external compound 698244 is 4.824 (15 μM), labeled as inactive. In five runs of AL using this specific train-external split, the training compound 686554 was picked by the curious strategy at 34.86%, 38.28%, and 44.32% of data, and not picked in the other two runs. Models using either MACCS or ECFP correctly predicted the inactivity of 698244 in most iterations. (a, bottom) 667912 (training, inactive in SK-MEL-2) was picked in 5 out of 5 runs at 34.79%, 42.6%, 47.34%, 48.67%, and 47.52%. 375501 (external, active) was correctly predicted only at 1–7 iterations using MACCS keys and 3–5 iterations using ECFP bits (incorrect at more than 9790 iterations). (b) Near the external performance convergence at 10% data picked, the metrics for the greedy picking method decline as datasets have more discontinuities, implying the intrinsic difficulty of prediction caused by the datasets themselves. The overall trend was the same when evaluated by F-measure or by other descriptors.
Quantitative investigation confirmed the expectation that increases in discontinuity pairs result in reduced predictive performances (Fig. 5b). This trend is in accordance with a previous finding about the performance of compound potency prediction models with respect to discontinuity in the dataset.32 The trend was particularly clear and statistically significant when active learning was stopped at the 10% data volume and using the curiosity or greedy picking strategies (Fig. 5b). In contrast, at the 50% data volume, the trend was not present (Fig. S12†). It is not surprising that the trend (non-zero regression slope) disappears at the 50% volume when one considers the inherent data ratio in each cell line and the expected ratio of actives to inactives picked at the end of training (Fig. 4b).
Another difference between the results at 10% and 50% can be rationalized by considering the structure of the decision trees. At the larger data volume, trees will have substantially more decision leaves built from the increased number of compounds available for tree construction, yielding prediction performances that are more invariant to the fraction of discontinuities in the training data. However, those trees will be full of leaves that contain only one compound satisfying a set of conditions, with neighboring leaf nodes built from matching discontinuity pairs in the training data (which can be estimated to be present at the same rate as in the train-external splits), and it is unlikely that such decision nodes extrapolate well to unseen data (e.g., the difference in retrospective versus semi-prospective performance in Fig. 2). When active learning uses only up to 10% of the training data, such discontinuity pairs will be less frequent in the data subsets given to each decision tree for rule construction.
Chemical scaffolds and their predictability
We computed the metrics of predictions for specific compound scaffolds (Bemis–Murcko frameworks) in order to investigate the correlation between scaffolds and their predictability. Fig. 6 shows the MCC and F1 scores of individual scaffolds during retrospective active learning on the A549/ATCC cell line, where scaffolds analyzed by MCC are restricted to those containing at least one compound active against A549/ATCC and at least one inactive compound. Scaffolds become predictable, and compounds in a scaffold become discernible for (in-)activity, at different stages of the learning process, with the predictive ability on most scaffolds determined after learning between 10–30% of available data (Fig. 6, left).
Fig. 6. MCC and F1 for compound scaffolds. [Upper panels] Bars on the upper plots represent the number of active and inactive compounds per scaffold (BM framework; benzene framework excluded). Frameworks without both active and inactive annotations do not qualify for MCC calculation and are excluded in the left half, whereas any framework with active annotations can be evaluated by F1 and is included in the right half. The vertical axis for F1 scaffolds is truncated at 25 compounds for visualization purposes. [Middle panels] The evolution of metric performance is evaluated exclusively using each scaffold's compounds. Blank/white areas indicate that the metric could not be computed from the predictions (e.g., all compounds predicted as inactive at a given iteration). Six frameworks representing different performance dynamics are labeled (A)–(F), with their chemical structures shown and the numbers of (in)active compounds per framework noted.
Focusing exclusively on cytotoxic compounds (bioactives) and their scaffolds, many scaffolds were highly predictable early in the active learning process, as measured by F1 (Fig. 6, right). In many cases, the scaffolds that were predictable early on were those containing either a single compound or only a few compounds, all of which were active; it is therefore fair to expect that the decision trees found highly similar descriptor patterns for compounds in these types of scaffolds and were successful in classifying them as bioactive/cytotoxic, as evidenced by the F1 values. An example scaffold whose predictability emerged over the course of learning is shown at the bottom of Fig. 6 (scaffold B). For scaffolds with many more compounds, dominantly inactive, F1 values were either unfavorable or mixed, which is an intuitive result.
At this juncture, we also performed an active learning prediction challenge using a dataset split based on compound scaffolds (Fig. S13†), where we selected scaffolds in the lower half of average MCC in Fig. 6 and held them out as the test dataset (2116 compounds; the remaining approximately 37 500 compounds could be picked and learned from). MCC and F1 performances on the external dataset were lower than for the random split, though PPV values were consistent at 0.8, leading us to deduce that the gap in MCC/F1 is due to false negatives. PPV was similar for the random splits of this NSCLC line (data not shown). Combining these facts yields the interpretation that, even without additional computation of train-external discontinuities, training nearest neighbors of the scaffold-based external set contained inactives which led the model to predict incorrectly, particularly for the scaffolds with negative MCC values in the original experiment (Fig. 6).
Discussion and conclusion
In applying active learning, regardless of the chemical descriptor used, models were able to converge on a maximum MCC performance of approximately 0.6 using 10% of available compound-bioactivity data points. This indicates that despite the strong active–inactive imbalance in response data, an accurate model of each cell line can be obtained using only a small portion of informative compound data. At the 10% data volume, models predicting compounds as active were correct approximately 85% of the time (Fig. S7 and S8†). Using more data points did not guarantee better model performance. This is in accordance with previous reports on active learning using different prediction objectives.22,25,33–36 However, compared to biochemical assay active learning, cellular-level active learning contains more uncertainty and factors beyond one's control, which reinforces the efficiency of the technique.
This was also the case when we evaluated models using the F1 score, as shown in Fig. 2b. The visualization using metric surfaces (Fig. 2c and d) revealed that models first achieved high TNR with low TPR, indicating that models initially predict most compounds as inactive. This is reasonable given not only the active-to-inactive ratio in the screening data, but also that even the curiosity picker will find challenging examples within the vast pool of inactive compounds, and may be preferentially drawn to inactive compound spaces early in the active learning process. Despite its technical formulation to preferentially select the compound most likely to be an active, the greedy picker did not jump to active ratios as high as 80% or 90%, as it did in chemogenomic scenarios (Fig. 4c). This means that a number of compounds were picked by the greedy strategy in anticipation that they would be actives, but they were ultimately inactive as per the 10 μM threshold applied in this work. Beyond the first 1% of data, the actives picked and learned from led to increases in F1 and MCC scores as training continued (Fig. 2c and d and 6). It was clear that strategic non-random picking led to a higher selection frequency of actives (Fig. 4b and S10†), and with better model balance, better prediction performances on external datasets (Fig. 4).
The MCC and F1 model metrics vary considerably across cell lines (Fig. 3a). While lines were consistently incubated for 48 hours prior to drug addition, we could identify that cell line doubling time had an impact on the interpretation of inhibition measurements, the resulting classifications, and the resulting model performances (Fig. 3b). Future efforts to automate screening and mapping of each patient's chemical response terrain will need to incorporate this biological background into protocol design and readout interpretation. Previous studies demonstrated that gene expression is a valuable marker in predicting response to a compound,15–17 but that compound descriptors were weighted higher in feature analyses once sufficient chemical response data was available; it is therefore foreseeable that gene expression from a patient can be used to suggest informative initial rounds of screening compounds, the results of which can then be fed to a chemical-centric instance of active learning.
We also discovered that there were a non-trivial number of compound-activity discontinuities in the datasets (Table 2 and Fig. 5), even though there is a moderate difference in compound structures between the training and test datasets (Fig. S2†). The quantitative evaluation of the frequency of such discontinuities in a cell line helps us formulate expectations on the ability to predict cytotoxic behavior of compounds, and there was a trend between the frequency of discontinuities for a cell line and the resulting prediction performance on external datasets (Fig. 5). In these experiments, we applied an objective 10-fold gap as the criterion to clearly separate compounds into active and inactive classes; such a gap may be insufficient if there is substantial variation in GI50 readout values for individual cell lines, and again, if there are experimental conditions such as fixed incubation and titration time which have the potential to impact the computed inhibitory values. While we have applied an objective non-zero gap to separate the data classes, it is possible that measurement variation nonetheless contributed to a portion of the discontinuities in the dataset. In future experiments with new cell lines, preliminary control experiments to determine readout variance from replicate measurements, as well as establishment of appropriate classification bounds, will contribute to refined expectations of model utility.
We found that the curiosity-based picking strategy was generally stable in terms of prediction performance regardless of the active–inactive data ratio (Fig. 4). The curiosity strategy has a strong relationship with the concept of the support vector machine algorithm, that is, to identify the most challenging “border cases” as the informative examples by which other examples can be best classified. The difference in the two is that the SVM algorithm uses all examples available for one-time calculation of a model, whereas curiosity-based active learning incrementally adds informative examples in an iterative fashion.
One question that naturally arises is how to push prediction performance higher with NCI-60 data. As mentioned above, we envision that gene expression contributes to pools of initial screening data, though our previous results do not suggest that gene expression will boost predictive performance once hundreds or thousands of compounds and their readouts are available.23 With respect to model building, techniques such as the synthetic minority oversampling technique (SMOTE)37 are one option to generate models with more actives, at least for the greedy or random methods. Curiosity would not pick the re-sampled instances because, similar to the discussion of SVM above, the resamplings would not be considered curious instances beyond the first time they are picked. SMOTE was recently investigated by the Hareesha group on small-scale chemogenomic datasets and found to improve F1 and other measures by a few percentage points.38 Another alternative would be undersampling, in which the dominant inactive class would be reduced in size prior to modeling; a collection of undersampling techniques is available from the imbalanced-learn project (imbalanced-learn.org). Active learning demonstrates that undersampling is a viable strategy for virtual screening in drug discovery (e.g., Fig. 4).
Overall, this work demonstrates the practicality of active learning in cancer inhibitor modeling and prediction as a key element in future automated laboratories, as well as the quantitative demonstration of limits of active prediction using compound libraries exhibiting structure–activity discontinuities. Selection of compounds for modeling using greedy and curiosity strategies results in faster coverage of the limited actives, and this can be critical for continual discovery of multiple inhibitors which can then be considered for further optimization or potential application as-is. Application of active learning should also help reduce screening costs, an important factor if personalized screening and therapy discovery is to become a reality. If the volume of compounds screened can be kept to a minimum, then it will also be possible to apply decision tree visualization methods (e.g., as performed by Polash and colleagues for deciphering chemogenomic model decision mechanisms22) and identify features which have differentiation ability at an individual level.
To achieve full automation for all of these goals will require substantial efforts but is rationally achievable. There is much work to be done, and the coming developments will be encouraging.
Methods
Datasets
The NCI-60 human tumor cell line panel and compound GI50 measurements were obtained in their original form from the NCI website (https://wiki.nci.nih.gov/display/NCIDTPdata/NCI-60+Growth+Inhibition+Data, June 2016 release). The original dataset contained 3 054 004 datapoints. First, duplicate entries were removed and replaced with entries having averaged endpoint values. Compounds with average GI50 values less than 1 μM were classified as active, and those with average GI50 greater than 10 μM were classified as inactive. Compounds of intermediate activity were omitted. A washing procedure to strip salts, perceive chirality, and eliminate molecules with metal atoms was executed (OpenEye OEChem). The resulting data retained 2 443 141 data points. The distributions of the pGI50 values before and after the dataset split are shown in Fig. S1† for three cell lines.
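The averaging and thresholding steps above can be sketched as follows. The compound identifiers and GI50 values are illustrative, and `classify_compounds` is a hypothetical helper name; the thresholds implement the 1 μM/10 μM class boundaries with intermediate compounds omitted:

```python
from statistics import mean

def classify_compounds(measurements, active_um=1.0, inactive_um=10.0):
    """Average replicate GI50 values (in μM) per compound, then label:
    active if mean < active_um, inactive if mean > inactive_um,
    and drop intermediate compounds (the 10-fold gap described above)."""
    labels = {}
    for cpd, values in measurements.items():
        avg = mean(values)
        if avg < active_um:
            labels[cpd] = "active"
        elif avg > inactive_um:
            labels[cpd] = "inactive"
        # compounds of intermediate activity are omitted
    return labels

gi50 = {"cpdA": [0.4, 0.6], "cpdB": [12.0, 20.0], "cpdC": [3.0, 4.0]}
print(classify_compounds(gi50))  # cpdC falls in the 1–10 μM gap and is dropped
```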
Globally, 9.4% (230 580) compound-cell line points were classified as active, with the remaining 90.6% (2 212 561) points classified as inactive. Cell lines with more than 20 000 points were used for classification experiments, resulting in 60 cell lines tested by active learning, though inhibitory activity of compounds against all cell lines was used for results reported in Table 1. More than 200 compounds were active against the 60 cell lines tested.
For the retained compounds, we computed the MACCS 166 bit fingerprint, the CATS2D39 representation available in the Dragon 7 package (Talete s.r.l.40), and the physicochemical descriptors available in the Dragon 7 package (descriptor blocks 1, 2, and 20 corresponding to descriptors 1–79 and 4839–4855). MACCS and ECFP (Rogers and Hahn) representations were computed using the implementation provided in the OpenEye OEChem libraries (ECFP radius 2, 4096 bit hash or 1024 bit hash for acceleration of discontinuity calculation). In some instances, calculation of descriptors failed, notably for physicochemical descriptors; in these situations, the compounds and their endpoints were removed from the data used in active learning experiments. We mainly focused on the MACCS fingerprint because of its computational simplicity and interpretability.
Machine learning
Each cell line was modeled and assessed for predictability independently. Compounds in each cell line were split into equal-size subgroups for model construction (training) by active learning and for external prediction, with equal numbers of actives and inactives per split. Train-test splitting was performed five times per cell line. Active learning was performed using in-house scripts24,41 based on the random forest implementation available in Scikit-learn, version 0.20.3.42 Random, curious (maximum uncertainty as quantified by prediction vote variance), and greedy (maximum vote confidence in prediction) picking methods were tested. Active learning experiments for each train/test split were performed five times, where differing seed values result in different starting points in the active/inactive compound space, for a total of 5 × 5 = 25 experiments per cell line.
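The three picking strategies can be sketched as a selection rule over per-compound ensemble vote fractions (the fraction of random forest trees voting "active"). This is a simplified stand-in, not the in-house implementation: `pick_next` and its inputs are hypothetical names, and the actual experiments used scikit-learn random forest votes:

```python
import random

def pick_next(unlabeled_ids, vote_fractions, strategy, rng=random.Random(0)):
    """Select the next compound to screen.

    random  - uniform choice from the unlabeled pool
    curious - maximum vote uncertainty, i.e., maximum vote variance p*(1-p),
              which peaks where the vote fraction is closest to 0.5
    greedy  - maximum confidence that the compound is active"""
    if strategy == "random":
        return rng.choice(list(unlabeled_ids))
    if strategy == "curious":
        return max(unlabeled_ids,
                   key=lambda c: vote_fractions[c] * (1 - vote_fractions[c]))
    if strategy == "greedy":
        return max(unlabeled_ids, key=lambda c: vote_fractions[c])
    raise ValueError(f"unknown strategy: {strategy}")

votes = {"c1": 0.95, "c2": 0.50, "c3": 0.10}
print(pick_next(votes, votes, "curious"))  # c2: maximal vote uncertainty
print(pick_next(votes, votes, "greedy"))   # c1: highest predicted-active vote
```

In a full loop, the picked compound's label would be revealed, added to the training set, and the forest retrained before the next pick.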
Model evaluation
Once every 10 iterations of compound picking, models were evaluated for their prediction performance on the full training and external datasets. The Matthews correlation coefficient (MCC43) and F1 were employed as the principal benchmarks. F1 employs the positive predictive value (PPV). Active projection26 uses the true negative rate (TNR) and true positive rate (TPR) metrics as axes. A subset of experiments was evaluated with the power metric (PM) and enrichment factor (EF) metrics. Metrics were computed using Scikit-learn modules or by in-house Python and Rust language scripts.
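For reference, the standard definitions of these metrics, written in terms of the confusion-matrix counts TP, TN, FP, and FN, are:

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

\mathrm{PPV} = \frac{TP}{TP+FP}, \qquad
\mathrm{TPR} = \frac{TP}{TP+FN}, \qquad
\mathrm{TNR} = \frac{TN}{TN+FP}

F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}}
```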
Discontinuity calculation
In-house software was developed to calculate discontinuities in compound-bioactivity data. The Tanimoto measure was used to measure similarity of compounds. We retained compound pairs with a similarity of 0.8 or higher, and further retained only those compound pairs with opposite bioactivity labels (e.g., one compound as active in a cell line and the other as inactive).
The count of discontinuities detected was divided by the number of compound pairs for which pairwise similarity was computed. Due to software limitations in our implementation, we repeatedly subsampled 10 000 compounds from each cell line and used the mean rate of discontinuities to estimate the true discontinuity rate.
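The subsampling estimator can be sketched as below. The fingerprints are toy on-bit sets and `estimate_rate` is a hypothetical helper name, but the logic (repeated random subsamples, then the mean and standard deviation of the per-sample discontinuity rate) mirrors the procedure described:

```python
import random
from itertools import combinations
from statistics import mean, stdev

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def discontinuity_rate(compounds, threshold=0.8):
    """Fraction of evaluated pairs that are discontinuities."""
    pairs = disc = 0
    for (fp_a, act_a), (fp_b, act_b) in combinations(compounds, 2):
        pairs += 1
        if act_a != act_b and tanimoto(fp_a, fp_b) >= threshold:
            disc += 1
    return disc / pairs if pairs else 0.0

def estimate_rate(compounds, sample_size, repeats=5, threshold=0.8, seed=0):
    """Estimate the cell-line-wide rate from repeated random subsamples."""
    rng = random.Random(seed)
    rates = [discontinuity_rate(rng.sample(compounds, sample_size), threshold)
             for _ in range(repeats)]
    return mean(rates), stdev(rates)

compounds = [({1, 2, 3, 4, 5}, True), ({1, 2, 3, 4, 5, 6}, False),
             ({7, 8}, True), ({7, 8, 9}, False),
             ({1, 2}, True), ({5, 6}, False)]
m, s = estimate_rate(compounds, sample_size=4, repeats=3)
print(round(m, 3), round(s, 3))
```

In the study itself, sample_size was 10 000 compounds per cell line, with the mean and standard deviation reported in Table 2.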
Ratio-performance statistical significance tests
To evaluate the correlation between discontinuity or active compound ratio against the value of a metric when using all 60 cell lines, two-sided p-values were computed for the regression slope hypothesis test whose null hypothesis is that the slope is zero. The test was performed by calling the stats.linregress implementation in SciPy (version 1.2.1).44
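The underlying computation can be sketched without SciPy: the ordinary least-squares slope and its t statistic under the null hypothesis of zero slope, which scipy.stats.linregress then converts into a two-sided p-value via the t distribution with n − 2 degrees of freedom. The helper name below is hypothetical:

```python
from math import sqrt
from statistics import mean

def slope_t_statistic(x, y):
    """OLS slope and the t statistic for H0: slope = 0."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # residual standard error with n - 2 degrees of freedom
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    stderr = sqrt(sse / (n - 2)) / sqrt(sxx)
    return slope, slope / stderr

# Toy data: metric value (y) versus discontinuity count (x, rescaled)
x = [1, 2, 3, 4, 5]
y = [1.1, 2.0, 2.9, 4.2, 4.8]
slope, t = slope_t_statistic(x, y)
print(round(slope, 2), round(t, 1))  # slope 0.96; large |t| implies small p
```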
Compound scaffold grouping
In-house software built on top of the OpenEye OEMedChem library was used to divide compounds into groups based on their Bemis–Murcko scaffolds (or “frameworks”, that is, a compound's set of rings and the minimum linkers necessary to connect the rings). Compounds were then grouped by the scaffold to which they belong, and per-scaffold MCC and F1 analyses of active learning prediction performance were executed.
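Given scaffold keys precomputed by a chemistry toolkit, the per-scaffold evaluation reduces to grouping prediction records and computing MCC per group. A minimal sketch with hypothetical record data follows (scaffold extraction itself is not shown, as it requires a toolkit such as the OEMedChem library used here); note that groups lacking both labels yield an undefined MCC, matching the exclusion rule in Fig. 6:

```python
from collections import defaultdict
from math import sqrt

def per_scaffold_mcc(records):
    """Compute MCC per scaffold group.

    `records` is an iterable of (scaffold_key, true_label, predicted_label)."""
    groups = defaultdict(list)
    for scaffold, truth, pred in records:
        groups[scaffold].append((truth, pred))
    scores = {}
    for scaffold, pairs in groups.items():
        tp = sum(1 for t, p in pairs if t and p)
        tn = sum(1 for t, p in pairs if not t and not p)
        fp = sum(1 for t, p in pairs if not t and p)
        fn = sum(1 for t, p in pairs if t and not p)
        denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        # MCC is undefined when any confusion-matrix margin is zero
        scores[scaffold] = (tp * tn - fp * fn) / denom if denom else None
    return scores

records = [("scafA", True, True), ("scafA", False, False),
           ("scafA", True, False),
           ("scafB", True, True), ("scafB", True, True)]
print(per_scaffold_mcc(records))  # scafB has no inactives: MCC undefined (None)
```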
Conflicts of interest
J.B. Brown declares a potential conflict of interest as a consultant for the pharmaceutical industry. He declares a grant from Daiichi-Sankyo Pharmaceutical Company (TaNeDS grant A60093) which was not involved in the conceptualization, execution, or reporting of this work.
Supplementary Material
Acknowledgments
This work was supported through the Japan Society for the Promotion of Science, in grants 16H06306 (to ST, JB) and 17K20043 (JB). It was also supported by the JSPS Core-to-Core A: Advanced Research Networks program. TN expresses appreciation to the Japan Student Services Organization (JASSO) for related support during the execution of this work. The authors thank Prof. Masatoshi Hagiwara of Kyoto University, and Dr. Fathi Elloumi, Dr. Sudhir Varma, Mr. William Reinhold, and Dr. Yves Pommier of the National Cancer Institute (USA) for discussions during the development of this work. The authors gratefully acknowledge an academic license for software from OpenEye Scientific Software. Software developed by the open source software development community also contributed in part to this work.
Footnotes
†Electronic supplementary information (ESI) available: Raw data of MCC and F1 performance for prediction on each cell line is provided (60 lines × 3 pickers × 3 descriptors = 540 mean and standard deviation of performance metrics), for experiments at the level of picking 5/10/15/20% of training data. Also, the endpoint-annotated vectorial representations of datasets that were used (e.g. Fig. S5) are provided. See DOI: 10.1039/d0md00110d
References
- Yokoyama A., Kakiuchi N., Yoshizato T., Nannya Y., Suzuki H., Takeuchi Y., Shiozawa Y., Sato Y., Aoki K., Kim S. K. Nature. 2019;565:312. doi: 10.1038/s41586-018-0811-x. [DOI] [PubMed] [Google Scholar]
- Murakami R., Matsumura N., Brown J. B., Higasa K., Tsutsumi T., Kamada M., Abou-Taleb H., Hosoe Y., Kitamura S., Yamaguchi K. Am. J. Pathol. 2017;187:2246–2258. doi: 10.1016/j.ajpath.2017.06.012. [DOI] [PubMed] [Google Scholar]
- Arcaini L., Rossi D., Paulli M. Blood. 2016;127:2072–2081. doi: 10.1182/blood-2015-11-624312. [DOI] [PubMed] [Google Scholar]
- Shoemaker R. H. Nat. Rev. Cancer. 2006;6:813. doi: 10.1038/nrc1951. [DOI] [PubMed] [Google Scholar]
- Cortés-Ciriano I., van Westen G. J. P., Bouvier G., Nilges M., Overington J. P., Bender A., Malliavin T. E. Bioinformatics. 2015;32:85–95. doi: 10.1093/bioinformatics/btv529. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Garraway L. A., Widlund H. R., Rubin M. A., Getz G., Berger A. J., Ramaswamy S., Beroukhim R., Milner D. A., Granter S. R., Du J. Nature. 2005;436:117. doi: 10.1038/nature03664. [DOI] [PubMed] [Google Scholar]
- Buzatto I. P. C., Ribeiro-Silva A., de Andrade J. M., Carrara H. H. A., Silveira W. A., Tiezzi D. G. Braz. J. Med. Biol. Res. 2017;50(2):e5674. doi: 10.1590/1414-431X20165674. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ariyan C. E., Brady M. S., Siegelbaum R. H., Hu J., Bello D. M., Rand J., Fisher C., Lefkowitz R. A., Panageas K. S., Pulitzer M. Cancer Immunol. Res. 2018;6:189–200. doi: 10.1158/2326-6066.CIR-17-0356. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hosomi Y., Morita S., Sugawara S., Kato T., Fukuhara T., Gemma A., Takahashi K., Fujita Y., Harada T., Minato K. J. Clin. Oncol. 2020;38:115–123. doi: 10.1200/JCO.19.01488. [DOI] [PubMed] [Google Scholar]
- Cameron D. A., Gabra H., Leonard R. C. F. Br. J. Cancer. 1994;70(1):120–124. doi: 10.1038/bjc.1994.259. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Adamczyk P., Juszczak K., Drewa T. Curr. Opin. Urol. 2017;27(1):80–84. doi: 10.1097/MOU.0000000000000355. [DOI] [PubMed] [Google Scholar]
- Obel J. C., Friberg G., Fleming G. F. Clin. Adv. Hematol. Oncol. 2006;4(6):459–468. [PubMed] [Google Scholar]
- Hamanishi J., Mandai M., Ikeda T., Minami M., Kawaguchi A., Murayama T., Kanai M., Mori Y., Matsumoto S., Chikuma S. J. Clin. Oncol. 2015;33:4015–4022. doi: 10.1200/JCO.2015.62.3397. [DOI] [PubMed] [Google Scholar]
- Hatae R., Chamoto K., Kim Y. H., Sonomura K., Taneishi K., Kawaguchi S., Yoshida H., Ozasa H., Sakamori Y., Akrami M., Fagarasan S., Masuda I., Okuno Y., Matsuda F., Hirai T., Honjo T. JCI Insight. 2020;5(2):e133501. doi: 10.1172/jci.insight.133501. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Staunton J. E., Slonim D. K., Coller H. A., Tamayo P., Angelo M. J., Park J., Scherf U., Lee J. K., Reinhold W. O., Weinstein J. N. Proc. Natl. Acad. Sci. U. S. A. 2001;98:10787–10792. doi: 10.1073/pnas.191368598. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee J. K., Havaleshko D. M., Cho H., Weinstein J. N., Kaldjian E. P., Karpovich J., Grimshaw A., Theodorescu D. Proc. Natl. Acad. Sci. U. S. A. 2007;104:13086–13091. doi: 10.1073/pnas.0610292104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Riddick G., Song H., Ahn S., Walling J., Borges-Rivera D., Zhang W., Fine H. A. Bioinformatics. 2010;27:220–224. doi: 10.1093/bioinformatics/btq628. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Breiman L. Mach. Learn. 2001;45:5–32. [Google Scholar]
- Singh H., Kumar R., Singh S., Chaudhary K., Gautam A., Raghava G. P. S. BMC Cancer. 2016;16:77. doi: 10.1186/s12885-016-2082-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cristianini N., Shawe-Taylor J. and others, An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, 2000. [Google Scholar]
- Xia F., Shukla M., Brettin T., Garcia-Cardona C., Cohn J., Allen J. E., Maslov S., Holbeck S. L., Doroshow J. H., Evrard Y. A. BMC Bioinf. 2018;19:486. doi: 10.1186/s12859-018-2509-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Polash A. H., Nakano T., Takeda S., Brown J. B. Molecules. 2019;24:2716. doi: 10.3390/molecules24152716. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nakano T., Brown J.B. J. Comput. Aided Chem. 2020;21:1–10. [Google Scholar]
- Reker D., Schneider P., Schneider G., Brown J. B. Future Med. Chem. 2017;9:381–402. doi: 10.4155/fmc-2016-0197. [DOI] [PubMed] [Google Scholar]
- Rakers C., Najnin R. A., Polash A. H., Takeda S., Brown J. B. ChemMedChem. 2018;13:511–521. doi: 10.1002/cmdc.201700677. [DOI] [PubMed] [Google Scholar]
- Brown J. B. Future Med. Chem. 2018;10:1885–1887. doi: 10.4155/fmc-2018-0188. [DOI] [PubMed] [Google Scholar]
- Lopes J. C. D., dos Santos F. M., Martins-José A., Augustyns K., De Winter H. J. Cheminf. 2017;9:7. doi: 10.1186/s13321-016-0189-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim S., Chen J., Cheng T., Gindulyte A., He J., He S., Li Q., Shoemaker B. A., Thiessen P. A., Yu B. Nucleic Acids Res. 2018;47:D1102–D1109. doi: 10.1093/nar/gky1033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Goodman L. S. and others, Goodman and Gilman's the pharmacological basis of therapeutics, McGraw-Hill, New York, 1996, vol. 1549. [Google Scholar]
- PubChem Compound record for Doxorubicin, PubChem Identifier: CID 31703.
- Brown J. B. Mol. Inf. 2018;37:1700127. [Google Scholar]
- Ahmadi M., Vogt M., Iyer P., Bajorath J., Fröhlich H. J. Chem. Inf. Model. 2013;53(3):553–559. doi: 10.1021/ci3004682. [DOI] [PubMed] [Google Scholar]
- Rakers C., Reker D., Brown J.B. J. Comput. Aided Chem. 2017;18:124–142. [Google Scholar]
- Kangas J. D., Naik A. W., Murphy R. F. BMC Bioinf. 2014;15:143. doi: 10.1186/1471-2105-15-143. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lang T., Flachsenberg F., von Luxburg U., Rarey M. J. Chem. Inf. Model. 2016;56:12–20. doi: 10.1021/acs.jcim.5b00332. [DOI] [PubMed] [Google Scholar]
- Tong S., Koller D. J. Mach. Learn. Res. 2001;2:45–66. [Google Scholar]
- Fernández A., García S., Herrera F., Chawla N. V. J. Artif. Intell. Res. 2018;61:863–905. [Google Scholar]
- Redkar S., Mondal S., Joseph A., Hareesha K. S. Mol. Inf. 2020;39(5):e1900062. doi: 10.1002/minf.201900062. [DOI] [PubMed] [Google Scholar]
- Schneider G., Neidhart W., Giller T., Schmid G. Angew. Chem., Int. Ed. 1999;38(19):2894–2896. [PubMed] [Google Scholar]
- Kode srl, Dragon (software for molecular descriptor calculation) version 7.0.4, 2016 (kode-solutions.net).
- Reker D. and Brown J. B., in Methods in Molecular Biology, 2018. [DOI] [PubMed] [Google Scholar]
- Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay E. J. Mach. Learn. Res. 2011;12:2825–2830. [Google Scholar]
- Matthews B. W. Biochim. Biophys. Acta, Protein Struct. 1975;405(2):442–451. doi: 10.1016/0005-2795(75)90109-9. [DOI] [PubMed] [Google Scholar]
- Virtanen P., Gommers R., Oliphant T. E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J., van der Walt S. J., Brett M., Wilson J., Millman K. J., Mayorov N., Nelson A. R. J., Jones E., Kern R., Larson E., Carey C. J., Polat İ., Feng Y., Moore E. W., VanderPlas J., Laxalde D., Perktold J., Cimrman R., Henriksen I., Quintero E. A., Harris C. R., Archibald A. M., Ribeiro A. H., Pedregosa F., van Mulbregt P., Vijaykumar A., Bardelli A. P., Rothberg A., Hilboll A., Kloeckner A., Scopatz A., Lee A., Rokem A., Woods C. N., Fulton C., Masson C., Häggström C., Fitzgerald C., Nicholson D. A., Hagen D. R., Pasechnik D. V., Olivetti E., Martin E., Wieser E., Silva F., Lenders F., Wilhelm F., Young G., Price G. A., Ingold G. L., Allen G. E., Lee G. R., Audren H., Probst I., Dietrich J. P., Silterra J., Webber J. T., Slavič J., Nothman J., Buchner J., Kulick J., Schönberger J. L., de Miranda Cardoso J. V., Reimer J., Harrington J., Rodríguez J. L. C., Nunez-Iglesias J., Kuczynski J., Tritz K., Thoma M., Newville M., Kümmerer M., Bolingbroke M., Tartre M., Pak M., Smith N. J., Nowaczyk N., Shebanov N., Pavlyk O., Brodtkorb P. A., Lee P., McGibbon R. T., Feldbauer R., Lewis S., Tygier S., Sievert S., Vigna S., Peterson S., More S., Pudlik T., Oshima T., Pingel T. J., Robitaille T. P., Spura T., Jones T. R., Cera T., Leslie T., Zito T., Krauss T., Upadhyay U., Halchenko Y. O., Vázquez-Baeza Y. Nat. Methods. 2020;17(3):261–272. doi: 10.1038/s41592-019-0686-2. [DOI] [PMC free article] [PubMed] [Google Scholar]