Summary
Large cancer cell line collections broadly capture the genomic diversity of human cancers and provide valuable insight into anti-cancer drug response. Here, we show substantial agreement and biological consilience between drug sensitivity measurements and their associated genomic predictors from two publicly available large-scale pharmacogenomics resources: The Cancer Cell Line Encyclopedia and the Genomics of Drug Sensitivity in Cancer.
In vitro pharmacologic sensitivity studies performed across panels of molecularly characterized cancer cell lines have proved useful in assessing the cellular activity of many compounds, assigning mechanisms of drug action, and determining genetic contexts for distinct cancer vulnerabilities1–6. A recent comparison study7 of the Cancer Cell Line Encyclopedia (CCLE)8 and the Genomics of Drug Sensitivity in Cancer (GDSC)9 reported poor correlations between their pharmacologic data, and questioned the validity of their conclusions. These observations raised important questions for the field about how best to perform comparisons of large-scale datasets, evaluate the robustness of such studies, and interpret their analytical outputs.
To address these questions, we first performed a comparative analysis of CCLE and GDSC drug screening metrics. For this analysis, we used both the 50% inhibitory concentration (IC50) and the Area Under the Curve (AUC; 1-AUC is referred to as Activity Area in CCLE). Importantly, the IC50 values in both datasets were capped at the maximum tested drug concentrations, and the same fixed scale was applied across all compounds (Supplementary Data 1). Of note, while 471 cell lines are present in both the CCLE and GDSC collections and have associated genomic data, only a subset of those have overlapping drug screening data: 82-256 cell lines per compound (median = 94; mean = 157; figure 1a and Supplementary Data 1).
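The capping step can be illustrated with a minimal sketch. The function name, the example concentrations, and the choice to treat an unfitted IC50 (`None`) as the cap are assumptions made for illustration; they are not taken from the CCLE or GDSC pipelines.

```python
def cap_ic50(ic50_um, max_tested_um):
    """Cap an IC50 (in µM) at the highest drug concentration tested.

    Assumption for this sketch: an IC50 that could not be fitted (None)
    is reported as the maximum tested concentration.
    """
    if ic50_um is None or ic50_um > max_tested_um:
        return max_tested_um
    return ic50_um


# Hypothetical values: one IC50 extrapolated beyond the tested range,
# one inside the range, and one that could not be fitted.
raw_ic50 = [25.0, 0.3, None]
capped_ic50 = [cap_ic50(v, max_tested_um=8.0) for v in raw_ic50]
print(capped_ic50)
```

Capping removes extrapolated values, which would otherwise add arbitrary spread among insensitive lines and distort any correlation computed on IC50.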
Our analytical approach was designed to account for the fact that many pharmacologic profiles exhibit highly discontinuous distributions across cancer cell line collections. Whereas a subset of individual lines may show marked pharmacologic sensitivity, the remaining lines—often the vast majority of cell lines in the collection—may be relatively insensitive to a given drug. Such ‘outlier’ distributions are expected as they are frequently observed for drugs that target specific oncogenic dependencies. Given the relative paucity of sensitive outliers, appropriate pharmacologic assessments require multiple drug-sensitive cell lines for each compound and the ability to discern this relevant signal against a background dominated by the insensitive majority. Additionally, small datasets containing exclusively insensitive lines are not expected to display significant correlations given the inherent noise in their drug response data.
In cases where direct GDSC-CCLE comparisons were possible, nearly all compounds (13/15) exhibited AUC and IC50 distributions dominated by drug-insensitive lines, with a much smaller number of drug-sensitive outliers. The complete CCLE and GDSC AUC distributions are shown for each compound as “violin plots” (representative examples in figure 1, all plots in Extended Data figure 1); results for IC50 values are similar (Extended Data figure 1). Ten compounds (saracatinib/AZD0530, erlotinib, lapatinib, nilotinib, crizotinib, nutlin-3, PD-0332991, PHA-665752, PLX4720, sorafenib) exhibited AUC values skewed heavily toward the drug-insensitive end of the spectrum. Notably, several targeted anticancer drugs had very few (if any) drug-sensitive lines in the overlapping set (e.g., 2 for crizotinib, 3 for nilotinib, 2 for NVP-TAE684, and zero for erlotinib or sorafenib; figure 1b,c and Extended Data figure 1). This relative paucity of drug-sensitive cell lines constrained the level of correlation achievable. Nevertheless, a correlation analysis that accounted for the imbalance between the number of sensitive and insensitive cell lines and corrected for differences in the original analytical methodologies yielded good consistency in most cases (Extended Data figure 2 compares the properties of Spearman’s and Pearson’s correlations in this context; see also Supplementary text). Correlation values recomputed with the Pearson coefficient instead of Spearman’s, and with properly capped drug sensitivity metrics, were clearly improved for most drugs compared with the earlier comparison study7 (figure 1d,e, Methods and Supplementary text). Some correlation values remained poor, either because of differences in the actual pharmacological measurements (e.g., nutlin-3, paclitaxel, PHA-665752) or because sensitive lines were present in only one of the cell line collections (e.g., erlotinib, sorafenib), which prevented any meaningful comparison (figure 1c).
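The contrast between the two coefficients on an outlier-dominated distribution can be sketched with a toy example. All values below are hypothetical activity-area-like scores invented for illustration: ten insensitive lines whose small values are deliberately ordered differently in the two datasets (standing in for noise), plus two sensitive lines that agree. The helper functions are a plain-Python sketch and assume no tied values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; this sketch assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: the last two entries are the sensitive outliers.
dataset_a = [0.05, 0.02, 0.08, 0.01, 0.07, 0.03, 0.09, 0.04, 0.06, 0.10, 0.85, 0.90]
dataset_b = [0.06, 0.09, 0.03, 0.10, 0.04, 0.08, 0.02, 0.07, 0.05, 0.01, 0.80, 0.95]

r_pearson = pearson(dataset_a, dataset_b)
r_spearman = spearman(dataset_a, dataset_b)
print(f"Pearson r = {r_pearson:.2f}, Spearman rho = {r_spearman:.2f}")
```

In this contrived case the rank-based coefficient is driven by the ordering of the insensitive majority, whereas Pearson's coefficient is driven by the agreement on the two sensitive lines. This is only an illustration of the general property, not a reproduction of the analysis in Extended Data figure 2.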
To complement this correlation analysis, we used a waterfall plot-based assessment (Extended Data figure 3 shows a schematic of the workflow; further details are provided in the Supplementary text). This analysis confirmed that, on average, 94% of cell lines for the 13 relevant compounds (CCLE mean = 94%, range = 77-100%; GDSC mean = 96%, range = 86-100%; Supplementary Data 2) clustered within a drug-insensitive range (e.g., IC50 values of > 1 µM for most compounds). These waterfall analyses also showed a high consistency of cell line categorization as “sensitive” or “resistant” between CCLE and GDSC data (figure 1d, Extended Data figure 3). This consistency was evident even when using a simple drug sensitivity cut-off (1 µM) across all the drugs tested (Extended Data figure 3). Thus, both categorization approaches showed higher consistency than reported in the earlier study7 (see Supplementary text).
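The simple fixed cut-off variant of this categorization can be sketched as follows. This is not the waterfall procedure itself (which is described in the Supplementary text); it only illustrates calling each line sensitive or resistant at 1 µM and counting agreement. The function names and IC50 values are hypothetical.

```python
def is_sensitive(ic50_um, cutoff_um=1.0):
    """Call a cell line sensitive when its IC50 (µM) is below the cut-off."""
    return ic50_um < cutoff_um


def categorization_concordance(ic50_a, ic50_b, cutoff_um=1.0):
    """Compare sensitive/resistant calls for the same lines in two datasets."""
    calls = [
        (is_sensitive(a, cutoff_um), is_sensitive(b, cutoff_um))
        for a, b in zip(ic50_a, ic50_b)
    ]
    both_sensitive = sum(1 for a, b in calls if a and b)
    both_resistant = sum(1 for a, b in calls if not a and not b)
    discordant = len(calls) - both_sensitive - both_resistant
    return {
        "both_sensitive": both_sensitive,
        "both_resistant": both_resistant,
        "discordant": discordant,
        "agreement": (both_sensitive + both_resistant) / len(calls),
    }


# Hypothetical capped IC50 values (µM) for six shared cell lines.
ccle_ic50 = [0.2, 8.0, 8.0, 3.5, 0.6, 8.0]
gdsc_ic50 = [0.4, 8.0, 6.1, 0.8, 0.9, 8.0]

summary = categorization_concordance(ccle_ic50, gdsc_ic50)
print(summary)
```

Because most lines fall in the insensitive range, raw agreement is inflated by the resistant majority; reporting the counts of jointly sensitive and discordant lines alongside it keeps the comparison informative.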
These results indicated that the CCLE and GDSC cell line pharmacologic screening data are best suited for modeling studies that distinguish rare, drug-sensitive lines from “all others” (e.g., from drug-insensitive lines that are not expected to contribute meaningful molecular or genetic information). Given this, we next considered the extent to which the CCLE and GDSC cell line collections illuminated common genetic or molecular underpinnings of anticancer drug efficacy. Such insights provide one of the most relevant measures for concordance and utility of pharmacologic screening data, given that these efforts are designed to identify such predictors of drug response.
We first conducted an analysis of variance (ANOVA) using only the overlapping lines across the CCLE and GDSC. We considered two models where the predicted variables were IC50 values or activity area (i.e. 1-AUC) scores, respectively. In both models we considered the tissue-of-origin as a covariate and the mutational status of 71 oncogenes as independent variables.
ANOVA identified known genetic biomarkers of sensitivity or resistance as top molecular correlates in at least one dataset for 13/15 compounds, and in both datasets for 8/15 compounds (figure 2a, Extended Data figure 4, Supplementary Data 3). Molecular correlates found in both datasets included NRAS mutation and sensitivity to the MEK inhibitor PD0325901, BRAF mutations and sensitivity to the BRAF inhibitor PLX4720, the BCR-ABL fusion gene and sensitivity to multiple ABL inhibitors (nilotinib, AZD0530), and sensitivity of ERBB2-amplified cells to the ERBB2 inhibitor lapatinib (identified when using IC50 values, Extended Data figure 4). Additionally, drug resistance associations such as TP53 mutations and resistance to nutlin-3 were recovered consistently using activity area scores. When ANOVA was fitted to activity area, 14 drugs for the GDSC and 15 for the CCLE also showed lineage-specific response associations that were consistent across datasets (systematic t-test; Extended Data figure 5 and Supplementary Data 4,7).
In a more comprehensive assessment of the consistency of genomic predictors, we applied a multivariate analysis across 21,013 genomic features encompassing expression, copy number changes and mutations8,9. Elastic net regression was performed using either the full dataset available for each study or only the overlapping datasets. This analysis yielded robust response predictors, and the overlap of predictors was highly significant (Chi-square p < 10⁻⁸, Extended Data figure 6, Supplementary Data 5). Here again, known genomic predictors of drug response emerged as top molecular correlates in at least one dataset for 13/15 compounds; 10/15 compounds showed such correlates in both datasets (Supplementary Data 5), as reported previously by CCLE and GDSC using their individual datasets8,9. For some drugs, extending elastic net regression analyses of IC50 values beyond just the overlapping cell lines identified additional genetic predictors of clinical activity. MDM2 expression and TP53 mutation in the case of nutlin sensitivity provide one example. Moreover, among 4,957 drug-gene associations found using elastic net modeling on each dataset, we observed only one divergent result (0.02%) between the two studies.
To further explore how the two datasets might be leveraged to identify genomic predictors of drug sensitivity, we performed a two-step analysis where predictors were identified using one dataset and their effects were analyzed in the other dataset. Here, we used elastic net regression to identify the genomic features and ridge regression to compare their effect across the datasets (figure 2b and Supplementary text). Additionally, we performed this discovery step either on the overlapping cell lines or on all lines available in the respective studies.
We again observed a high consistency of predictive genomic features identified across the CCLE and GDSC studies, even for drugs where few overlapping cell lines were available. Indeed, >80% of these features were identified with concordant directionality in both studies (figure 2c,d, Extended Data figures 7 and 8, and Supplementary Data 6; features with the same sign). In some instances, no predictors could be identified by the initial elastic net regression. This was often attributable, at least in part, to small numbers of drug-sensitive cell lines, as noted above. On the other hand, some drugs that exhibited low correlations based on the AUC or IC50 analyses nonetheless enabled identification of consistent predictors (e.g., nutlin-3; figure 2d).
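The directionality check can be sketched as a comparison of coefficient signs. The sketch assumes the feature weights have already been estimated (for example, features selected in one dataset and their effects re-estimated in the other); it does not perform the elastic net or ridge fits. The feature names other than BRAF and NRAS, and all weights, are hypothetical.

```python
def sign_concordance(weights_a, weights_b):
    """Fraction of shared, non-zero features whose weights have the same sign.

    Returns None when the two models share no non-zero features.
    """
    shared = [
        name
        for name in weights_a
        if name in weights_b and weights_a[name] != 0 and weights_b[name] != 0
    ]
    if not shared:
        return None
    same_sign = sum(
        1 for name in shared if (weights_a[name] > 0) == (weights_b[name] > 0)
    )
    return same_sign / len(shared)


# Hypothetical weights: discovery dataset vs. effects in the other dataset.
discovery_weights = {
    "BRAF_mut": 0.9,
    "NRAS_mut": 0.4,
    "feature_x": -0.2,
    "feature_y": 0.1,
    "feature_z": -0.3,
}
validation_weights = {
    "BRAF_mut": 0.7,
    "NRAS_mut": 0.2,
    "feature_x": -0.1,
    "feature_y": -0.05,
    "feature_z": -0.4,
}

fraction_same_sign = sign_concordance(discovery_weights, validation_weights)
print(fraction_same_sign)
```

Comparing signs rather than magnitudes is a deliberately coarse criterion: it tolerates differences in assay scale between datasets while still flagging features whose effect reverses.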
Together, these results indicate that the CCLE and GDSC pharmacologic datasets exhibit reasonable predictive power both separately and when taken as a whole. Many of the resulting drug response predictions are well validated by prior knowledge and clinical evidence. In this regard, not only do the two sets of drug screening data exhibit broad convergence, they also provide examples of consilience: a phenomenon in which independent lines of experimental evidence, each with their own inherent limitations, arrive at fundamental scientific agreement.
In summary, when analytical and biological considerations are incorporated that reflect the nature of oncogenic dependency, pharmacologic data from the CCLE and GDSC studies exhibit reasonable consistency. Based on positive Pearson correlations (R > 0.5), we observed agreement across the CCLE and GDSC datasets for the majority (67%) of evaluable compounds (two drugs with clear positive regression slopes showed R values just under 0.5 for the IC50 values; Extended Data figure 1). We acknowledge that the consistency is not perfect: numerous methodological components (e.g., numbers of cell lines seeded per well, drug concentration range examined, number of cell doublings achieved, cell viability assays, analytical tools to calculate sensitivity values, etc.) undoubtedly reduced the statistical correlation of the overlapping pharmacological data. Further standardization of such methodologies will certainly improve correlation metrics, and we welcome efforts in this direction. Nonetheless, both the CCLE and GDSC groups used standard methods for testing drug responses in cell lines, and this analysis confirmed that the consistency of their results seems reasonable in light of the aforementioned methodological differences.
The identification of molecular predictors of drug response remains a major challenge for cancer precision medicine. Accordingly, large-scale screening of clinically-relevant compounds across molecularly annotated cancer cell line collections will likely remain a crucial preclinical source for hypothesis generation. The CCLE8 and GDSC9 datasets, the two biggest public collections of genomic and pharmacologic cell line data, have produced largely concordant results thus far, although rigorous comparisons should continue to be performed as these datasets evolve. Although neither dataset is perfect on its own, they have both shown clear utility for predictive modeling studies and, in several cases, convergence onto known biological principles. Principled analytical frameworks (together with improved standardization) may conceivably illuminate additional areas of consilience through comparative studies of other functional screens (e.g., RNAi, CRISPR, phospho-proteomics, etc.) in the future. In all such instances, knowledge of the underlying biology should guide the implementation of those analytical and statistical methods best suited for comparative studies and, more generally, the extraction of meaning from large-scale screening data in cancer and other disease models.
Acknowledgements
We thank Todd Golub, Eric Lander, Stuart Schreiber, Paul Clemons and Jeff Engelman for helpful discussions. This work was supported by research grants from the Novartis Institutes for BioMedical Research (CCLE) and by grants from the Wellcome Trust (086357) and the National Institutes of Health (1U54HG006097-01) (GDSC).
Footnotes
Author Contributions
NS, LAG, AA, DAH, CHB, SR, JL, JB, GC, RS, WRS, FS, MPM, FI, MM, JS, MRS, UM and MJG conceived the studies; NS, MG, GVK, AA, IP, JL, ML, DS, AK, KV, EJE, MPM, FI, and MM performed analyses; NS, MG, AA, ML, MR, FI, and MM wrote/tested the R code; and NS, MG, AA, LAG, CHB, JL, MPM, and JS wrote the paper.
References
- 1. Sharma SV, Haber DA, Settleman J. Cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. Nat Rev Cancer. 2010;10:241–253. doi: 10.1038/nrc2820.
- 2. Neve RM, et al. A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell. 2006;10:515–527. doi: 10.1016/j.ccr.2006.10.008.
- 3. Caponigro G, Sellers WR. Advances in the preclinical testing of cancer therapeutic hypotheses. Nat Rev Drug Discov. 2011;10:179–187. doi: 10.1038/nrd3385.
- 4. Garraway LA, et al. Integrative genomic analyses identify MITF as a lineage survival oncogene amplified in malignant melanoma. Nature. 2005;436:117–122. doi: 10.1038/nature03664.
- 5. Solit DB, et al. BRAF mutation predicts sensitivity to MEK inhibition. Nature. 2006;439:358–362. doi: 10.1038/nature04304.
- 6. Sos ML, et al. Predicting drug susceptibility of non-small cell lung cancers based on genetic lesions. J Clin Invest. 2009;119:1727–1740. doi: 10.1172/JCI37127.
- 7. Haibe-Kains B, et al. Inconsistency in large pharmacogenomic studies. Nature. 2013;504:389–393. doi: 10.1038/nature12831.
- 8. Barretina J, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483:603–607. doi: 10.1038/nature11003.
- 9. Garnett MJ, et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature. 2012;483:570–575. doi: 10.1038/nature11005.
- 10. Papillon-Cavanagh S, et al. Comparison and validation of genomic predictors for anticancer drug sensitivity. J Am Med Inform Assoc. 2013;20:597–602. doi: 10.1136/amiajnl-2012-001442.