Biostatistics (Oxford, England). 2021 Jan 7;23(2):643–665. doi: 10.1093/biostatistics/kxaa047

Dose–response modeling in high-throughput cancer drug screenings: an end-to-end approach

Wesley Tansey *, Kathy Li 2, Haoran Zhang 3, Scott W Linderman 4, Raul Rabadan 5, David M Blei 6, Chris H Wiggins 7
PMCID: PMC9007438  PMID: 33417699

Summary

Personalized cancer treatments based on the molecular profile of a patient’s tumor are an emerging and exciting class of treatments in oncology. As genomic tumor profiling is becoming more common, targeted treatments for specific molecular alterations are gaining traction. To discover new potential therapeutics that may apply to broad classes of tumors matching some molecular pattern, experimentalists and pharmacologists rely on high-throughput, in vitro screens of many compounds against many different cell lines. We propose a hierarchical Bayesian model of how cancer cell lines respond to drugs in these experiments and develop a method for fitting the model to real-world high-throughput screening data. Through a case study, the model is shown to capture nontrivial associations between molecular features and drug response, such as requiring both wild type TP53 and overexpression of MDM2 to be sensitive to Nutlin-3(a). In quantitative benchmarks, the model outperforms a standard approach in biology, with Inline graphic lower predictive error on held out data. When combined with a conditional randomization testing procedure, the model discovers markers of therapeutic response that recapitulate known biology and suggest new avenues for investigation. All code for the article is publicly available at https://github.com/tansey/deep-dose-response.

Keywords: Deep learning, Dose–response modeling, Drug discovery, Empirical Bayes, High-throughput screening, Personalized medicine

1. Introduction

1.1. High-throughput cancer drug screening

Genomic sequencing and high-throughput drug screening are becoming cheaper, enabling widespread adoption in both research institutions and hospitals (Muir and others, 2016). Many of these institutions have built data sets that enable scientists to explore potential connections between the molecular profile of a cell (i.e., its DNA and other related biological information) and its phenotypic response to treatment with a certain drug. Specifically in cancer therapeutics, large public data sets have become available with thousands of experiments testing different drugs on different types of cancer cell lines (e.g., Yang and others, 2012; Barretina and others, 2012; Iorio and others, 2016; Haverty and others, 2016). One goal in analyzing these data sets is to build a predictive model of drug response. The model takes in a set of molecular features of a cell line and predicts the expected outcome of using a candidate drug to treat it. The more accurate the predictor, the more it can be trusted to faithfully simulate wet lab results. If a good predictor can be constructed, it can accelerate the discovery of targeted therapies by refining experimentalists’ hypotheses much faster than can be done in the lab. Thus, good predictors have high value to biologists.

We build a generative model of drug response in high-throughput cancer drug screens. The model captures uncertainty inherent in cell line experiments, including measurement error, natural variation in cell growth, and drug response heterogeneity. We take an empirical Bayesian approach to estimating the model parameters and also propose a method for detecting contaminated data. We use the model in a case study analyzing a data set (Iorio and others, 2016) containing hundreds of thousands of experiments. In our study, the model outperforms a state-of-the-art approach for estimating dose–response curves (Vis and others, 2016). Model predictions also recapitulate known biology involving nonlinear interactions between molecular features and drugs. Approximate dose–response posteriors, in combination with conditional randomization tests (Candes and others, 2018), reveal both known and potentially novel markers of drug response.

The experimental design used in our case study is similar to that of many pharmacogenomic profiling experiments. Consequently, we expect the proposed model and the insights from our case study to be applicable to other high-throughput cancer drug screening studies. Some studies (Haibe-Kains and others, 2013; Safikhani and others, 2016) have shown that results may not transfer between different high-throughput studies, implying that the specific predictions we infer may not perform well on other data sets. Nevertheless, the modeling strategy and the techniques we propose for dealing with data-specific challenges in the Genomics of Drug Sensitivity in Cancer (GDSC) data set can help guide and inform modeling of similar data sets. In the conclusions of the article, we summarize these insights for practitioners analyzing other data sets, designing new experiments in the lab, or guiding data-gathering policy at hospitals.

1.2. Experimental setup and details

Our data come from the Genomics of Drug Sensitivity in Cancer (GDSC) (Yang and others, 2012; Garnett and others, 2012; Iorio and others, 2016), a high-throughput screening (HTS) study on therapeutic response in cancer cell lines. The GDSC data comprise the results of testing Inline graphic cancer cell lines nearly combinatorially against Inline graphic cancer therapeutic drugs, in vitro. The experiments were conducted across two separate testing sites: the Wellcome Trust Sanger Institute (“Site 1”) and Massachusetts General Hospital (“Site 2”). Experiments were carried out over the course of multiple years and used one of two assay types, depending on the cell type: suspended (“Assay S”) or adherent (“Assay A”). For a given site and assay type, each experiment was carried out using a series of HTS microwell plates; the layout for each plate is shown in Figure 1 in the Supplementary material available at Biostatistics online. A given plate contains hundreds of microwells, where each well is designated as either a negative control, positive control, or treatment. Negative control wells are left unpopulated and untreated, so as to calibrate the base level of machine output if all cells were to die. All other wells are populated with a constant volume of cells; due to natural variation in cell volume, exactly how many cells occupy any given well is unknown. Positive control wells are left untreated so as to measure the base response for drugs that have no effect. Finally, a series of drugs are applied to the remaining wells, with each well receiving a different concentration. A single plate is used to test only one specific cell line; all wells in a plate are filled with the specific assay fluid appropriate for the target cell line. Drug concentration ranges were derived based on previous experiments and were delivered at either 5 or 9 dose levels, depending on the testing site. 
Both testing sites start at the same maximum dose level, which is diluted at a 4-fold rate for the 5 dose schedule and at a 2-fold rate for the 9 dose schedule, resulting in the same final minimum dose. Since the sites use the same minimum and maximum dose, this dilution procedure yields missing data for odd-numbered dose levels in the data set. Once treated, cells are left to either grow or die for 72 h.
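The relationship between the two dilution schedules can be checked with a few lines of arithmetic. The sketch below is illustrative only: the maximum concentration of 10.0 (arbitrary units) is a hypothetical value, and indexing the dose levels from 0 (minimum) to 8 (maximum) is an assumption, chosen to match the column labels of Table 1.

```python
def dilution_series(max_dose, fold, n_levels):
    """Concentrations from highest to lowest under a constant fold dilution."""
    return [max_dose / fold**k for k in range(n_levels)]


MAX_DOSE = 10.0  # hypothetical value, for illustration only

nine_dose = dilution_series(MAX_DOSE, fold=2, n_levels=9)
five_dose = dilution_series(MAX_DOSE, fold=4, n_levels=5)

# Both schedules end at the same minimum dose because 2**8 == 4**4 == 256.
same_minimum = nine_dose[-1] == five_dose[-1]

# Index the 9-dose grid from 0 (minimum dose) to 8 (maximum dose).
levels = list(reversed(nine_dose))
observed = [i for i, dose in enumerate(levels) if dose in five_dose]
missing = [i for i in range(9) if i not in observed]
```

Under this indexing, the 5-dose schedule covers levels 0, 2, 4, 6, and 8 of the 9-dose grid, so the odd-numbered levels are the ones with no measurement.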

Cell population size is approximated by a fluorescence assay. A fluorescent compound is added to the wells that is capable of penetrating cells and binding to a protein kept at relatively constant levels in all living cells. When a cell dies, its structure breaks down and the target protein denatures, leaving no binding agent and resulting in no fluorescence. Robotically controlled cameras photograph each well, and the total luminescence of the image (i.e., pixel intensity count) is used as a relative measure of cell population size. Luminescence of positive and negative control wells is used to calibrate luminescence of treatment wells.

Positive control wells provide an estimate for how much light a well with all living cells will emit, while negative control wells provide an estimate for how much light a well with no living cells will emit. We refer to this negative control measure as the baseline fluorescence bias, since it represents the fluorescence of an empty well. This baseline measure is also subject to machine or technical error from the specific equipment being used. Positive controls are subject to natural biological variation from the particular cell line being used. Population luminescence after treatment (relative to the positive and negative controls) is the quantity of interest for each treatment microwell. That is, we seek to estimate the survival rate of the population after treatment: populations close to the negative control luminescence will have mostly died; populations close to the positive control luminescence will have mostly survived; in-between levels indicate some partial survival.

We aim to build a model that treats the molecular covariates as features that potentially convey predictive information about sensitivity and resistance to different therapies. For Inline graphic of the cell lines, we have molecular information about gene mutations, copy number variations (CNVs), and gene expression. We preprocessed the mutations and CNVs to filter down to genes that are recurrently observed as altered in large-scale observational cancer studies. This preprocessing is necessary, as adjacent CNVs are highly correlated and many nucleotide mutations occur in genes that are known to have no impact on functionality. Biological knowledge from pre-existing databases is leveraged here to filter genes down to a more informative set (see Supplementary material available at Biostatistics online for details). After preprocessing, we are left with Inline graphic binary gene mutations and Inline graphic copy number counts; we keep all Inline graphic gene expression covariates. For a handful of cell lines (Inline graphic), no molecular information is available; we treat these as missing data. Almost all drugs have been screened against all cell lines, yielding a data set of Inline graphic (cell line, drug) experiments. We treat all missing experiments and all missing molecular features as missing at random.

1.3. Related work

Several methods for dose-response modeling have been proposed in the recent literature. Low-Kam and others (2015) proposed a Bayesian regression tree model for estimating toxicity at different levels of certain nanoparticles. Wheeler (2019) modeled dose–response with molecular descriptors via additive Bayesian splines. Both of these methods, while capable of working from raw data, do not scale to the massive size of our data, where we have tens of thousands of features. Several machine learning approaches to dose–response prediction in cancer experiments have also been proposed (e.g., Menden and others, 2013; Ammad-ud din and others, 2016; Rampášek, 2019). These models seek to predict a single summary statistic (usually the area under the curve or the IC50) of a parametric model fit to the more complicated raw data. Predicting this summary statistic treats the preprocessed model as a ground truth label, making it difficult to know whether improved predictive performance is because the model has learned something about the actual response of the cancer cells, or simply has “fit the fit” by better matching the predictive model to the parametric model. Here, we seek to work end-to-end, predicting directly from the raw data. This prevents us from fairly comparing against most past machine learning approaches that require a clearly defined covariate and scalar response setup. The upside is that modeling the entire experiment enables us to account for measurement error, an important source of noise in scientific data (Loken and Gelman, 2017) that has historically been overlooked in cancer screens. Substantial measurement error (Section 2.3) and outcome uncertainty (Section 2.4) are present in our data set; Section 3 shows the benefits of the end-to-end approach, as well as a comparison to the state-of-the-art approach in computational biology (Vis and others, 2016).

2. Bayesian dose-response model

2.1. Generative model

Consider Inline graphic cell lines and Inline graphic drugs. Each cell line Inline graphic is tested against each drug Inline graphic, at dose levels Inline graphic. The study consists of plates Inline graphic; we denote by Inline graphic the plate on which a specific Inline graphic pair was tested. The result of a plate experiment is fluorescence count data Inline graphic where Inline graphic are the Inline graphic negative control measurements used to estimate the baseline fluorescence bias, Inline graphic are the Inline graphic positive control measurements used to estimate the fluorescence of a population of cells when no effective treatment has been applied, and Inline graphic are the treatment well measurements for each of the drugs on plate Inline graphic. Since each Inline graphic pair is tested only on a single plate, we index treatments by their cell line, drug, and dose levels, respectively; thus, Inline graphic is a 3-tensor and Inline graphic represents the result of treating cell line Inline graphic with drug Inline graphic at dose level Inline graphic. Each cell line Inline graphic has an associated vector of mutation, copy number variation, and expression covariates Inline graphic. We propose a generative model of dose–response,

graphic file with name Equation1.gif (2.1)

The plate-specific negative and positive control fluorescent counts, Inline graphic and Inline graphic, are modeled as random variables for which we receive Inline graphic and Inline graphic i.i.d. observations, respectively. The positive control fluorescence distribution is a function of the baseline fluorescence from the machine ( Inline graphic) and the growth rate ( Inline graphic) of the cell line on plate Inline graphic. The outcome of experiment Inline graphic is a sample from the positive control distribution, but with a treatment effect Inline graphic that reduces the growth rate of the cells.

The treatment effect Inline graphic has the direct interpretation as drug Inline graphic killing Inline graphic of cell line Inline graphic on average when applied at dose level Inline graphic. The vector of responses Inline graphic is the dose-response curve—that is, it represents the percentage of cells that survive at different dose levels; this is the primary quantity of interest in every experiment. Discovering effective drugs corresponds to finding curves that show sensitivity in some subset of cell lines, indicating the therapy is likely targeting some molecular property of the cells.

The dose–response curve is modeled through a latent constrained multivariate normal (MVN). The logistic transform from the MVN variable Inline graphic to the effect Inline graphic constrains effects to be in the Inline graphic interval, corresponding to expected cell survival percentage. The MVN is constrained to the monotone-increasing half-space (i.e., support only on values for which Inline graphic), encoding that the drug effect can only become stronger as the dose increases. The mean response is given by a function Inline graphic of the molecular features for the target cell line and the ID of the drug to be applied; monotonicity is imposed across dose levels, not across features (Section 2.5).

The model for Inline graphic encodes several scientific assumptions. We assume no drug has a positive effect on any cell—that is, no drug actually encourages growth. There are two biological motivations for this assumption. First, cancer cells are generally defined by uncontrolled proliferation. Cultivating non-cancerous cells in vitro is extremely difficult as most cells induce apoptosis (cell suicide) outside of their host environment. There is little room left biologically for a drug to encourage the cancer cell lines to grow even more. Second, the drugs chosen for the GDSC experiment have all been selected for their ability to stress and kill cells. A large portion of the drugs are established cancer therapeutics that are designed to kill cells by targeting pathways recurrently found altered in certain cancers. In addition to assuming all drugs do not encourage growth, we also assume that toxicity only increases with drug concentration. This assumption is again based on the notion of cancer drugs being highly toxic. We check these assumptions empirically in the Supplementary material available on Dryad.

2.2. Overview of empirical Bayes inference approach

In practice, simultaneous estimation of all parameters in (2.1) is computationally infeasible. Furthermore, we discovered contamination in the GDSC data that must be handled carefully (see the Supplementary material available at Biostatistics online for details). Instead, the prior parameter estimates Inline graphic are obtained through a stepwise procedure. At a high level, there are four main steps:

  1. Baseline fluorescence bias rate Inline graphic is estimated from negative controls. These controls contain systematic biases due to technical error; we detail a denoising approach for negative controls in Section 2.3.

  2. Positive control priors Inline graphic and Inline graphic are estimated by maximum likelihood for each plate in Section 2.4.

  3. The black box predictive model Inline graphic is chosen to be a neural network; parameters Inline graphic are estimated in Section 2.5 by maximizing the marginal log likelihood of the data with fixed control priors and an identity covariance matrix.

  4. Drug-specific dose–response correlation structure Inline graphic is estimated in Section 2.6 through cross-validation and maximum likelihood with fixed prior means Inline graphic.

The final model enables empirical Bayes posterior inference on Inline graphic, the dose–response curve. The remainder of this section details each of the four estimation steps.

2.3. Batch effects

An assumption in high-throughput cancer drug analyses is that each individual well is independent of the other wells on the same plate, and each plate is independent of subsequent plates. There is growing concern about both assumptions among biologists. Recent studies suggest there is substantial spatial bias in microwell assays (Lachmann and others, 2016; Mazoure and others, 2017), inducing dependence among individual wells on the same plate. We generally label this “cross-contamination” since microwell results are spilling over into neighboring wells. This cross-contamination leads us to throw out several negative control wells in each plate; see the Supplementary material available at Biostatistics online for details on removing cross-contaminated wells. After decontamination, some plates are left with as few as five negative control wells.

The second assumption is that each plate of observations is independent, regardless of when or where it was tested; as we show, this assumption is also violated for the GDSC data. Figure 1 (left) shows one of the four (site, assay) stratifications, with negative control medians for each plate in gray. The x-axis in each subplot is the date ID of the plate (the date when the plate was screened); all dates are relative to the first day of the study. There is clear visual evidence of temporal dependence, with trends followed by sharp discontinuities.

Fig. 1.

Left panel: Time-evolving estimate of the negative control density for an example testing site and assay type. The x-axis is the date a plate was screened, relative to the start of the study; y-axis is the median control well measurements after removing contaminated wells. The trend filtering density regression fit is in orange (line: mean, bands: Inline graphic regions). Right panel: Maximum likelihood estimate of an example positive control density, after fitting the negative control MAP estimate.

This temporal dependency is well-established in the HTS literature (Johnson and others, 2007; Leek and others, 2010) and falls broadly under the term “batch effects.” These are technical artifacts that cause otherwise-independent experiments to yield dependent results. Myriad causes can introduce temporal batch effects: using the same test site, the same equipment, preparation by the same technician, or even conducting the experiment at the same ambient temperature. Generally batch effects are seen as a nuisance that reduces the signal in experimental data and must be removed as a preprocessing step, if possible.

We instead use the temporal batch effects to our advantage when estimating the negative controls. We leverage the dependence between plates through an empirical Bayes procedure that shrinks the differences between median estimates of negative control wells for plates screened on similar days. We use the relevant portion of the generative model for a single negative control well,

graphic file with name Equation2.gif (2.2)

where Inline graphic denotes the specific day Inline graphic that plate Inline graphic was screened. We use the median observation Inline graphic as a noisy approximation to the true rate. We then fit a time-evolving density to model the prior distribution of the medians on each day,

graphic file with name Equation3.gif (2.3)

where Inline graphic is the Inline graphic-order trend filtering matrix (Tibshirani, 2014) using the falling factorial basis (Wang and others, 2014) to handle the irregular grid of days. The solution to (2.3) finds densities that are piecewise-linear in their mean and piecewise constant in their variance.
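To make the penalty structure concrete, the sketch below builds discrete difference operators of the kind used in trend filtering. It is a simplified illustration that assumes evenly spaced time points; the falling factorial basis used here for the irregular grid of days is not reproduced, and the function names are hypothetical.

```python
def difference_matrix(n, order):
    """Discrete difference operator of order `order + 1` on n evenly spaced points.

    order=0 gives first differences (penalizing them favors piecewise-constant fits);
    order=1 gives second differences (favoring piecewise-linear fits).
    """
    rows = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    for _ in range(order + 1):
        # Each pass replaces the rows by differences of consecutive rows.
        rows = [[b - a for a, b in zip(rows[i], rows[i + 1])]
                for i in range(len(rows) - 1)]
    return rows


def l1_penalty(matrix, x):
    """L1 norm of matrix @ x, the quantity a trend filtering penalty scales."""
    return sum(abs(sum(m * v for m, v in zip(row, x))) for row in matrix)


linear = [2.0 * t + 1.0 for t in range(6)]   # a straight line
step = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]        # a single jump

linear_cost = l1_penalty(difference_matrix(6, 1), linear)  # no curvature to penalize
step_cost = l1_penalty(difference_matrix(6, 0), step)      # pays only for the jump
```

A straight line incurs no second-difference penalty, and a step function pays a first-difference penalty only at its jump, which is why the two penalties favor piecewise-linear means and piecewise-constant variances, respectively.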

The regularization parameters Inline graphic and Inline graphic are chosen via 5-fold cross-validation and the model is fit with stochastic gradient descent. Since the counts are large, we use a normal approximation to the gamma, parameterized in natural parameter space for computational convenience; shape and rate parameters are reconstructed from the mean and variance of the normal. Since (2.3) is non-convex, we only find a local optimum, but empirically we observe good fits across a wide array of simulated data. The orange line and bands in the left panel of Figure 1 show the results for one stratification, where the outliers have been shrunk substantially.

Given the learned prior Inline graphic from (2.3), we calculate the maximum a posteriori (MAP) estimate for the negative control mean,

graphic file with name Equation4.gif (2.4)

While there is technical variation in the bias rate, it is relatively small after corrections and thus as a practical matter we simply use the MAP estimate for the Poisson rate of the negative controls. Using the MAP estimate with a fixed Inline graphic simplifies downstream inference by removing the need to numerically integrate out the negative control.
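The shrinkage behavior of a MAP estimate of this kind can be illustrated with the exact gamma-Poisson conjugate update. This is a sketch under stated assumptions, not the expression in (2.4): it assumes a gamma prior with known shape and rate on the Poisson rate, uses the exact gamma rather than the normal approximation described above, and all numbers are made up.

```python
def gamma_from_mean_variance(mean, variance):
    """Shape and rate of a gamma distribution with the given mean and variance."""
    rate = mean / variance
    shape = mean * rate
    return shape, rate


def map_poisson_rate(counts, prior_shape, prior_rate):
    """Posterior mode of a Poisson rate under a Gamma(shape, rate) prior."""
    post_shape = prior_shape + sum(counts)
    post_rate = prior_rate + len(counts)
    if post_shape <= 1.0:
        return 0.0  # the posterior density is maximized at the boundary
    return (post_shape - 1.0) / post_rate


# Hypothetical prior for the screening day: mean 1000, standard deviation 20.
shape, rate = gamma_from_mean_variance(1000.0, 400.0)

# Hypothetical plate with five usable negative control wells (sample mean 1100).
controls = [1100, 1080, 1120, 1090, 1110]
bias_map = map_poisson_rate(controls, shape, rate)
```

The estimate lands between the prior mean for the day and the plate's own sample mean, which is the shrinkage effect the time-evolving prior is meant to provide for plates with few clean control wells.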

2.4. Natural variation in cell line growth

Even under ideal conditions without any batch effects or spatial plate bias, cell population growth and response exhibit a large degree of natural variation. For instance, Figure 1 (right) shows the distribution of the positive control wells for one example plate. The variance in this cell line is so large that a decrease in fluorescence of even Inline graphic compared to the mean would not be especially surprising. Furthermore, population growth variance varies substantially between cell lines, testing sites, and assay types.

Since we have a reasonably large number of positive control wells on each plate, we estimate the population directly,

graphic file with name Equation5.gif (2.5)

The integral in (2.5) can be resolved analytically to be an incomplete gamma; however, we found a finite grid approximation to be more numerically stable. The maximum likelihood problem is also nonconvex, so we again rely on a local optimum found via a sequential least squares solver. We found the fits in simulation to be close to the ground truth when using the same number of control replicates as in the GDSC data. The orange line in the right panel of Figure 1 shows the resulting fit from optimizing (2.5) on the example plate.
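A finite grid approximation of this kind might look like the following sketch. It assumes, based on the description of the generative model, that a positive control count is Poisson with rate equal to the baseline bias plus a gamma-distributed growth rate; the exact form of (2.5) is not reproduced here, and the function names and numbers are hypothetical. With zero bias the integral has a negative binomial closed form, which is used only to check the grid.

```python
import math


def log_poisson(count, rate):
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def log_gamma_density(x, shape, rate):
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)


def log_marginal(count, bias, shape, rate, n_grid=2000, width=8.0):
    """log of the integral of Pois(count | bias + lam) * Gamma(lam | shape, rate),
    approximated by a midpoint rule on a finite grid around the prior mean."""
    mean, sd = shape / rate, math.sqrt(shape) / rate
    lo, hi = max(0.0, mean - width * sd), mean + width * sd
    step = (hi - lo) / n_grid
    terms = []
    for k in range(n_grid):
        lam = lo + (k + 0.5) * step
        terms.append(log_poisson(count, bias + lam)
                     + log_gamma_density(lam, shape, rate))
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms)) + math.log(step)


def plate_log_likelihood(counts, bias, shape, rate):
    """Objective to maximize over (shape, rate) for one plate's positive controls."""
    return sum(log_marginal(c, bias, shape, rate) for c in counts)


def log_negative_binomial(count, shape, rate):
    """Closed form of the same integral when bias == 0, used only as a check."""
    return (math.lgamma(count + shape) - math.lgamma(count + 1) - math.lgamma(shape)
            + shape * math.log(rate / (rate + 1.0)) - count * math.log(rate + 1.0))


approx = log_marginal(100, bias=0.0, shape=50.0, rate=0.5)
exact = log_negative_binomial(100, shape=50.0, rate=0.5)
```

In a full fit, `plate_log_likelihood` would be handed to a numerical optimizer with the bias fixed at its MAP estimate; that step is omitted here.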

2.5. Fitting the black box response prior

We use a deep neural network for Inline graphic, parameterized by weights Inline graphic. The molecular feature Inline graphic is passed through a Inline graphic feed forward network. The outputs Inline graphic for each of the Inline graphic drugs and Inline graphic doses are then constrained to be monotone in Inline graphic,

graphic file with name Equation6.gif (2.6)

where the right-hand side of (2.6) is the cumulative sum of the softplus operator applied to the raw outputs from dose Inline graphic upward, allowing the maximum dose level to set the offset. We fix Inline graphic to the identity matrix; we found the results did not improve by using off-diagonal terms.
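One way to read this construction is sketched below. Because (2.6) is not reproduced in this text, the details are assumptions: doses are ordered from minimum to maximum, the raw output at the maximum dose is used unchanged as the offset, and each lower dose adds a nonnegative softplus increment, so the resulting survival logits never increase with dose.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logistic(x):
    """Stable logistic function, written with tanh to avoid overflow."""
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def monotone_means(raw):
    """Map unconstrained per-dose outputs to a monotone curve.

    raw is ordered from minimum to maximum dose (an assumption). The last entry
    sets the offset; every lower dose adds a softplus increment to the next dose.
    """
    means = [0.0] * len(raw)
    means[-1] = raw[-1]
    for t in range(len(raw) - 2, -1, -1):
        means[t] = means[t + 1] + softplus(raw[t])
    return means


# Hypothetical raw network outputs for one (cell line, drug) pair at nine doses.
raw_outputs = [0.3, -1.2, 2.0, 0.1, -0.5, 1.5, -2.0, 0.7, -1.0]
means = monotone_means(raw_outputs)
survival = [logistic(m) for m in means]
```

Whatever values the network emits, the transformed curve is monotone by construction, so the optimizer can work in an unconstrained space.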

We optimize Inline graphic by maximizing the log-likelihood of the data,

graphic file with name Equation7.gif (2.7)

where

graphic file with name Equation8.gif

and Inline graphic is the logistic function. For Inline graphic, we plug in the identity matrix as a placeholder and re-estimate it in Section 2.6. We approximate the inner integral in (2.7) with a numeric grid over the values of Inline graphic.

We split the data into Inline graphic cross-validation folds. For each fold, we use Inline graphic of the in-sample data as training and Inline graphic as validation for early-stopping. We optimize (2.7) using RMSprop (Tieleman and Hinton, 2012) for Inline graphic epochs with Inline graphic samples per mini-batch. We check empirical risk on the validation set after every epoch and keep the best model over the entire run. The final model predictions on the out-sample (test) fold are then used for evaluation of the prior in Section 3.

2.6. Estimating dose–response covariance and posterior dose-response curves

After fitting Inline graphic, we reuse the validation set to estimate the covariance matrix for each drug. Fixing the predicted means for each test set, we estimate the drug dose–response covariance as the maximum likelihood estimator,

graphic file with name Equation9.gif (2.8)

where Inline graphic is the empirical estimate of cell survival (i.e., without prior information from the features or any constraint on the curve shape).

The prior estimates are then fixed and the posterior distribution of Inline graphic, the latent dose-response rate for each experiment, is estimated via Markov chain Monte Carlo (MCMC).

We evaluated two MCMC methods: (1) a fully conjugate Gibbs sampler implemented via Polya-Gamma augmentation (Polson and others, 2013), where the sampled posterior MVN logits are projected to be monotone using the pool adjacent violators algorithm as in Lin and Dunson (2014), and (2) rejection sampling with an elliptical slice sampling (Murray and others, 2010) proposal. We found the Gibbs sampler to have high sample complexity due to the need to sample both Inline graphic and Inline graphic, where the value of one tightly constrains the distribution over the other. The elliptical slice sampler, by contrast, approximated the posterior better with fewer samples and only required a single MCMC chain. We ran the elliptical slice sampler for Inline graphic iterations with the first Inline graphic iterations discarded as burn-in samples.
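For readers unfamiliar with elliptical slice sampling, a minimal sketch follows. It is not the authors' implementation: it assumes an identity prior covariance, treats the control parameters as fixed constants, enforces monotonicity by assigning zero likelihood to curves that violate it, and uses made-up counts and iteration numbers.

```python
import math
import random


def is_monotone(beta):
    """Assumed constraint: survival logits do not increase with dose."""
    return all(a >= b for a, b in zip(beta, beta[1:]))


def log_likelihood(beta, counts, bias, growth):
    """Poisson log-likelihood of treatment counts; -inf outside the constraint."""
    if not is_monotone(beta):
        return -math.inf
    total = 0.0
    for b, y in zip(beta, counts):
        survival = 0.5 * (1.0 + math.tanh(0.5 * b))  # logistic(b)
        rate = bias + growth * survival
        total += y * math.log(rate) - rate
    return total


def elliptical_slice_step(beta, prior_mean, loglik, rng):
    """One update under a N(prior_mean, I) prior (Murray and others, 2010)."""
    f = [b - m for b, m in zip(beta, prior_mean)]
    nu = [rng.gauss(0.0, 1.0) for _ in f]
    threshold = loglik(beta) + math.log(1.0 - rng.random())
    theta = rng.uniform(0.0, 2.0 * math.pi)
    lo, hi = theta - 2.0 * math.pi, theta
    while True:
        proposal = [m + fi * math.cos(theta) + ni * math.sin(theta)
                    for m, fi, ni in zip(prior_mean, f, nu)]
        if loglik(proposal) >= threshold:
            return proposal
        # Shrink the bracket toward theta = 0, which recovers the current state.
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)


rng = random.Random(0)
counts = [980, 940, 700, 420, 260]        # made-up counts, lowest to highest dose
prior_mean = [2.0, 1.0, 0.0, -1.0, -2.0]  # made-up prior mean, already monotone


def loglik(beta):
    return log_likelihood(beta, counts, bias=200.0, growth=800.0)


beta = list(prior_mean)  # feasible starting point
samples = []
for iteration in range(600):
    beta = elliptical_slice_step(beta, prior_mean, loglik, rng)
    if iteration >= 100:  # discard burn-in
        samples.append(beta)
```

Every retained sample satisfies the monotonicity constraint, since infeasible proposals can never clear the likelihood threshold. The update has no step size to tune, which is part of its practical appeal.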

3. Model comparison and evaluation

3.1. Baseline approach

We compare the proposed Bayesian approach to the existing dose–response curve fitting pipeline (Vis and others, 2016) used for the GDSC data set. The pipeline is similar to those used in other high-throughput cancer drug screening analyses (e.g., Barretina and others, 2012). Unlike in the Bayesian approach, the baseline method does not account for measurement error, model natural variation, or quantify its uncertainty.

First, negative and positive controls are averaged to obtain Inline graphic and Inline graphic. These are point estimates of baseline fluorescence when all cells die or survive, respectively, on plate Inline graphic. The control point estimates are then used to calculate the expected percentage of cells that survived each treatment experiment,

graphic file with name Equation10.gif (3.9)
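As an illustration of this kind of control normalization, the sketch below interpolates linearly between the two control means. The exact expression in (3.9) is not reproduced in this text, so the formula is an assumption based on the description, and the counts are made up.

```python
from statistics import mean


def survival_estimate(treatment_count, negative_controls, positive_controls):
    """Fraction of cells surviving, by linear interpolation between control means."""
    dead = mean(negative_controls)    # fluorescence when no cells are alive
    alive = mean(positive_controls)   # fluorescence when cells are untreated
    return (treatment_count - dead) / (alive - dead)


negative = [100, 110, 90]
positive = [1000, 1100, 900]

halfway = survival_estimate(550, negative, positive)
overshoot = survival_estimate(1050, negative, positive)
```

A noisy treatment well can yield an estimate above 1 (or below 0), as `overshoot` shows. A point estimate of this form has no mechanism to rule that out, unlike the logistic-transformed effect in (2.1).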

The Inline graphic estimates are then treated as observations of the percentage of cells surviving. A logistic curve is fit for every Inline graphic pair using a multilevel mixed effects model,

graphic file with name Equation11.gif (3.10)

where Inline graphic and Inline graphic enable the model to share statistical strength between drugs tested on the same cell line; Inline graphic and Inline graphic are offsets with no prior specified; priors for the covariance Inline graphic and variance Inline graphic are likewise not specified. The model is fit by maximum likelihood estimation in the R package nlme. Curves are summarized by integrating out the dose parameter Inline graphic to obtain a summary statistic, such as the estimated concentration required to kill Inline graphic of cells (IC50). Log-IC50 values are used as targets for predictive modeling of molecular features. For each drug, an elastic net (Zou and Hastie, 2005) model is fit with hyperparameters chosen through cross-validation.

3.2. Performance comparison

We compare the Bayesian model in (2.1) to the above baseline approach. To compare the two methods, we consider the task of imputing a missing dose given observations of the other dose levels in the experiment. This checks the ability of the model to capture the shape of the dose-response curve. For each (cell line, drug) experiment, we hold out one dose level at random and treat it as missing data that must be imputed. Error is measured in terms of variance-adjusted raw count values on the held out data,

graphic file with name Equation12.gif (3.11)

where Inline graphic is the model prediction and Inline graphic is the standard deviation of the raw positive controls. For the baseline, raw predictions are backed out from (3.9) and (3.10),

graphic file with name Equation13.gif (3.12)
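The sketch below shows one plausible reading of how a raw count can be backed out of a survival estimate and scored. Equations (3.11) and (3.12) are not reproduced in this text, so the inversion of the control normalization, the squared-error form, and the use of the sample standard deviation are all assumptions; the counts are made up.

```python
from statistics import mean, stdev


def raw_prediction(survival, negative_controls, positive_controls):
    """Invert the control normalization to get a predicted raw count."""
    dead, alive = mean(negative_controls), mean(positive_controls)
    return dead + survival * (alive - dead)


def variance_adjusted_error(observed, predicted, positive_controls):
    """Squared error in units of the positive-control standard deviation."""
    scale = stdev(positive_controls)
    return ((observed - predicted) / scale) ** 2


negative = [100, 110, 90]
positive = [1000, 1100, 900]

predicted = raw_prediction(0.5, negative, positive)
error = variance_adjusted_error(600, predicted, positive)
```

Scaling by the positive-control spread keeps plates with naturally noisy cell lines from dominating the average error.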

For the Bayesian model, we use a MAP estimate of the raw count,

graphic file with name Equation14.gif (3.13)

where Inline graphic is the posterior mean estimate of Inline graphic in (2.1). We also consider a hybrid version of the baseline that uses only the control correction technique. For this method, we replace the control means with the approximate Bayes estimates,

graphic file with name Equation15.gif (3.14)

In the hybrid model, the controls are used in two stages: first, the raw controls are used by the baseline method to estimate Inline graphic; then the corrected controls are used to estimate Inline graphic.

Table 1 presents the results for this benchmark. Using the corrected controls in the hybrid model improves the predictions of the baseline model slightly, suggesting the correction procedures from Sections 2.3 and 2.4 are useful independent of the Bayesian model. However, the Bayesian model outperforms the predictions from the corrected baseline model by Inline graphic. We also investigated a featureless version of the Bayesian model and found similar improvements over the baseline, suggesting the improvements are due to the flexibility of the constrained multivariate normal prior. The log-linear constraint of the baseline model imposes a strong assumption that the true dose response curve has a sigmoidal shape. By contrast, the Bayesian model only assumes monotonicity and learns the shape priors from the data.

Table 1.

Mean squared error results on the single-dose imputation benchmark relative to the baseline model from Vis and others (2016). The baseline is slightly improved by using corrected controls (Hybrid), but overall is not flexible enough to fully model the observed dose-response curves; the Bayesian model outperforms both pipelined approaches.

           (min)                 Dose level                 (max)
Model      0     1     2     3     4     5     6     7     8     All
Baseline   1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
Hybrid     1.04  1.00  1.04  0.97  0.95  0.91  0.90  0.93  1.03  0.98
Bayesian   0.92  0.86  0.71  0.69  0.67  0.79  0.77  0.96  0.86  0.82

3.3. Assessing feature importance

A current debate in precision oncology is whether all molecular feature types contain predictive power. Gene panels used in hospitals, such as MSK-IMPACT (Cheng and others, 2015), only consider mutations and copy number, while others argue that gene expression is sufficient to predict response (Piovan and others, 2013; Rodriguez-Barrueco and others, 2015). We investigate this here by fitting separate models for every possible subset of features (mutations, copy number variations, and expression). If a feature set does not contain worthwhile information, it will effectively introduce noise into the model and either not improve performance or lower it (e.g., due to finite samples and a nonconvex optimization procedure).

To measure predictive power of a model, we consider the task of predicting entire out-of-sample experiments at all dose levels. We again measure error on variance-adjusted raw count predictions. However, the curve prediction task is a prior predictive check of the model, rather than a posterior one. This is a more challenging task, as the model does not see any outcomes from the specific experiment when making predictions. We use the logistic-transformed mean as the predicted drug effect,

graphic file with name Equation16.gif (3.15)

The approach of Vis and others (2016) does not provide a way to make raw predictions from feature subsets; we therefore only consider the Bayesian model for this task.

We evaluate the predicted priors from the held out cross-validation folds, across all folds. We measure mean error by first taking the average error on the entire curve, then averaging across all curves. Table 2 shows the curve prediction results. The model generally has lower error as more feature subsets are included, suggesting that all three feature subsets add predictive value. We also considered all features in a simple linear model, as a validation of the neural network approach itself. The results in the next to last line of Table 2 show that the linear model performs worse than the neural network model.
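The two-stage averaging described above (first within each curve, then across curves) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes squared error as the error measure, and the function names and the `scale` argument (a per-curve standard deviation for variance adjustment) are hypothetical.

```python
import numpy as np

def mean_curve_error(pred, obs, scale=None):
    """Average the error within each curve, then across curves.

    pred, obs: arrays of shape (n_curves, n_doses); NaN in obs marks a
    dose that was not observed. scale: optional per-curve standard
    deviation used for variance adjustment (hypothetical argument).
    Squared error is an assumption of this sketch.
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    err = (pred - obs) ** 2
    if scale is not None:
        err = err / np.asarray(scale, dtype=float)[:, None] ** 2
    per_curve = np.nanmean(err, axis=1)  # step 1: average over doses in a curve
    return float(np.mean(per_curve))     # step 2: average across curves

def relative_error(model_err, reference_err):
    """Express an error relative to a reference model (reference = 1.00)."""
    return model_err / reference_err

pred = [[1.0, 2.0], [3.0, 4.0]]
obs = [[1.0, 4.0], [3.0, np.nan]]
print(mean_curve_error(pred, obs))
```

Averaging per curve first gives every experiment equal weight, regardless of how many dose levels it has; pooling all dose-level errors would instead weight longer curves more heavily.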

Table 2.

Mean error results on the curve prediction benchmark relative to using all molecular features. Overall performance generally increases as more features are added, suggesting each subset conveys valuable predictive information not captured by the other two. Further, using a simpler linear model with all features (next to last line) performs much worse than the neural network model.

  (min) Dose level (max)  
Model 0 1 2 3 4 5 6 7 8 All
Mutations 1.06 1.08 1.07 1.06 1.04 1.03 1.02 1.02 1.00 1.03
CNV 1.04 1.06 1.07 1.08 1.09 1.11 1.11 1.13 1.12 1.10
Expression 1.01 1.03 1.02 1.03 1.02 1.03 1.01 1.00 1.02 1.02
Mut+CNV 1.05 1.04 1.06 1.05 1.04 1.03 1.01 1.02 1.00 1.03
Mut+Exp 1.00 1.01 1.01 1.01 1.01 1.00 1.00 0.99 0.99 1.00
CNV+Exp 0.99 1.00 1.00 1.01 1.01 1.01 1.01 1.00 1.00 1.01
All (Linear) 3.42 3.56 3.16 3.05 2.80 2.63 2.48 2.35 2.24 2.59
All 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

3.4. Qualitative evaluation

In addition to the quantitative results above, we also evaluate whether the model recapitulates known biology in its learned prior. This qualitative check provides reassurance that the patterns discovered by the model are reliable and not likely just functions of undetected experimental artifacts. We generated marginal prior predictive curves with different subsets of drugs and cell lines. The subsets of cells are chosen based on features that are known to be targeted by certain drugs. Drugs that target one subset should produce more sensitive marginal predicted response curves.

BRAF inhibitors:

Six of the drugs in the data set are labeled as BRAF inhibitors. These drugs all target a well-studied, commonly mutated oncogene called Type-B Rapidly Accelerated Fibrosarcoma (BRAF). Mutations in this gene, particularly mutations in codon 600 (V600E), lead to activation of the mitogen-activated protein kinase (MAPK) (MEK/ERK) signaling pathway in tumor cells. As shown in the top left panel of Figure 2, the model successfully captures the differential efficacy of BRAF inhibitors between cell lines possessing mutant and wild type variants.

Fig. 2.

Prior predictive checks recapitulating known biology. Top left: cell lines with and without a mutated BRAF gene, when treated with any of the six drugs designed to target such cells. Top right: Drugs targeting MAPK signaling that shut down the aberrant signaling caused by mutations in RAS-type genes. Bottom left: Nutlin-3(a) (an MDM2 inhibitor) is effective only when both MDM2 levels are high and TP53 is not mutated. Bottom right: PI3K inhibitors target cells exhibiting Phosphatase and tensin homolog (PTEN) loss and PIK3CA elevation.

MAPK signaling:

Eighteen of the drugs in the data set target signaling in the MAPK pathway. The MAPK signaling pathway is involved in cell growth and proliferation and is one of the most commonly activated pathways across tumors. Kirsten rat sarcoma (KRAS) and Neuroblastoma rat sarcoma (NRAS) are members of the RAS gene family, which constitute central nodes in the MAPK pathway. RAS proteins activate other proteins that in turn drive cell growth. Mutations in KRAS and NRAS can lead them to be permanently activated, causing continual growth signaling downstream in the MAPK pathway. The drugs we consider here target the MAPK pathway in various ways to downregulate its activity; specifically, the drugs included here target BRAF, RSK2, ERK1/2/5, and MEK1/2/5. The top right panel of Figure 2 shows that the model successfully predicts more sensitivity to these drugs when a cell line has a mutation in one of these two RAS genes.

Nutlin-3(a):

The drug Nutlin-3(a) binds to and suppresses MDM2. MDM2 is a negative regulator of P53. TP53, the gene that codes for the P53 protein, is the most commonly mutated gene in cancer and is known as the “guardian of the genome” for the multiple roles that it plays in regulating the response to DNA damage. When DNA damage is found, TP53 will initiate a transcriptional program to attempt to repair the damage or, if the damage is too severe, will initiate apoptosis and force the cell to die. When TP53 is mutated or suppressed, these safety mechanisms are disabled, allowing downstream damage to occur unchecked and leading to oncogenic growth. MDM2 targets TP53 for proteasomal degradation and, when elevated, can suppress TP53 entirely. Thus, in order for Nutlin-3(a) to be effective, we need two conditions to be true: (i) MDM2 needs to be elevated above normal levels such that it functionally inactivates TP53 and (ii) TP53 must not be mutated, such that if MDM2 is suppressed then TP53 will be functional and capable of initiating apoptosis. In this case, we consider a cell line to have an elevated level of MDM2 if it is expressed at least one standard deviation above the mean expression level in the data set. The bottom left panel of Figure 2 confirms that the model predicts this nonlinear efficacy pattern.
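The two-condition rule above can be written as a simple subset selection. The sketch below is illustrative only: the function name and arguments are hypothetical, and it assumes the mean and standard deviation of MDM2 expression are taken over all cell lines in the data set, using the population standard deviation.

```python
import numpy as np

def nutlin_candidate_mask(mdm2_expr, tp53_mutated, n_sd=1.0):
    """Flag cell lines with elevated MDM2 and wild type TP53.

    mdm2_expr: (n_cell_lines,) MDM2 expression values.
    tp53_mutated: (n_cell_lines,) booleans, True if TP53 is mutated.
    n_sd: standard deviations above the mean that count as elevated
          (one, per the text).
    """
    mdm2_expr = np.asarray(mdm2_expr, dtype=float)
    tp53_mutated = np.asarray(tp53_mutated, dtype=bool)
    cutoff = mdm2_expr.mean() + n_sd * mdm2_expr.std()
    mdm2_high = mdm2_expr >= cutoff   # condition (i): elevated MDM2
    tp53_wild_type = ~tp53_mutated    # condition (ii): TP53 not mutated
    return mdm2_high & tp53_wild_type

expr = [0.0, 0.0, 0.0, 0.0, 10.0]
mutated = [False, True, False, False, False]
print(nutlin_candidate_mask(expr, mutated))
```

Neither condition alone selects the expected responders; only the conjunction does, which is the nonlinear interaction the prior predictive check is meant to expose.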

PI3K inhibitors:

There are Inline graphic drugs in the data set that are PI3K inhibitors. These drugs broadly target the PI3K-AKT-mTOR pathway, which plays a prominent role in cell growth and division. Cells that exhibit oncogenic malfunctions in this pathway are characterized by two common hallmarks: (i) a loss or mutation of the tumor suppressor gene PTEN and (ii) elevated levels of the oncogene PIK3CA. As in the Nutlin-3(a) scenario, these two show a nonlinear interaction that is replicated by the learned prior, as seen in the bottom right panel of Figure 2.

The qualitative checks above provide a useful reassurance, but they are not foolproof. Namely, these are checks about marginal feature importance. Most features are highly correlated with each other. This correlation may lead a marginal test to indicate the model is capturing biological knowledge about an important feature when it is really learning information about other features that are correlated with the chosen examples. As a final validation of our approach, we use the posterior dose–response curves to test for conditional feature importance.

4. Discovering markers of cell line drug response

4.1. Formulating marker detection as a multiple testing problem

In addition to predictive modeling, a second goal in cancer drug screening is to detect biological markers, e.g., genomic or transcriptomic variations in a sample, that allow us to predict that a sample will be sensitive or resistant to therapy. These markers represent potential causal links that, if validated in wet lab experiments and clinical trials, can be used to craft personalized treatments for patients. Note that while it is impossible to assess their clinical impact on humans, screening for cell line markers of therapeutic response enables scientists to form hypotheses about potential clinical biomarkers to investigate in follow-up studies.

Statistically, we formulate identification of cell line drug response markers as a multiple hypothesis testing problem over a set of binary features. The null hypothesis is that a feature Inline graphic is conditionally independent of response to drug Inline graphic, given all other features,

graphic file with name Equation17.gif (4.16)

We argue that testing for conditional independence is more appropriate for marker discovery than the current approaches often employed in biology. Existing approaches in the literature either test for marginal dependence (Yang and others, 2012) or use heuristic feature selection techniques like lasso regression (cf. Barretina and others, 2012). The marginal dependence case leads to many spurious false positives due to the high degree of dependence between features. Heuristic methods, on the other hand, offer neither quantification of uncertainty nor a way to control false positives. By contrast, conditional independence testing corrects for feature dependence, and casting the problem as hypothesis testing enables us to apply existing techniques for multiple hypothesis correction to achieve finite-sample control over the target error rate (e.g., Type I error or false discovery rate).

We define cell line response markers to be binary features. In this formulation, a sample is either positive or negative for a marker. We convert all molecular features to binary markers and test each for conditional independence. Specifically, we consider the following candidate features as potential markers:

  • Mutations. These features are already binary in nature.

  • Copy number gain and loss. We define copy number gain (loss) binary features as a copy number greater (less) than the median copy number for the cell line. For normal diploid human cells this is Inline graphic; however, cell lines exhibit high degrees of aneuploidy, and the median copy number of cell lines in the GDSC is Inline graphic.

  • Over and under expression. We define a gene to be over (under) expressed if its expression is more than Inline graphic standard deviation above (below) the mean.

We discard any feature with fewer than Inline graphic positive and negative samples to lower the likelihood of spurious results. The final data set has Inline graphic binary features, which we test across all Inline graphic drugs for a total of Inline graphic M hypotheses.
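The binarization and filtering steps listed above can be sketched as follows. This is a hedged illustration under stated assumptions: the function names are hypothetical; the expression cutoff `n_sd` and the minimum group size `min_count` are left as parameters because their values are not recoverable from the text; the expression mean and standard deviation are assumed to be computed per gene across cell lines; and the filter is assumed to require the minimum count in both the positive and the negative group.

```python
import numpy as np

def binarize_copy_number(cn):
    """Gain/loss relative to each cell line's own median copy number.

    cn: (n_cell_lines, n_genes) copy number matrix.
    Returns (gain, loss) boolean matrices of the same shape.
    """
    cn = np.asarray(cn, dtype=float)
    med = np.median(cn, axis=1, keepdims=True)  # one median per cell line
    return cn > med, cn < med

def binarize_expression(expr, n_sd):
    """Over/under expression relative to each gene's mean across cell lines.

    expr: (n_cell_lines, n_genes); n_sd: cutoff in standard deviations.
    """
    expr = np.asarray(expr, dtype=float)
    mu = expr.mean(axis=0, keepdims=True)
    sd = expr.std(axis=0, keepdims=True)
    return expr > mu + n_sd * sd, expr < mu - n_sd * sd

def keep_features(binary, min_count):
    """Keep columns with at least min_count positive and negative samples."""
    binary = np.asarray(binary, dtype=bool)
    pos = binary.sum(axis=0)
    neg = binary.shape[0] - pos
    return (pos >= min_count) & (neg >= min_count)

gain, loss = binarize_copy_number([[1, 2, 3], [2, 2, 4]])
print(gain.astype(int), loss.astype(int), sep="\n")
```

Mutations need no transformation because they are already binary; the gain, loss, over-expression, and under-expression matrices would be concatenated with them column-wise before filtering.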

Binary markers have some important advantages over simply using continuous or ordinal feature values like raw expression levels. Statistically, splitting the features into over- and under-expressed, or gain and loss, enables us to test directional information that a simpler hypothesis test about the gene expression itself would not. It also simplifies the statistical estimation problem needed to conduct the hypothesis tests (see next section), replacing the need to estimate the entire distribution with the need to estimate only a probit-style cutoff. Computationally, reducing all features to be binary enables us to cache a large number of operations, which makes practical implementation of the Inline graphic M hypothesis tests feasible. Finally, we believe the binary on-off nature of the markers adds to the interpretability of the results for scientists, who can easily communicate in terms of copy gains and losses or up- and down-regulation of genes.

4.2. Amortized conditional randomization testing

To test for conditional independence, we develop a new type of conditional randomization test (CRT) (Candes and others, 2018). As with any randomization test, the CRT repeatedly samples from the null distribution to build up an empirical distribution of an arbitrary test statistic. The observed data can then be compared against the null samples to calculate a Inline graphic-value for each null hypothesis.

In the conditional independence case, the null distribution is Inline graphic. In practice, the null distribution must be estimated from the data. With thousands of features, fitting a separate conditional model for each feature can be computationally prohibitive. Rather than fitting conditional distributions for each feature, we posit a factor model of the data set,

graphic file with name Equation18.gif (4.17)

Given the factor model assumption, we test the surrogate null hypothesis that conditions only on the latent factor,

graphic file with name Equation19.gif (4.18)

If the factor model assumption holds, then the right hand side null hypothesis is a superset of the conditional independence null hypothesis. This is because Inline graphic contains at least all of the information of Inline graphic in (4.17).

The factor model assumption enables estimation of the conditional model to be amortized: rather than fit individual conditional models, we can fit a single factor model to the data set and sample from Inline graphic, where Inline graphic is the estimated latent factors. Concretely, we fit a logistic factor model to the binary data set. We include tissue type as Inline graphic binary columns to ensure tissue type does not act as a latent confounder. We use Inline graphic latent factors and optimize the model until convergence to a local optimum.
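Once a logistic factor model has been fit, drawing a null copy of any feature requires only a matrix product and a Bernoulli draw, which is what makes the procedure amortized. The sketch below assumes the common parameterization in which the success probability of entry (i, j) is the logistic function of the inner product between the latent factors of cell line i and the loadings of feature j; the exact form of (4.17) is not reproduced here, and the names `W`, `V`, and `sample_null_feature` are hypothetical.

```python
import numpy as np

def sample_null_feature(W, V, j, rng):
    """Draw one null sample of binary feature j given the latent factors.

    W: (n_cell_lines, k) estimated latent factors (hypothetical name).
    V: (n_features, k) feature loadings (hypothetical name).
    j: index of the feature being tested.
    rng: a numpy.random.Generator.
    """
    logits = W @ V[j]                      # one logit per cell line
    probs = 1.0 / (1.0 + np.exp(-logits))  # logistic link
    return rng.random(W.shape[0]) < probs  # independent Bernoulli draws

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))  # toy factors, not fitted values
V = rng.normal(size=(4, 3))  # toy loadings, not fitted values
print(sample_null_feature(W, V, j=2, rng=rng))
```

Because `W` and `V` are estimated once for the whole data set, the same fitted model serves every feature and every drug.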

We calculate Inline graphic-values by sampling up to Inline graphic null samples, calculating their test statistics, and comparing with the true test statistic. We target a Inline graphic false discovery rate (FDR) using Benjamini-Hochberg (BH) correction for multiple tests. To speed up computation, we early-stop Inline graphic-values if they are larger than Inline graphic for Inline graphic; these larger Inline graphic-values will not pass a BH correction step for a Inline graphic FDR, making this a conservative step.
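The randomization p-value, the early-stopping shortcut, and the BH step can be sketched together. This is a minimal sketch rather than the authors' code: it uses the standard add-one estimator (1 + number of null statistics at least as large as the observed one) / (1 + number of draws), which may differ from the convention used in the article; the early-stopping rule is one plausible reading of the text; and `min_draws`, `stop_above`, and the function names are hypothetical.

```python
import numpy as np

def crt_pvalue(observed_stat, draw_null_stat, max_draws,
               min_draws=None, stop_above=None):
    """Randomization p-value with optional early stopping.

    draw_null_stat: zero-argument callable returning the test statistic
    computed on one freshly sampled null feature.
    Stops early once at least min_draws samples have been drawn and the
    running p-value exceeds stop_above (hypothetical rule).
    """
    exceed = 0
    p = 1.0
    for n in range(1, max_draws + 1):
        if draw_null_stat() >= observed_stat:
            exceed += 1
        p = (1 + exceed) / (1 + n)
        if min_draws is not None and stop_above is not None:
            if n >= min_draws and p > stop_above:
                break  # clearly unpromising; the exact value no longer matters
    return p

def benjamini_hochberg(pvalues, fdr):
    """Return a boolean mask of hypotheses rejected at the given FDR."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = fdr * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()    # largest rank passing its threshold
        rejected[order[:k + 1]] = True    # reject everything up to that rank
    return rejected

print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5], fdr=0.1))
```

Early stopping is conservative only when `stop_above` is chosen large enough that no p-value above it could be rejected by BH at the target FDR.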

4.3. Safety, toxicity, and targeted efficacy

Conditional randomization testing requires the specification of a test statistic. We propose a new dose-response summary statistic: targeted efficacy. Drugs with a high targeted efficacy score for a given marker have a strong differential response probability when segregating the samples based on the marker. The statistic reports the degree to which a drug targets a subpopulation of samples with or without the marker and remains innocuous for the other subpopulation.

We first define what it means for a drug to be safe or toxic. These are subjective decisions that can be informed by biological knowledge in collaboration with experts in a specific therapy or cancer. After conversations with pathologists at Columbia University Medical Center, we chose to define a drug as safe if at least Inline graphic of cells were expected to survive treatment; toxicity was chosen to be a survival rate of at most Inline graphic. To calculate the safety statistic for drug Inline graphic at dose level Inline graphic, we aggregate over all samples for which the target marker is negative,

graphic file with name Equation20.gif (4.19)

Similarly, we define toxicity as having a survival rate of at most Inline graphic,

graphic file with name Equation21.gif (4.20)

Safety and toxicity probabilities are then combined to assess overall efficacy of the drug. We calculate efficacy by taking the minimum of the safety and toxicity probabilities at each dose, then reporting the score at the best observed dose,

graphic file with name Equation22.gif (4.21)

Finally, a marker may be an indicator of sensitivity or resistance. The targeted efficacy score takes the most effective of the two possibilities by looking at efficacy in the marker positive and negative groups,

graphic file with name Equation23.gif (4.22)

Targeted efficacy is distinguished from common measures of drug efficacy, like IC50 and area under the dose–response curve (AUC), in that it is tied directly to the marker of interest. Further, rather than summarizing entire curves or projecting to the Inline graphic survival point, targeted efficacy aims to capture whether a specific, viable dose is effective for samples with such a marker. It also requires a probabilistic modeling approach that can quantify response uncertainty, making it incompatible with traditional point-estimation approaches to dose–response modeling.
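The verbal definitions of safety, toxicity, efficacy, and targeted efficacy can be sketched with posterior draws of cell survival. This is one possible reading, not a transcription of (4.19)-(4.22): it assumes the aggregation over samples is a simple average of per-sample posterior probabilities, that toxicity is aggregated over the group opposite to the one used for safety, and that both groups are nonempty. The thresholds `safe_min` and `toxic_max` are parameters because their values are not recoverable from the text, and all names are hypothetical.

```python
import numpy as np

def targeted_efficacy(survival_draws, marker, safe_min, toxic_max):
    """Targeted efficacy score for one drug and one binary marker.

    survival_draws: (n_samples, n_doses, n_draws) posterior draws of the
        surviving fraction of cells.
    marker: (n_samples,) booleans, True where the marker is positive.
    safe_min: survival fraction at or above which a dose counts as safe.
    toxic_max: survival fraction at or below which a dose counts as toxic.
    Returns (score, indication).
    """
    s = np.asarray(survival_draws, dtype=float)
    marker = np.asarray(marker, dtype=bool)
    p_safe = (s >= safe_min).mean(axis=2)    # per sample and dose
    p_toxic = (s <= toxic_max).mean(axis=2)

    def efficacy(target, spared):
        toxicity = p_toxic[target].mean(axis=0)  # per dose, targeted group
        safety = p_safe[spared].mean(axis=0)     # per dose, spared group
        return float(np.minimum(safety, toxicity).max())  # best dose

    sensitivity = efficacy(marker, ~marker)  # marker positive is targeted
    resistance = efficacy(~marker, marker)   # marker negative is targeted
    if sensitivity >= resistance:
        return sensitivity, "sensitivity"
    return resistance, "resistance"

# Toy data: 4 samples, 2 doses, 1 posterior draw each.
draws = np.array([[[0.90], [0.10]], [[0.90], [0.10]],
                  [[0.95], [0.90]], [[0.95], [0.90]]])
print(targeted_efficacy(draws, [True, True, False, False], 0.8, 0.2))
```

Taking the minimum of safety and toxicity at each dose means a high score requires a single dose that is simultaneously lethal to one group and innocuous to the other.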

4.4. Discoveries

The full list of Inline graphic-values for all Inline graphic markers on all drugs is available in the Supplementary material available at Biostatistics online. Table 3 highlights three subsets of these markers. For each marker, we list the following information:

Table 3.

A subset of the drug response markers discovered with Inline graphic. The discoveries for BRAF mutations and Nutlin-3(a) mirror the results of the qualitative model evaluation. The BCR-ABL fusion markers recapitulate known results in the literature as well as suggest new avenues for investigation.

Drug Gene Marker Indicates Score Pos Neg Inline graphic -value
BRAF mutated
CI-1040 BRAF Mutated Sensitivity 0.53 69 752 Inline graphic
Dabrafenib BRAF Mutated Sensitivity 0.62 78 813 Inline graphic
PLX-4720 BRAF Mutated Sensitivity 0.49 70 762 Inline graphic
PLX-4720 BRAF Mutated Sensitivity 0.53 81 820 Inline graphic
Selumetinib BRAF Mutated Sensitivity 0.55 79 824 Inline graphic
Nutlin-3(a)
Nutlin-3a APOBEC3H Overexpressed Sensitivity 0.54 57 775 Inline graphic
Nutlin-3a BAX Overexpressed Sensitivity 0.55 133 699 Inline graphic
Nutlin-3a CYP8B1 Mutated Sensitivity 0.65 5 827 Inline graphic
Nutlin-3a DCST2 Mutated Sensitivity 0.65 6 826 Inline graphic
Nutlin-3a DDB2 Overexpressed Sensitivity 0.55 148 684 Inline graphic
Nutlin-3a EDA2R Overexpressed Sensitivity 0.56 126 706 Inline graphic
Nutlin-3a FNIP1 Mutated Sensitivity 0.65 6 826 Inline graphic
Nutlin-3a FUT7 Overexpressed Sensitivity 0.60 37 795 Inline graphic
Nutlin-3a GYG1 Overexpressed Sensitivity 0.52 94 738 Inline graphic
Nutlin-3a MDM2 Overexpressed Sensitivity 0.54 106 726 Inline graphic
Nutlin-3a PBXIP1 Mutated Sensitivity 0.65 6 826 Inline graphic
Nutlin-3a QSOX1 Mutated Sensitivity 0.67 5 827 Inline graphic
Nutlin-3a RPS27L Overexpressed Sensitivity 0.55 137 695 Inline graphic
Nutlin-3a TP53 Mutated Resistance 0.48 485 347 Inline graphic
Nutlin-3a ZMAT3 Overexpressed Sensitivity 0.56 133 699 Inline graphic
BCR-ABL fusions
Axitinib BCR-ABL Mutated Sensitivity 0.94 6 824 Inline graphic
Bosutinib BCR-ABL Mutated Sensitivity 0.84 6 824 Inline graphic
Cabozantinib BCR-ABL Mutated Sensitivity 0.73 5 890 Inline graphic
CHIR-99021 BCR-ABL Mutated Sensitivity 0.54 6 890 Inline graphic
CP466722 BCR-ABL Mutated Sensitivity 0.61 5 894 Inline graphic
FR-180204 BCR-ABL Mutated Sensitivity 0.56 5 890 Inline graphic
HG6-64-1 BCR-ABL Mutated Sensitivity 0.80 5 854 Inline graphic
JQ1 BCR-ABL Mutated Sensitivity 0.74 6 890 Inline graphic
Masitinib BCR-ABL Mutated Sensitivity 0.81 5 892 Inline graphic
Methotrexate BCR-ABL Mutated Sensitivity 0.70 6 823 Inline graphic
NG-25 BCR-ABL Mutated Sensitivity 0.98 5 890 Inline graphic
Nilotinib BCR-ABL Mutated Sensitivity 0.95 6 779 Inline graphic
NVP-BHG712 BCR-ABL Mutated Sensitivity 0.98 5 890 Inline graphic
Palbociclib BCR-ABL Mutated Sensitivity 0.57 6 803 Inline graphic
Ponatinib BCR-ABL Mutated Sensitivity 0.80 5 854 Inline graphic
QL-XI-92 BCR-ABL Mutated Sensitivity 0.71 5 892 Inline graphic
Tamoxifen BCR-ABL Mutated Sensitivity 0.61 6 902 Inline graphic
Tivozanib BCR-ABL Mutated Sensitivity 0.80 5 892 Inline graphic
TL-1-85 BCR-ABL Mutated Sensitivity 0.82 5 890 Inline graphic
TL-2-105 BCR-ABL Mutated Sensitivity 0.80 5 894 Inline graphic
Veliparib BCR-ABL Mutated Sensitivity 0.63 6 824 Inline graphic
  • Drug: The drug being tested.

  • Gene: The gene where the molecular feature is located.

  • Marker: The molecular feature type representing the marker (e.g., mutated or overexpressed).

  • Indicates: Whether a positive value for this marker indicates sensitivity or resistance to treatment with the drug.

  • Score: The targeted efficacy score.

  • Pos & Neg: The number of positive and negative samples in the GDSC data set.

  • Inline graphic -value: The associated Inline graphic-value for the marker conditional independence test.

Any marker where the targeted efficacy score was higher than all Inline graphic null samples is reported as being less than Inline graphic; additional randomizations could be run to derive a more precise value.

The first two results, BRAF mutations and Nutlin-3(a), support the qualitative findings by reporting well-known markers with a strong biological basis. Specifically, the drugs identified as having BRAF mutation as a marker for sensitivity are all either BRAF or MEK inhibitors. As noted in Section 3.4, these inhibitors target signaling in the MAPK pathway and BRAF mutations (particularly V600E mutations) are expected to indicate sensitivity to treatment by all the drugs discovered. Similarly, Nutlin-3(a) is found to have TP53 mutations as an indicator of resistance and MDM2 overexpression as an indicator of sensitivity, again matching the expected biological response described in Section 3.4. Along with these two markers, several other well-known markers of Nutlin-3(a) sensitivity are found, such as overexpression of the apoptosis activator gene BAX (Toshiyuki and Reed, 1995) and the DNA damage-binding gene DDB2 (Shangary and Wang, 2008).

The third set of results focuses on fusions of the genes Breakpoint cluster region (BCR) and Abelson murine leukemia (ABL), which are coded as mutations in our feature set. The detected set of drugs for this marker includes several drugs (e.g., Nilotinib and Bosutinib) that are well-known to target BCR-ABL fusions (Rix and others, 2007). The list also recapitulates the potent inhibition of Axitinib in BCR-ABL cell lines (Pemovska and others, 2015). Interestingly, cell lines with BCR-ABL fusions appear to be sensitive to other kinase inhibitors (e.g., NG-25 and NVP-BHG712) whose function on BCR-ABL has not been previously described. This suggests a broader activity of some of these drugs that has to be validated experimentally.

The discoveries made suggest many potential causal drivers of sensitivity and resistance. However, the Bayesian model does not replace the need for validation experiments. Instead, it enables biologists to guide their experimental planning by suggesting potential drivers of sensitivity and resistance. It is possible that latent confounders, such as DNA methylation, are correlated with the features that we explored. These latent confounders represent a backdoor path (Pearl, 2009) through which a noncausal feature may appear to represent a causal link. Making strong causal inference statements about drug sensitivity would require follow-up wet lab experiments that directly intervene on the candidate marker.

5. Discussion

5.1. The benefits of modeling uncertainty

Predictive models for cancer cell line drug response enable science to move at a faster pace. If a predictor can faithfully replicate the outcome of a wet lab experiment, scientists can screen drugs quickly in simulation to find potential therapies worth investigating. The usefulness of good predictors has led biologists to organize predictive modeling competitions to crowdsource better predictors (Costello and others, 2014) and to build bespoke machine learning models to predict drug response (Menden and others, 2013; Ammad-ud din and others, 2016; Rampášek, 2019). While these efforts have led to models with improved predictive performance, the target of prediction is a summary statistic derived from preprocessing pipelines such as the approach in Section 3.1.

Compressing each experiment down to a single point estimate of a summary statistic is problematic. At each step in the pipeline, simplifying assumptions remove structure from the model and obfuscate the inherent uncertainty in the measurements, effects, and outcomes. Specifically, (i) averaging the negative controls ignores measurement noise and technical error; (ii) averaging positive controls fails to capture natural variability in cell growth; (iii) a log-linear dose model makes strong assumptions about the effect of different drug concentrations; (iv) the summary statistic reported contains no information about the uncertainty in effect size of a given drug at any specific dose; and (v) drug response marker discovery does not account for uncertainty in the summary statistics.

With these considerations in mind, we proposed a Bayesian approach to modeling dose-response in high-throughput cancer cell line experiments. The Bayesian model addresses the issues with the typical pipeline approach by directly modeling uncertainty at every step: (i–ii) both positive and negative controls are treated as random quantities, with plate-level uncertainty quantification of machine bias and cell growth; (iii) the dose effect is constrained to be monotonic, but is otherwise fully flexible and not limited to the log-linear regime; (iv) the model enables empirical Bayes posterior inference over the entire dose–response curve for every experiment; and (v) the posterior distributions are used to detect drug response markers through a new probabilistic test statistic.

The Bayesian model outperforms the pipelined approach in benchmarks on the GDSC data set. However, there is still ample room for improvement. The neural network architecture and training method we used were not explored extensively. Other models or architectures such as those from the existing literature in computational biology (e.g., Rampášek, 2019) may yield better performance if combined with a Bayesian model of high-throughput screening experiments. The model could also be improved to better match the data. For instance, some drugs induce total cell death at high doses, and some have no effect at all below a certain concentration. The logistic-MVN model may be a poor fit in these cases since they push the logits to extreme values and consequently dominate the likelihood; estimating latent upper and lower response bounds, as in Wilson and others (2014), may address this issue. We followed a typical processing pipeline to determine mutations, CNVs, and expression levels. These feature pipelines remove uncertainty in the sequencing process that, if modeled directly, may also lead to a better predictive model. We plan to investigate these extensions in future work. Further, since the initial draft of this article was released, a new version of the GDSC data set (GDSC2) has been made available. However, the format of the controls and features available has changed in the new data set; we plan to adapt our GDSC model to GDSC2 in future work.

Finally, cell lines are only simplified, imperfect models of real patient tumors. To be useful in the clinic, a predictive model will need to incorporate real patient data, such as clinical trial and observational patient health records. As a guide for follow-up experiments, cell lines are a useful tool and predictive modeling of cell line drug response can help to better guide the scientist. As more patient data become available, combining our predictive model for cell lines with patient and other (e.g., mice and organoids) data may lead to a better predictor of actual patient drug response.

6. Conclusions

Beyond the specific model and metrics, our experience led to several observations about measurement and modeling in high-throughput cancer drug screening. We conclude by highlighting six key takeaways:

  • 1. Spatial and temporal batch effects can substantially skew results in high-throughput dose–response experiments. We showed evidence that experimental controls are systematically contaminated, creating dependency between observations. We detailed a correction approach for legacy data that first detects and discards contaminated observations, then leverages temporal dependencies to compensate for the loss of data.

  • 2. The one-trial-per-dose paradigm is problematic. Natural variation among cell lines was observed to be substantial in Section 2.4, with swings as high as Inline graphic of the median control response being common in many experiments. This complicates inference by creating high degrees of uncertainty about any individual experiment.

  • 3. A deep Bayesian dose–response model outperforms the state of the art. The Bayesian model was shown in Section 3.2 to outperform a state-of-the-art technique for dose–response modeling developed for the GDSC data. The posterior mean has Inline graphic lower mean squared error when predicting a held out dose level.

  • 4. All molecular feature types contain relevant information. Through an exhaustive feature subset study in Section 3.3, we showed that the model predictions are improved by each of the three subsets of features (mutations, copy number, and expression).

  • 5. Hierarchical probabilistic models can generate biologically meaningful, nonlinear predictions. A series of examples in Section 3.4 showed that the model recapitulates known biology. In some cases, this involved nonlinear combinations of features such as needing a high level of expression in one gene and a corresponding wild type of another gene for a specific drug to be effective. This suggests the approach has the potential to be used in exploratory drug discovery experiments where candidate drugs are often tried without a known mechanism of action.

  • 6. Amortized conditional randomization testing provides a scalable, powerful way to detect novel markers of drug response. By imposing a factor model structure on binarized features, we were able to conduct Inline graphic M hypothesis tests. The method uncovered both well-studied biomarkers and many new avenues for future research.

Supplementary Material

kxaa047_Supplementary_Data

Acknowledgments

The authors thank Victor Veitch, Mykola Bordyuh, Antonio Iavarone, and Anna Lasorella for many helpful conversations. They also thank Jackson Loper for providing the binary matrix factorization code used in the marker testing.

Conflict of Interest: None declared.

Footnotes

1The terms “negative” and “positive” control here are used differently from their common usage in biology. However, this is the terminology used by Garnett and others (2012); we follow their usage.

2We use medians rather than means to avoid spurious contamination not removed in the decontamination step.

Contributor Information

Wesley Tansey, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA.

Kathy Li, Data Science Institute, Columbia University and Columbia University Medical Center, New York, NY, USA and Applied Physics and Applied Mathematics, Columbia University and Columbia University Medical Center, New York, NY, USA.

Haoran Zhang, Applied Physics and Applied Mathematics, Columbia University and Columbia University Medical Center, New York, NY, USA.

Scott W Linderman, Data Science Institute, Columbia University and Columbia University Medical Center, New York, NY, USA and Department of Statistics, Columbia University and Columbia University Medical Center, New York, NY, USA.

Raul Rabadan, Department of Systems Biology, Columbia University and Columbia University Medical Center, New York, NY, USA.

David M Blei, Data Science Institute, Columbia University and Columbia University Medical Center, New York, NY, USA and Department of Statistics, Columbia University and Columbia University Medical Center, New York, NY, USA.

Chris H Wiggins, Data Science Institute, Columbia University and Columbia University Medical Center, New York, NY, USA, Department of Applied Physics and Applied Mathematics, Columbia University and Columbia University Medical Center, New York, NY, USA and Department of Systems Biology, Columbia University and Columbia University Medical Center, New York, NY, USA.

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org. Code is publicly available at https://github.com/tansey/deep-dose-response.

Funding

A seed grant from the Data Science Institute of Columbia University and the NIH (U54-CA193313) to W.T.; the Simons Foundation (SCGB-418011) to S.W.L.; the NSF (1305023, 1344668) and NIH (U54-CA193313) to C.H.W.; ONR (N00014-17-1-2131, N00014-15-1-2209), NIH (1U01MH115727-01), and DARPA (SD2 FA8750-18-C-0130) to D.M.B.

References

  1. Ammad-ud-din, M., Khan, S. A., Malani, D., Murumägi, A., Kallioniemi, O., Aittokallio, T. and Kaski, S. (2016). Drug response prediction by inferring pathway-response associations with kernelized Bayesian matrix factorization. Bioinformatics 32, i455–i463.
  2. Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A. A., Kim, S., Wilson, C. J., Lehár, J., Kryukov, G. V. and Sonkin, D. (2012). The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603.
  3. Candes, E., Fan, Y., Janson, L. and Lv, J. (2018). Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80, 551–577.
  4. Cheng, D. T., Mitchell, T. N., Zehir, A., Shah, R. H., Benayed, R., Syed, A., Chandramohan, R., Liu, Z. Y., Won, H. H. and Scott, S. N. (2015). Memorial Sloan Kettering-integrated mutation profiling of actionable cancer targets (MSK-IMPACT): a hybridization capture-based next-generation sequencing clinical assay for solid tumor molecular oncology. The Journal of Molecular Diagnostics 17, 251–264.
  5. Costello, J. C., Heiser, L. M., Georgii, E., Gönen, M., Menden, M. P., Wang, N. J., Bansal, M., Hintsanen, P., Khan, S. A. and Mpindi, J.-P. (2014). A community effort to assess and improve drug sensitivity prediction algorithms. Nature Biotechnology 32, 1202.
  6. Garnett, M. J., Edelman, E. J., Heidorn, S. J., Greenman, C. D., Dastur, A., Lau, K. W., Greninger, P., Thompson, I. R., Luo, X. and Soares, J. (2012). Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 483, 570.
  7. Haibe-Kains, B., El-Hachem, N., Birkbak, N. J., Jin, A. C., Beck, A. H., Aerts, H. J. W. L. and Quackenbush, J. (2013). Inconsistency in large pharmacogenomic studies. Nature 504, 389–393.
  8. Haverty, P. M., Lin, E., Tan, J., Yu, Y., Lam, B., Lianoglou, S., Neve, R. M., Martin, S., Settleman, J. and Yauch, R. L. (2016). Reproducible pharmacogenomic profiling of cancer cell line panels. Nature 533, 333–337.
  9. Iorio, F., Knijnenburg, T. A., Vis, D. J., Bignell, G. R., Menden, M. P., Schubert, M., Aben, N., Gonçalves, E., Barthorpe, S. and Lightfoot, H. (2016). A landscape of pharmacogenomic interactions in cancer. Cell 166, 740–754.
  10. Johnson, W. E., Li, C. and Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127.
  11. Lachmann, A., Giorgi, F. M., Alvarez, M. J. and Califano, A. (2016). Detection and removal of spatial bias in multiwell assays. Bioinformatics 32, 1959–1965.
  12. Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K. and Irizarry, R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11, 733–739.
  13. Lin, L. and Dunson, D. B. (2014). Bayesian monotone regression using Gaussian process projection. Biometrika 101, 303–317.
  14. Loken, E. and Gelman, A. (2017). Measurement error and the replication crisis. Science 355, 584–585.
  15. Low-Kam, C., Telesca, D., Ji, Z., Zhang, H., Xia, T., Zink, J. I. and Nel, A. E. (2015). A Bayesian regression tree approach to identify the effect of nanoparticles properties on toxicity profiles. The Annals of Applied Statistics 9, 383–401.
  16. Mazoure, B., Nadon, R. and Makarenkov, V. (2017). Identification and correction of spatial bias are essential for obtaining quality data in high-throughput screening technologies. Scientific Reports 7, 11921.
  17. Menden, M. P., Iorio, F., Garnett, M., McDermott, U., Benes, C. H., Ballester, P. J. and Saez-Rodriguez, J. (2013). Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS One 8, e61318.
  18. Muir, P., Li, S., Lou, S., Wang, D., Spakowicz, D. J., Salichos, L., Zhang, J., Weinstock, G. M., Isaacs, F. and Rozowsky, J. (2016). The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biology 17, 53.
  19. Murray, I., Adams, R. and MacKay, D. (2010). Elliptical slice sampling. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Sardinia, Italy: Proceedings of Machine Learning Research, pp. 541–548.
  20. Pearl, J. (2009). Causality. Cambridge, United Kingdom: Cambridge University Press.
  21. Pemovska, T., Johnson, E., Kontro, M., Repasky, G. A., Chen, J., Wells, P., Cronin, C. N., McTigue, M., Kallioniemi, O. and Porkka, K. (2015). Axitinib effectively inhibits BCR-ABL1 (T315I) with a distinct binding conformation. Nature 519, 102–105.
  22. Piovan, E., Yu, J., Tosello, V., Herranz, D., Ambesi-Impiombato, A., Da Silva, A. C., Sanchez-Martin, M., Perez-Garcia, A., Rigo, I. and Castillo, M. (2013). Direct reversal of glucocorticoid resistance by AKT inhibition in acute lymphoblastic leukemia. Cancer Cell 24, 766–776.
  23. Polson, N. G., Scott, J. G. and Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association 108, 1339–1349.
  24. Rampášek, L., Hidru, D., Smirnov, P., Haibe-Kains, B. and Goldenberg, A. (2019). Dr. VAE: improving drug response prediction via modeling of drug perturbation effects. Bioinformatics 35, 3743–3751.
  25. Rix, U., Hantschel, O., Dürnberger, G., Remsing Rix, L. L., Planyavsky, M., Fernbach, N. V., Kaupe, I., Bennett, K. L., Valent, P. and Colinge, J. (2007). Chemical proteomic profiles of the BCR-ABL inhibitors imatinib, nilotinib, and dasatinib reveal novel kinase and nonkinase targets. Blood, The Journal of the American Society of Hematology 110, 4055–4063.
  26. Rodriguez-Barrueco, R., Yu, J., Saucedo-Cuevas, L. P., Olivan, M., Llobet-Navas, D., Putcha, P., Castro, V., Murga-Penas, E. M., Collazo-Lorduy, A. and Castillo-Martin, M. (2015). Inhibition of the autocrine IL-6–JAK2–STAT3–calprotectin axis as targeted therapy for HR-/HER2+ breast cancers. Genes & Development 25, 1631–1648.
  27. Safikhani, Z., El-Hachem, N., Quevedo, R., Smirnov, P., Goldenberg, A., Birkbak, N. J., Mason, C., Hatzis, C., Shi, L. and Aerts, H. J. W. L. (2016). Assessment of pharmacogenomic agreement. F1000Research 5, 825–835.
  28. Shangary, S. and Wang, S. (2008). Targeting the mdm2-p53 interaction for cancer therapy. Clinical Cancer Research 14, 5318–5324.
  29. Tibshirani, R. J. (2014). Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics 42, 285–323.
  30. Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 26–31.
  31. Toshiyuki, M. and Reed, J. C. (1995). Tumor suppressor p53 is a direct transcriptional activator of the human bax gene. Cell 80, 293–299.
  32. Vis, D. J., Bombardelli, L., Lightfoot, H., Iorio, F., Garnett, M. J. and Wessels, L. F. A. (2016). Multilevel models improve precision and speed of IC50 estimates. Pharmacogenomics 17, 691–700.
  33. Wang, Y.-X., Smola, A. and Tibshirani, R. (2014). The falling factorial basis and its statistical applications. In: International Conference on Machine Learning. pp. 730–738.
  34. Wheeler, M. W. (2019). Bayesian additive adaptive basis tensor product models for modeling high dimensional surfaces: an application to high-throughput toxicity testing. Biometrics 75, 193–201.
  35. Wilson, A., Reif, D. M. and Reich, B. J. (2014). Hierarchical dose–response modeling for high-throughput toxicity screening of environmental chemicals. Biometrics 70, 237–246.
  36. Yang, W., Soares, J., Greninger, P., Edelman, E. J., Lightfoot, H., Forbes, S., Bindal, N., Beare, D., Smith, J. A. and Thompson, I. R. (2012). Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Research 41, D955–D961.
  37. Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 301–320.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
