PLOS Biology. 2021 Jun 1;19(6):e3000797. doi: 10.1371/journal.pbio.3000797

An in vitro model of tumor heterogeneity resolves genetic, epigenetic, and stochastic sources of cell state variability

Corey E Hayford 1, Darren R Tyson 2, C Jack Robbins III 2,¤a, Peter L Frick 1,¤b, Vito Quaranta 2,3, Leonard A Harris 4,5,6,*
Editor: Mark L Siegal 7
PMCID: PMC8195356  PMID: 34061819

Abstract

Tumor heterogeneity is a primary cause of treatment failure and acquired resistance in cancer patients. Even in cancers driven by a single mutated oncogene, variability in response to targeted therapies is well known. The existence of additional genomic alterations among tumor cells can only partially explain this variability. As such, nongenetic factors are increasingly seen as critical contributors to tumor relapse and acquired resistance in cancer. Here, we show that both genetic and nongenetic factors contribute to targeted drug response variability in an experimental model of tumor heterogeneity. We observe significant variability to epidermal growth factor receptor (EGFR) inhibition among and within multiple versions and clonal sublines of PC9, a commonly used EGFR mutant nonsmall cell lung cancer (NSCLC) cell line. We resolve genetic, epigenetic, and stochastic components of this variability using a theoretical framework in which distinct genetic states give rise to multiple epigenetic “basins of attraction,” across which cells can transition driven by stochastic noise. Using mutational impact analysis, single-cell differential gene expression, and correlations among Gene Ontology (GO) terms to connect genomics to transcriptomics, we establish a baseline for genetic differences driving drug response variability among PC9 cell line versions. Applying the same approach to clonal sublines, we conclude that drug response variability in all but one of the sublines is due to epigenetic differences; in the other, it is due to genetic alterations. Finally, using a clonal drug response assay together with stochastic simulations, we attribute subclonal drug response variability within sublines to stochastic cell fate decisions and confirm that one subline likely contains genetic resistance mutations that emerged in the absence of drug treatment.


Tumor heterogeneity is a primary cause of treatment failure and acquired resistance in cancer patients, but the existence of genomic differences among tumor cells can only partially explain this variability. This study distinguishes genetic, epigenetic, and stochastic sources of heterogeneity in an in vitro tumor model, using a combination of whole-exome sequencing, single-cell RNA sequencing, drug-response profiling, and stochastic simulations.

Introduction

Cancer is a complex and dynamic disease characterized by intertumoral and intratumoral heterogeneities that have been implicated in treatment avoidance and acquired resistance to therapy [1,2]. Genetic differences among cancer cells within and across tumors have long been appreciated [3–8]. Indeed, genomic instability is a hallmark of cancer [9,10] and is considered to be the primary source of this genetic heterogeneity [11,12]. However, it is becoming increasingly apparent that genetics alone cannot fully explain the wide ranges of responses observed in patient populations to anticancer therapies [13,14]. Epidermal growth factor receptor (EGFR) inhibitors, for instance, are not equally effective across EGFR mutant lung cancer patients, and in almost all cases, tumors eventually acquire resistance and recur [15,16]. Researchers are, therefore, increasingly looking to nongenetic sources of tumor heterogeneity for explanations. These include factors such as cell type of origin, microenvironmental differences between primary and metastatic sites, spatial variations in the microenvironment of an individual tumor, cell plasticity, cell–cell interactions, probabilistic cell fate decisions, and noise in gene expression [17]. Broadly speaking, nongenetic heterogeneity comes in 2 forms [13,18–22]: epigenetic, which is heritable [18,20] (for at least a few generations), and stochastic, which is not heritable and arises due to intrinsic or extrinsic factors, such as gene expression noise [23–27], asymmetric cell division [28,29], or environmental fluctuations. Nongenetic heterogeneity has been linked to drug tolerance and decreased drug sensitivity in vitro [2,30–33], in vivo [30,31,34], and clinically [35,36].

A theoretical concept that connects genetic and nongenetic sources of tumor heterogeneity is the “epigenetic landscape,” which was proposed by Waddington over 50 years ago [37] and has received renewed attention recently [19,20,38,39] (see Table A in S1 Text for definitions of terms useful for the following discussion). In analogy to the potential energy landscape of physical chemistry [40], Waddington posited that the state of a cell can be assigned a “quasi-potential energy” and placed within a landscape where basins (local minima) correspond to cellular phenotypes. Phenotypic state transitions occur when cells traverse the barriers separating adjacent basins, driven by intrinsic (e.g., gene expression) or extrinsic sources of noise [41,42]. At a fundamental level, the state of a cell arises from the complex set of biochemical interactions within (and possibly across [17,43,44]) cells that drive cellular behavior [45,46]. The rates at which these interactions occur depend strongly on protein structure (e.g., the accessibility of a binding domain), which can be changed by so-called “activating” mutations [47]. Thus, one can think of an epigenetic landscape as deriving from a given genetic state and genetic mutations as altering that landscape [19,39] (Fig 1; see S1 Text for further discussion). Typical tumors comprise numerous genetic states [6,48] and are thus expected to harbor numerous overlapping epigenetic landscapes, each of which is subject to noise-induced phenotypic transitions. Changes in environmental factors and treatment with pharmacological agents can also alter these landscapes [33]. Note that the epigenetic landscape was originally devised as a way to explain cellular differentiation during development. In fact, a developmental hierarchy is a special type of epigenetic landscape, where successive basins decrease in quasi-potential energy, leading cells to descend from one state to the next. In cancer, rather than a well-defined hierarchy, multiple basins of comparable depth coexist. A population of cancer cells is thus expected to spread out across these basins, resulting in a highly heterogeneous population [49]. This may confer a survival benefit to the population in the face of future stressors, such as drug treatments [33,50].
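The notion of noise-driven transitions between basins can be illustrated with a toy simulation. The sketch below is not the model used in this study: it assumes a one-dimensional double-well quasi-potential, U(x) = x^4/4 - x^2/2, with basins at x = -1 and x = +1, and integrates overdamped Langevin dynamics with the Euler-Maruyama scheme. The potential, the function names, and all parameter values are hypothetical choices made only for illustration.

```python
import math
import random


def grad_U(x):
    """Gradient of the assumed double-well quasi-potential U(x) = x**4/4 - x**2/2."""
    return x**3 - x


def count_basin_switches(noise, n_steps=50_000, dt=0.01, x0=-1.0, seed=0):
    """Euler-Maruyama integration of dx = -U'(x) dt + noise * dW.

    Returns how many times the state moves from one basin to the other.
    """
    rng = random.Random(seed)
    x = x0
    basin = -1  # the trajectory starts in the left basin
    switches = 0
    for _ in range(n_steps):
        x += -grad_U(x) * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Register a switch only once the state is well inside the other basin,
        # so that jitter around the barrier top (x = 0) is not counted repeatedly.
        if basin == -1 and x > 0.5:
            basin, switches = 1, switches + 1
        elif basin == 1 and x < -0.5:
            basin, switches = -1, switches + 1
    return switches
```

With no noise the state stays at the bottom of its starting basin; as the noise amplitude approaches the scale set by the barrier height, switches between basins become frequent, which is the qualitative behavior described above for phenotypic state transitions.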

Fig 1. Multiple levels of heterogeneity are believed to operate within tumors.

(left) The genetic “axis” defines accumulating mutational differences that have an effect on phenotype, e.g., drug sensitivity. Although depicted linearly for simplicity, note that mutational accumulation is often nonlinear, occurring instead in a branching manner. (middle) Each genetic clone has an associated epigenetic landscape, where cells are distributed across basins, known as “attractors.” The topography of the epigenetic landscape is defined by the dynamical biochemical network that controls cell fate and function. Quasi-potential energy, U(x), lies along the y-axis and quantifies the relative stabilities of basins; molecular state lies along the x-axis and refers to position within the high-dimensional molecular state space on which the quasi-potential energy is defined (see S1 Text for further discussion). Note that positions of basins across genetic states will not necessarily align (since mutations alter the epigenetic landscape) and they are purposely not aligned in the illustration. (right) Cell states fluctuate within epigenetic basins due to intrinsic (e.g., gene expression) and extrinsic sources of noise. Most fluctuations are minor and do not significantly change the cell state but occasionally a large fluctuation results in a barrier crossing, i.e., a phenotypic state change.

Here, we utilize this three-tiered view of tumor heterogeneity (Fig 1) to resolve genetic, epigenetic, and stochastic sources of variability within a family of cell line “versions” and clonal sublines of PC9, a commonly used nonsmall cell lung cancer (NSCLC) cell line. Based on the individual lineages within the family and the conditions under which each member was derived, we expect this collection of cell lines and sublines to effectively mimic the composition of a genetically and epigenetically heterogeneous tumor. We perform extensive drug response profiling of each family member, followed by genomic and transcriptomic characterization and mathematical population dynamics modeling. Using bulk genomic and single-cell RNA sequencing (scRNA-seq), we verify that the cell line versions are genetically distinct and establish quantitative benchmarks for this distinctiveness at the genomic and transcriptomic levels. Comparing genomic and transcriptomic differences across the sublines against these benchmarks, we argue that all but one of the sublines are genetically indistinct but differ epigenetically, i.e., they correspond to basins within a common epigenetic landscape. This conclusion is further supported by stochastic simulations, which show that in all but one case, variability seen among isolated colonies of a subline in a clonal drug response assay can be explained by intrinsic randomness in cell division and death. We also detail one case where our analyses suggest that a subline harbors a distinct genetic state that appears to have emerged in the absence of selective drug pressure.

Results

Cell line versions and single cell-derived sublines exhibit drug response variability at the cell population level

We chose the commonly used NSCLC cell line PC9 [51] as a model system for tumor heterogeneity. The PC9 cell line is characterized by an EGFR-ex19del mutation (S1A Fig), making it sensitive to inhibition of the mutant EGFR protein. We utilize 3 versions of the cell line: PC9-VU, originating from Vanderbilt University [52]; PC9-MGH, maintained at Massachusetts General Hospital [53,54]; and PC9-BR1, derived from PC9-VU and containing a known secondary resistance mutation (EGFR-T790M) obtained through dose escalation therapy in the EGFR inhibitor (EGFRi) afatinib [52]. Although it is unclear when the PC9-VU and PC9-MGH versions (originating from a common founder line) were independently established (S1B Fig), both maintain the oncogenic mutation in the EGFR gene and display sensitivity to EGFR inhibition [54,55]. In the absence of drug, PC9-VU and PC9-BR1 have essentially identical proliferation rates, while PC9-MGH grows at a slightly lower rate (Fig 2A). However, in response to the EGFRi erlotinib, the 3 cell line versions display drastically different drug sensitivities (Fig 2A): PC9-MGH exhibits substantial cell death after an initial equilibration phase (approximately 72 h), PC9-VU settles into a near-zero rate of growth, and PC9-BR1 displays insensitivity to EGFRi (as expected). These observations are consistent with the high sensitivity of PC9-MGH to erlotinib reported in Sharma and colleagues [54] and the lower sensitivity of PC9-VU that we reported previously [55,56].

Fig 2. Phenotypic differences among PC9 cell line versions and discrete sublines quantified in terms of drug response.

(A) Population growth curves for 3 cell line versions treated with 3 μM erlotinib for approximately 3 weeks, plus vehicle (DMSO) control. (B) DIP rate distributions compiled from single-colony growth trajectories under erlotinib treatment (3 μM) in a cFP assay. DIP rates are calculated from the growth curves 48 h postdrug addition to the end of the experiment. Dashed black lines signify zero DIP rate, for visual orientation. (C) Seven DS derived from PC9-VU were treated with 3 μM erlotinib for 3 weeks, along with vehicle control. Parental PC9-VU is included for reference. (D) DIP rate distributions from a cFP assay of the sublines in 3 μM erlotinib. Parental PC9-VU is included for reference. In A and C, dots are the means of 6 experimental replicates at each time point; solid lines are best fits to the drug response trajectories with point-wise 95% confidence intervals. In B and D, DIP rate distributions are plotted as kernel density estimates. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. cFP, clonal fractional proliferation; DIP, drug-induced proliferation; DS, discrete subline.

We also quantified clonal drug response variability within the cell line versions using clonal fractional proliferation (cFP) [57], an assay that tracks the growth of many single cell-derived colonies over time and quantifies drug sensitivity for each colony in terms of the drug-induced proliferation (DIP) rate [55,56], defined as the stable rate of proliferation achieved after extended drug exposure (see Materials and methods). We performed cFP on each cell line version under erlotinib treatment and observed wide ranges of drug responses across colonies (S2 Fig) and substantial differences in the response distributions across versions (Fig 2B). The distribution of DIP rates for PC9-BR1 lies almost entirely in the positive DIP rate range and is clearly distinct from the others. The PC9-VU and PC9-MGH distributions have significant overlap but the PC9-MGH distribution has a marked shoulder in the negative DIP rate region, while the PC9-VU distribution extends further into the positive range. These distributions are consistent with and explain the differential drug responses observed among the cell line versions (Fig 2A): PC9-BR1 is resistant to EGFRi because its DIP rate distribution is entirely in the positive range, PC9-VU goes into a near-zero (slightly positive) growth phase because its DIP rate distribution is centered near zero, and the large shoulder in the PC9-MGH distribution explains why it exhibits significant cell death in the period immediately following drug treatment.
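As a rough illustration of how a DIP rate can be extracted from a single colony's growth trajectory, the sketch below fits a straight line to log2-transformed cell counts after an initial delay. It assumes the DIP rate is reported as the slope of log2(count) versus time (doublings per hour) and uses the 48 h delay mentioned in the Fig 2 legend; the function name `dip_rate` and the example data are hypothetical and do not come from the study's code.

```python
import math


def dip_rate(times_h, counts, delay_h=48.0):
    """Least-squares slope of log2(cell count) versus time (h).

    Only time points at or after delay_h (time since drug addition) are used,
    so that the initial equilibration phase does not affect the estimate.
    """
    pts = [(t, math.log2(c)) for t, c in zip(times_h, counts)
           if t >= delay_h and c > 0]
    if len(pts) < 2:
        raise ValueError("need at least two nonzero counts after the delay")
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in pts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in pts)
    return sxy / sxx


# Hypothetical colony: grows for 48 h, then regresses under drug.
times = [0, 24, 48, 72, 96, 120]
counts = [40, 52, 60, 55, 50, 46]
rate = dip_rate(times, counts)  # negative slope, i.e., a regressing colony
```

Under this convention a positive value indicates a colony that keeps expanding in drug, a value near zero indicates stasis, and a negative value indicates net cell loss.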

In addition, several single cell-derived discrete sublines (DS1, DS3, DS4, DS6, DS7, DS8, and DS9) were isolated from PC9-VU and subjected to the same analyses as above. In the absence of drug, all sublines grow at almost equal rates in culture (Fig 2C). However, in the presence of EGFRi, the sublines exhibit a wide range of responses, from positive to negative growth (Fig 2C). When overlaid with the cFP results for PC9-VU, the subline responses broadly recapitulate the observed variability seen in the parental line (S2C Fig). A notable exception is DS8, which is essentially resistant to EGFRi, having only a slightly lower proliferation rate than the fully resistant PC9-BR1 (cf. Fig 2A). We also performed cFP assays on the sublines under erlotinib treatment to quantify subclonal drug response variability (Fig 2D). Interestingly, similar to the cell line versions, we found that the sublines also exhibit distributions of DIP rates, albeit narrower than those for the cell line versions. The subline distributions have a large degree of overlap with one another, but the medians of the distributions are statistically distinct (p < 0.001, Mood’s median test). DS8 is again an exception, exhibiting a bimodal DIP rate distribution with a major mode centered close to zero and a large shoulder in the positive DIP rate range.

Cell line versions differ significantly at the genetic and transcriptomic levels

Given that the PC9-VU and PC9-MGH cell line versions have been maintained separately for many years, it is virtually certain that they differ genetically due to genetic drift [48]. We also know that PC9-BR1 contains a known genetic resistance mutation and likely numerous additional mutations acquired during dose escalation. Thus, we performed bulk whole exome sequencing (WES) and scRNA-seq on the cell line versions in order to establish benchmarks for genetic variation against which we can compare the sublines.

From WES, we identify mutations (single nucleotide polymorphisms (SNPs) and insertion/deletions (InDels)) in each cell line version that pass a specified threshold for variant detection (see Materials and methods and S3A and S3B Fig) and calculate the number of mutations per chromosome (Fig 3A). We see a large amount of variability in the number of called variants between the cell line versions (average coefficient of variation (CV) per chromosome = 12.84). We also identify mutations unique to each cell line version (S3C Fig) and calculate the proportion of unique mutations relative to the total number of mutations (Fig 3B). Although a majority of the mutations are shared among all 3 versions (approximately 106 shared sequence variants), confirming that they are related through a common founder (ancestor) population, a significant number are unique: PC9-BR1 has the largest proportional representation of unique mutations, followed by PC9-MGH and then PC9-VU. Furthermore, we annotate unique mutations within each cell line version with an IMPACT score, a variant severity classifier calculated by the Ensembl Variant Effect Predictor (VEP) [58]. The IMPACT score differentiates mutations based on a variety of factors that predict whether a mutation is likely to have a phenotypic effect (see Materials and methods). Categorizing mutations into “low,” “moderate,” and “high” IMPACT score reveals that PC9-BR1 has many more potentially impactful mutations than PC9-MGH and PC9-VU, which have similar numbers to each other (Fig 3C). However, as a percentage, only 1% of PC9-MGH unique mutations are predicted to be impactful, compared to 11% in PC9-VU, suggesting that PC9-MGH harbors a large number of passenger mutations.
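The two summary statistics used here can be sketched in a few lines. The example below assumes that the per-chromosome CV is the sample standard deviation across samples divided by the mean, expressed as a percentage, and that the proportion of unique mutations is the fraction of a sample's variants found in no other sample. Both definitions, the function names, and the input layout are assumptions made for illustration; the study's own pipeline may differ.

```python
from statistics import mean, stdev


def average_cv_percent(counts_by_sample):
    """Mean over chromosomes of 100 * sd / mean of the variant counts.

    counts_by_sample maps sample name -> {chromosome: variant count}.
    """
    chromosomes = sorted(
        set.intersection(*(set(c) for c in counts_by_sample.values()))
    )
    cvs = []
    for chrom in chromosomes:
        values = [counts[chrom] for counts in counts_by_sample.values()]
        cvs.append(100.0 * stdev(values) / mean(values))
    return mean(cvs)


def unique_fraction(variants_by_sample):
    """Fraction of each sample's variants that appear in no other sample.

    variants_by_sample maps sample name -> set of variant keys,
    e.g., (chromosome, position, ref, alt) tuples.
    """
    result = {}
    for name, variants in variants_by_sample.items():
        others = set().union(
            *(v for n, v in variants_by_sample.items() if n != name)
        )
        result[name] = len(variants - others) / len(variants)
    return result
```

Applied to per-chromosome variant counts for the 3 cell line versions, the first function would yield a single number comparable in form to the average CV quoted above, and the second would yield one proportion per version.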

Fig 3. Genomic and transcriptomic characterizations of PC9 cell line versions.

(A) Mean-centered mutation count by chromosome for all cell line versions. For each chromosome, versions with fewer mutations than the mean have a bar pointing inwards, while those with more mutations than the mean point outwards. Chromosome numbers are noted on the outside edge of the circle. Average CV across all chromosomes is noted. (B) Proportions of unique mutations for all cell line versions. (C) Numbers of IMPACT mutations unique to each cell line version, stratified by IMPACT classification (“low,” “moderate,” and “high”). Percentage of unique IMPACT mutations relative to the total number of unique mutations for each cell line version is denoted above each bar. (D) Quantification of mutational differences between cell line versions based on a signature of genes with a high nonsynonymous mutational load. Rows are ordered the same as in Fig 4D and numbered in increments of 10 for ease of reference (see Table B in S1 Text). Heatmap elements are colored based on type of mutation. Total numbers of mutations (stratified by mutation type) across genes and cell line versions are shown as bar plots to the right and above the heatmap, respectively. (E) Copy number variant detection for cell line versions. Red corresponds to amplifications, blue to deletions. The average signal across all cells in the cell line versions was used to define the baseline reference. (F) UMAP visualization of single-cell transcriptomes for cell line versions. For comparison purposes, the UMAP space is defined over all 8 PC9 samples (including cell line versions and PC9-VU sublines; see Materials and methods). (G) GO comparison analysis of unique IMPACT mutations and DEGs for cell line versions. A correlation coefficient (Pearson) was calculated for each sample. Terms with -log10(p) > 2 on either axis are displayed on the plots. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. 
CV, coefficient of variation; DEG, differentially expressed gene; GO, Gene Ontology; UMAP, Uniform Manifold Approximation and Projection.

We also perform a mutational significance analysis on the unique mutations. Based on nonsynonymous-to-synonymous mutational load, genes are selected to create a mutational signature of genetic differences within each cell line version (see Materials and methods) and displayed as a heatmap. This signature does not reflect all mutated genes in the cell line versions but rather those of predicted importance, similar to the VEP IMPACT score analysis. We see that many mutations in the signature distinguish the cell line versions (Fig 3D). Additionally, we generate a literature-curated, cancer-associated gene signature that includes mutations predicted to be implicated in cancer [2,59] (see Materials and methods). Only PC9-BR1, which has a known resistance mutation (EGFR-T790M, noted as a missense mutation in the heatmap), harbors significant mutational load in this gene signature (S4A Fig). Breakdowns of mutations by type provide additional evidence that PC9-VU mutation representation is distinct and proportionally impactful (S4B Fig).

In addition to traditional variant analyses, we perform copy number variant (CNV) detection at the single-cell level. By exploring the gene expression intensity across positions of the genome, CNVs are detected as gains (red) or deletions (blue) in large chromosomal regions, making it clear which regions of the genome have different relative abundances. Since a natural reference (i.e., a common ancestor) does not exist for the PC9 cell line versions, we establish the baseline as an average signal across cells in all 3 versions (as suggested by the tool instructions; see Materials and methods). Within these analytical limits, we identify several key regions that differentiate the cell line versions (Fig 3E): PC9-MGH has clear duplications at chromosomes 6, 11, and 22 and deletions at chromosomes 16 and 19, while PC9-BR1 has duplications in chromosomes 3 and 14 and deletions in chromosomes 9 and 22. This result fits with our expectation, as both PC9-MGH and PC9-BR1 have had extensive opportunity to acquire large-scale chromosomal changes while evolving in separate environments, and PC9-VU appears to be a closer genetic descendant of the PC9 founder cell line. Together, these genomic data and those presented above (Fig 3A–3D and S3 and S4 Figs) starkly illustrate the genetic differences among the cell line versions.

At the transcriptomic level, we used scRNA-seq to identify gene expression differences among the cell line versions (see Materials and methods and S5A Fig). After feature selection (S5B Fig), we use Uniform Manifold Approximation and Projection (UMAP) [60,61] to project the transcriptional states for each cell into two-dimensional space (Fig 3F and S6A Fig). We see a clear separation of cell line versions in the UMAP space, with minimal overlap (S6B Fig). Qualitatively, distances between the single-cell clusters suggest that PC9-VU and PC9-BR1 are more similar to each other than either is to PC9-MGH, which is unsurprising given that PC9-BR1 was derived from PC9-VU. These results are also supported by alternate dimensionality reduction methods (S6E and S6F Fig) and bulk RNA expression data (S7 Fig). Furthermore, to explore potential biological interpretability of these differences across cell line versions, we score each cell based on 50 hallmark gene signatures of well-defined biological states (S8 Fig). Many of these processes distinguish cell line versions (Table 1), such as IL2/STAT5 and KRAS signaling being overexpressed in PC9-MGH, while PI3K/AKT/mTOR signaling has elevated expression in PC9-VU and PC9-BR1. Interestingly, PC9-BR1 shows lower expression of the Hedgehog signaling signature than PC9-MGH and PC9-VU, which may be a side effect of its increased mutational burden.

Table 1. Simplified representation of hallmark gene signature scores across different members of the PC9 cell line family.

Single-cell transcriptomics data from all 8 PC9 cell line family members were subjected to a VISION functional interpretation analysis to calculate scores for single cells from 50 MSigDB hallmark gene signatures. Distributions of scores were calculated for each cell line family member-gene signature pair. Signature score distributions that show larger (↑) or smaller (↓) values compared to other cell line family members are noted. Cell line family members with the largest transcriptomic differences (PC9-MGH, PC9-BR1, DS8, and PC9-VU) were chosen for comparisons. “PC9-VU (no DS8)” includes the parental PC9-VU and DS3, DS6, DS7, and DS9 subline distributions, since they all have a large degree of transcriptomic overlap.

| PC9-MGH | PC9-BR1 | DS8 | PC9-VU (no DS8) |
| --- | --- | --- | --- |
| ↑ Angiogenesis | ↑ Cholesterol Homeostasis | ↑ Allograft Rejection | ↑ Hedgehog Signaling |
| ↑ Apical Surface | ↓ Hedgehog Signaling | ↑ Androgen Response | ↓ IL2/STAT5 Signaling |
| ↑ Bile Acid Metabolism | ↓ Xenobiotic Metabolism | ↑ Complement | ↓ WNT/β-catenin Signaling |
| ↑ IL2/STAT5 Signaling | | ↑ DNA Repair | |
| ↑ KRAS Signaling | | ↑ Interferon α/γ Response | |
| ↑ WNT/β-catenin Signaling | | ↑ Unfolded Protein Response | |
| ↓ Allograft Rejection | | ↓ P53 Pathway | |
| ↓ Coagulation | | | |
| ↓ Interferon α/γ Response | | | |
| ↓ Pancreas β Cells | | | |
| ↓ PI3K/AKT/MTOR Signaling | | | |

To quantify how predictive variations in the genomic states are of differential gene expression at the transcriptomic level, we compare Gene Ontology (GO) [62,63] terms associated with high consequence genetic sequence variants (“low,” “moderate,” and “high” IMPACT scores) and GO terms associated with significantly differentially expressed genes (DEGs; adjusted p < 0.05). We visualize these terms based on relative statistical significance (−log(p) for significant GO terms) and quantify the correlation (Spearman) between the genomics- and transcriptomics-derived terms for each cell line version (Fig 3G). Both PC9-BR1 and PC9-MGH have a positive correlation between terms, indicating that terms shared between data modalities tend to agree with each other. Obvious exceptions exist that are more statistically significant for one data modality (top-left and bottom-right corners of the plots), but both the existence of terms for both modalities and the moderate correlation indicate the connection. Notably, PC9-VU has a slightly negative correlation. We also use a semantic similarity metric [64] to compare the 2 sets of GO terms. For each version, we calculate pairwise similarity scores between GO terms for genetic variants and DEGs and obtain an aggregate score between 0 and 1, with 1 indicating that DEGs at the transcriptomic level can be perfectly explained by variations at the genomic level and vice versa (see Materials and methods and S1 Text for details). Relative to a randomized baseline, we see elevated semantic similarity scores for PC9-BR1 in the “Biological Process” (BP) GO category and for PC9-VU in the “Molecular Function” (MF) category (S9 Fig). We also see significant semantic similarity scores for all 3 versions relative to baseline in the “Cellular Component” (CC) category. Taken together, these results (Fig 3G and S9 Fig) indicate a strong connection between mutations in the genomes of the cell line versions and expression at the transcriptomic level. 
We can use these results as benchmarks for ascertaining whether transcriptomic differences seen among other samples (e.g., single cell-derived sublines; see next subsection) are rooted in genetic differences, like in the cell line versions, or are likely nongenetic in origin.
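The comparison of GO term significance across the two data modalities can be sketched as follows. The example assumes each modality yields a mapping from GO term to enrichment p-value, keeps only the terms present in both, and computes a rank (Spearman) correlation of the -log10(p) values as a Pearson correlation on average ranks. The function names, the minimum of 3 shared terms, and the input layout are hypothetical; the study's actual thresholds are described in its Materials and methods.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def go_term_correlation(p_mutation, p_deg, min_terms=3):
    """Spearman correlation of -log10(p) over GO terms shared by both inputs.

    Returns None when fewer than min_terms terms are shared (assumed cutoff).
    """
    shared = sorted(set(p_mutation) & set(p_deg))
    if len(shared) < min_terms:
        return None
    x = [-math.log10(p_mutation[t]) for t in shared]
    y = [-math.log10(p_deg[t]) for t in shared]
    return pearson(average_ranks(x), average_ranks(y))
```

A positive value indicates that terms strongly enriched among mutated genes also tend to be strongly enriched among DEGs, while returning None corresponds to the case in which too few shared terms exist to compute a meaningful correlation.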

One PC9-VU subline is genetically distinct, while all others are transcriptomically distinct from each other

We performed the same genomic and transcriptomic experiments as above on 5 PC9-VU sublines (DS3, DS6, DS7, DS8, and DS9) that exhibit differential responses to EGFRi as evidenced by their DIP rate distributions (see Fig 2D): DS3 has a peak in the negative DIP rate range, DS6 has a peak close to zero, DS7 and DS9 have peaks in the positive DIP rate range and nearly overlapping distributions, and DS8 stands out as an obvious outlier with a bimodal DIP rate distribution. Analysis of the total numbers of mutations by chromosome (Fig 4A) shows significantly less variability in the variant count for the sublines relative to the cell line versions (average CV per chromosome = 6.27 versus 12.84; cf. Fig 3A). Additionally, unlike the cell line versions, most sublines exhibit similar proportions of unique mutations (Fig 4B; S3D Fig for full overlap) and numbers of IMPACT mutations (Fig 4C). The clear exception is DS8, which has more unique mutations and more than twice the number of predicted impactful mutations compared to the other sublines. Mutational significance analysis (using the same genomic signature as for the cell line versions; see Fig 3D) shows similar numbers of total and impactful mutations in the sublines (Fig 4D), while DS8 has the largest and most diverse set of mutation types. This is true for the cancer-associated genes as well (S4A Fig). The sublines also exhibit a nearly identical mutation class distribution (S4B Fig). Interestingly, CNVs in the sublines show a slightly more nuanced result (Fig 4E; background PC9-VU in S10 Fig): DS8 has major amplifications (chromosomes 6, 17, and 22) and deletions (chromosomes 7, 14, and 22) and some minor alterations not shared with other sublines; DS3 is missing a deletion present in chromosome 7 in DS6, DS7, and DS9 and also has a unique deletion in chromosome 9; and minor additional sharing is seen among other sublines, such as DS8 and DS9 (deletions in chromosomes 13 and 16). 
On the whole, DS8 has the clearest cases of unique CNVs in the sublines, while the other sublines remain largely similar except for a few instances. Taken together, with the exception of DS8, these genomic data (Fig 4A–4E and S3 and S4 Figs) illustrate that there is significantly less genomic variability among the PC9-VU sublines than among the cell line versions. It is also important to note that DS8 does not harbor the same resistance-conferring mutation (EGFR-T790M) that PC9-BR1 does (S4 Fig), indicating that a different (unknown) resistance mechanism is at play (see S1 Text for further discussion).

Fig 4. Genomic and transcriptomic characterizations of PC9-VU discrete sublines.

(A) Mean-centered mutation count by chromosome for 5 (of the 7) sublines. Average CV across all chromosomes is noted. (B) Proportions of unique mutations in each subline. (C) Numbers of IMPACT mutations unique to each subline, stratified by IMPACT classification (“low,” “moderate,” and “high”). Percentage of unique IMPACT mutations relative to the total number of unique mutations in each subline is denoted above each bar. (D) Quantification of mutational differences between sublines based on the same gene signature as for the cell line versions (Fig 3D). Rows are ordered the same as in Fig 3D and numbered in increments of 10 for ease of reference (see Table B in S1 Text). Heatmap elements are colored based on type of mutation. (E) CNV detection for sublines. Red corresponds to amplifications, blue to deletions. PC9-VU was used as the baseline reference to compare sublines. (F) UMAP visualization of single-cell transcriptomes for sublines plotted in the common UMAP space of all 8 PC9 samples (including cell line versions and PC9-VU sublines; see Materials and methods). (G) GO comparison analysis of unique IMPACT mutations and DEGs for sublines. A correlation coefficient (Pearson) was calculated for each sample. Terms with -log10(p) > 2 on either axis are displayed on the plots. Samples without a correlation line or coefficient (DS3, DS6, and DS7) did not meet the minimum threshold of data points to compute a correlation. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. CNV, copy number variant; CV, coefficient of variation; DEG, differentially expressed gene; GO, Gene Ontology; UMAP, Uniform Manifold Approximation and Projection.

Comparing single-cell transcriptomes (Fig 4F), we see distinctions among the sublines but to a much lesser extent than among the cell line versions, except for DS8, which, as in the genomics data, is a clear exception (S7C and S7D Fig). Qualitative distances between single-cell clusters show virtually no separation between DS7 and DS9, small but clear separations between DS3, DS6, and DS7/DS9, and a large separation between DS8 and the other sublines. We also see that DS3, DS6, DS7, and DS9 substantially overlap with the PC9-VU region of the UMAP space (Fig 4F). These observations are largely consistent with the clonal drug responses observed in the cFP assays (i.e., the DS7 and DS9 distributions are almost identical; the DS3 and DS6 distributions are distinct from each other and from DS7/DS9; the DS8 distribution stands apart from the others in being bimodal with a large shoulder extending beyond the upper range of the PC9-VU parental distribution; and the DS3, DS6, DS7, and DS9 distributions overlap substantially with the PC9-VU distribution; see Fig 2D). However, despite the clear separations of the clusters, we do see slight overlaps at the boundaries between the transcriptomic features for DS3 and DS7/DS9 and between DS6 and DS7/DS9, suggesting the possibility of phenotypic transitions occurring between these states. We also see a very small number of DS9 cells (<2%) that overlap with DS8. This is an interesting observation, which could have multiple possible explanations (see S1 Text and S11 Fig for further discussion). In terms of biological interpretability, DS8 has a larger hallmark gene signature score than the other sublines (and cell line versions) for DNA repair, unfolded protein response, and androgen response, while it has a slightly lower score in the p53 pathway (Table 1 and S8 Fig). There are no clear cases where other sublines had a significantly larger hallmark gene signature score.

Statistical comparisons between GO terms associated with high consequence genetic variants and DEGs support a connection between genomics and transcriptomics in DS8 but not in the other sublines (Fig 4G). Although DS3, DS6, and DS7 have a few terms significantly enriched and shared between data modalities, there were not enough data points to compute a correlation. DS9 is an interesting case, where many terms were similar between the modalities but showed a negative correlation (more so than PC9-VU; cf. Fig 3G). In terms of semantic similarity, we see mixed results across sublines (S9 Fig): DS8 has high scores for both BP and MF GO categories but not for CC; DS3, DS6, and DS7 have low scores for BP and MF; DS6 also has a low score for CC but DS3 and DS7 have high scores; and DS9 has a high score for the BP category but low scores for the others. Note that based on the number of GO terms in each category (BP: 12,272, MF: 4,165, and CC: 1,740), we consider BP to be the most predictive of the three, followed by MF and then CC (see Materials and methods). Taken together, these results suggest a strong connection between genomics and transcriptomics in DS8 but weaker or nonexistent connections in the other sublines.
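The GO comparison described above can be illustrated with a minimal Python sketch. It assumes that each data modality yields a mapping from GO term to enrichment p-value, that only terms present in both modalities are compared, and that a term is retained when -log10(p) > 2 on either axis. The function name, the dictionary layout, and the `min_points` cutoff are hypothetical; the minimum number of data points actually used in the analysis is not stated here.

```python
import numpy as np

def go_correlation(p_mut, p_deg, min_neglog10=2.0, min_points=3):
    """Pearson correlation of -log10(p) for GO terms shared by two modalities.

    p_mut, p_deg: dicts mapping GO term -> enrichment p-value (assumed layout).
    min_points is a hypothetical cutoff below which no correlation is reported.
    """
    shared = sorted(set(p_mut) & set(p_deg))
    x = np.array([-np.log10(p_mut[t]) for t in shared])
    y = np.array([-np.log10(p_deg[t]) for t in shared])
    # Keep terms that pass the display threshold on either axis
    keep = (x > min_neglog10) | (y > min_neglog10)
    if keep.sum() < min_points:
        return None  # too few terms to compute a correlation
    return float(np.corrcoef(x[keep], y[keep])[0, 1])
```

Returning `None` rather than a coefficient mirrors the situation reported for DS3, DS6, and DS7, where too few terms were available.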

Stochastic birth-death simulations suggest most PC9-VU sublines are epigenetically monoclonal, while one is polyclonal

The PC9-VU sublines exhibit variability in drug response not just at the population level (Fig 2C) but also at the subclonal level, as evidenced by variable colony growth in cFP assays (Fig 5A and 5E and S12A Fig) and quantified as distributions of DIP rates (Fig 2D). To explore the origin of this subclonal variability, we performed stochastic simulations [65] on a simple birth-death model of cell proliferation to ascertain whether intrinsic noise in division/death decisions alone is sufficient to explain experimental observations (see Materials and methods). We performed a battery of in silico cFP assays, where untreated single cells grow into colonies of variable size at a fixed proliferation rate (division rate constant minus death rate constant) and are then treated with drug, modeled by reducing the proliferation rate. Colony sizes are tracked over time (Fig 5B and S12B Fig) and DIP rate distributions are calculated and statistically compared against experimental distributions (Fig 5C and S12C Fig). We repeated this procedure for a wide range of division and death rate constants to identify ranges of parameter values that can statistically reproduce experimental DIP rate distributions (p > 0.05, bootstrapped Anderson–Darling (AD) test). For all sublines (except DS8, see next paragraph), we find ranges of parameter values that are physiologically plausible (Fig 5D and S12D Fig). This result is consistent with the view that these sublines (DS1, DS3, DS4, DS6, DS7, and DS9) are monoclonal, i.e., experimental DIP rate distributions can be reproduced with a birth-death model comprising a single cell state (one division and one death rate constant) simulated stochastically.
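The in silico cFP procedure described above can be sketched in a few lines of Python. This is an illustrative Gillespie-type simulation of a one-state birth-death model, not the simulation code used in this work. The function names, the rate constants, the growth period, and the use of a log2 fit for the DIP rate are assumptions made for the example, and the statistical comparison against experimental distributions (AD test) is omitted.

```python
import numpy as np

def simulate_counts(n0, k_div, k_death, sample_times, rng):
    """Stochastic one-state birth-death model; returns counts at sample_times."""
    n, t, total = n0, 0.0, k_div + k_death
    counts = []
    for t_sample in sample_times:
        while n > 0:
            # Waiting time to the next division or death event
            dt = rng.exponential(1.0 / (n * total))
            if t + dt > t_sample:
                break  # exponential waiting times are memoryless, so redraw later
            t += dt
            n += 1 if rng.random() < k_div / total else -1
        t = t_sample
        counts.append(n)
    return np.array(counts)

def dip_rate(times, counts, t_start=48.0):
    """Slope of log2(count) versus time from t_start onward (assumed definition)."""
    mask = (times >= t_start) & (counts > 0)
    if mask.sum() < 2:
        return np.nan
    slope, _ = np.polyfit(times[mask], np.log2(counts[mask]), 1)
    return slope

def in_silico_cfp(n_colonies, pre, post, t_grow, sample_times, min_cells=50, seed=0):
    """Grow colonies from single cells, apply 'drug' by switching rate constants."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_colonies):
        n_treat = simulate_counts(1, pre[0], pre[1], [t_grow], rng)[-1]
        if n_treat < min_cells:
            continue  # same size cutoff as applied to experimental colonies
        counts = simulate_counts(n_treat, post[0], post[1], sample_times, rng)
        rates.append(dip_rate(sample_times, counts))
    return np.array(rates)

# Hypothetical (division, death) rate constants per hour, for illustration only
rates = in_silico_cfp(n_colonies=30, pre=(0.040, 0.005), post=(0.030, 0.028),
                      t_grow=150.0, sample_times=np.arange(0.0, 169.0, 24.0))
print(len(rates), rates.mean())
```

Repeating the last call over a grid of division and death rate constants, and comparing each simulated distribution with the experimental one, would correspond to the parameter scan described in the text.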

Fig 5. Stochastic simulations of a simple birth-death model reproduce DIP rate distributions from PC9-VU sublines.

(A) Experimental cFP time courses for 2 representative sublines (DS3 and DS4) in response to 3 μM erlotinib (same data used to generate the corresponding DIP rate distributions in Fig 2D). Each trace corresponds to a single colony, normalized to 72 h postdrug treatment. Only colonies with initial cell counts greater than 50 at the time of treatment are shown. (B) In silico cFP time courses from a one-state model with division and death rate constants that closely reproduce the experimental time courses in A. Trajectories are normalized to the time at which the simulated drug treatment was initiated, simulated cell counts are plotted only at experimental time points, and only colonies with initial cell counts greater than 50 at the time of simulated drug treatment are shown. (C) Comparison of experimental and simulated DIP rate distributions from time courses in A and B. Distributions are compared statistically using the AD test [66]. Dashed black line signifies zero DIP rate, for visual orientation. (D) Parameter scan of division and death rate constants for the 2 sublines in A–C. For each pair of rate constants, the same number of model simulations were run as the associated experimental cFP time courses in A. DIP rates were calculated and compiled into a distribution and then statistically compared against the corresponding experimental DIP rate distribution using the AD test. All parameter pairs with p < 0.05 (see Materials and methods) are colored white, indicating lack of statistical correspondence to experiment. “×” denotes a division and death rate constant pair used in B. (E) Same as A but for subline DS8. (F) Same as B but for DS8 using a two-state model. (G) Same as C but for DS8. (H) Same as D but for DS8 using a four-dimensional (2 division-death rate constant pairs) parameter scan and projected into 2 dimensions. “×” denotes parameter values used to generate the simulated DIP rate distribution in F. 
The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. AD, Anderson–Darling; cFP, clonal fractional proliferation; DIP, drug-induced proliferation.

In contrast, DS8 is an exception once again, displaying greater variability than the other sublines in cFP colony growth rates (Fig 5E) and a bimodal DIP rate distribution (Fig 2D). We performed stochastic simulations on an expanded version of the birth-death model comprising 2 cell states with distinct division and death rate constants (see Materials and methods and S1 Text). As with the other sublines, we performed in silico cFP assays (Fig 5F) and compared the simulated DIP rate distributions against the bimodal distribution seen experimentally (Fig 5G). We again find physiologically feasible ranges of rate parameter values that can statistically reproduce the experimental result (Fig 5H). Thus, these results provide strong evidence that DS8 harbors at least 2 distinct cell states. It is important to note that the mathematical model is agnostic as to whether these 2 states are genetically or epigenetically distinct; it simply states that they have distinct division and death rate constants.

Discussion

Many modern cancer therapies focus on targeting specific genetic mutations within a tumor. Recent studies have shown that a complex interplay between genetic and nongenetic factors likely plays a key role in the failure of targeted treatments [8,53]. Here, we investigated genetic and nongenetic sources of variability in an in vitro tumor heterogeneity model comprising multiple versions (VU, MGH, and BR1) and single cell-derived sublines of the NSCLC cell line PC9 that exhibit a wide range of different responses to EGFR inhibition (Fig 2). Given their histories and how each was derived, we had good reason to believe that the cell line versions were genetically distinct. This was validated using WES and CNV detection, which showed significant mutational differences among them (Fig 3A–3E). Distinct transcriptomic features were also identified by scRNA-seq (Fig 3F), and connections to the underlying genetic states were established by a comparison between GO terms enriched in each data modality (Fig 3G and S9 Fig). We then isolated 7 sublines from PC9-VU that exhibited differential responses to EGFR inhibition (Fig 2C). Clonal drug response assays (Fig 2D) and scRNA-seq analysis (Fig 4F) showed significant overlap with the PC9-VU parental cell line. WES and CNV detection revealed substantially less genomic variability among 6 of the 7 sublines relative to the cell line versions (Fig 4A–4E) and GO similarity analysis indicated a weak connection, if any, between genomic and transcriptomic states in these sublines (Fig 4G and S9 Fig). For the other subline, DS8, the results were dramatically different: DS8 harbors significantly more unique and IMPACT mutations than the other sublines (Fig 4B–4D), there is clearer evidence for copy number variation (Fig 4E), its single-cell transcriptomic state is substantially distinct from the other sublines (Fig 4F), and it displays a much stronger connection between genomic and transcriptomic states (Fig 4G and S9 Fig). 
Finally, stochastic simulations revealed that colony growth dynamics for 6 of the 7 sublines can be explained as a population with a single cell state experiencing probabilistic division/death decisions (Fig 5A–5D). For DS8, a second cell state had to be included in the model to reproduce the bimodal DIP rate distribution observed experimentally (Fig 5E–5H).

In order to interpret our results, we utilize the theoretical framework for tumor heterogeneity discussed previously [19,20,37–39] (Fig 1). As explained, in this view of tumor heterogeneity, tumors may comprise multiple genetic states, each of which has an associated epigenetic landscape with ≥1 quasi-potential energy basins corresponding to phenotypic states, across which cells can transition driven by intrinsic (e.g., gene expression) or extrinsic noise sources. Within our in vitro tumor model, the PC9 cell line versions (VU, MGH, and BR1) correspond to the different genetic states. We assert that 4 of the PC9-VU sublines (DS3, DS6, DS7, and DS9), based on their genomic similarity (Fig 4A–4E), transcriptomic distinctiveness (Fig 4F), weak genetic-to-transcriptomic correspondence (Fig 4G), and monoclonality (Fig 5A–5C), likely correspond to basins within the epigenetic landscape associated with the PC9-VU genetic state. In contrast, DS8 appears to harbor a distinct genetic state that emerged out of PC9-VU at some point in the past in the absence of selective drug pressure. We come to this conclusion based on its resistance to EGFRi (Fig 2C), genomic (Fig 4A–4E) and transcriptomic (Fig 4F) distinctiveness from the other sublines and all 3 cell line versions, strong genomic-to-transcriptomic correspondence (Fig 4G), apparent polyclonality (Fig 5E–5H), and lack of the resistance-conferring mutation (EGFR-T790M) found in PC9-BR1 (S4A Fig). Note that it remains an open question as to whether this new genetic state emerged prior or subsequent to establishment of the DS8 subline (see S1 Text and S11 Fig for further discussion) and what the mechanistic basis for its drug resistance is. These are important questions and areas of future investigation. We summarize our conclusions in a schematic illustrating the different sources of cell state variability we hypothesize are operating within the PC9 family of cell lines and sublines (Fig 6).

Fig 6. Schematic interpretation of the results of our analyses on the PC9 cell line family.

Cell line versions PC9-MGH (light green) and PC9-VU (blue) represent different genetic clones within our in vitro tumor model that emerged from a common PC9 founder clone (gray). Each genetic clone has an associated epigenetic landscape, including PC9-VU, which has at least 3 distinct basins corresponding to the sublines DS3, DS7/DS9, and DS6. Cell line version PC9-BR1 (red) was derived from PC9-VU via drug-induced genetic clonal selection in EGFRi, while DS8 (dark green) independently acquired a genetic resistance mutation in the absence of selective drug pressure. However, it remains unclear from which basin DS8 arose (signified by question marks) and whether the resistance mutation was acquired before or after the subline was established (see S1 Text and S11 Fig). Note that PC9-VU is depicted as lying closer in the phylogenetic tree to the PC9 founder line than PC9-MGH because PC9-VU appears to be a closer genetic descendant of the founder, based on our genomic data. Likewise, DS8, which is believed to have emerged only recently, is depicted closer than PC9-BR1 to the common ancestor PC9-VU. EGFRi, EGFR inhibitor.

This view of tumor heterogeneity, as a three-tiered amalgamation of genetic, epigenetic, and stochastic factors, is not yet broadly accepted within the cancer research community [39], largely because of a lack of strong experimental evidence to support it. A primary goal of this work has been to provide such evidence. However, we believe that numerous reports in the literature are also consistent with this view. For example, Ben-David and colleagues [48] showed that numerous “strains” (comparable to our cell line versions) of human cancer cell lines, obtained from different institutions, display extensive genetic heterogeneity. Moreover, genetically similar strains exhibited similar transcriptomic signatures and drug response profiles. Thus, they argued that cancer cell lines can drift genetically when kept in culture independently, consistent with our results for the PC9 cell line versions (Fig 3). Our conclusion that the drug-resistant DS8 genetic state emerged spontaneously from PC9-VU in the absence of drug (Fig 6) aligns with observations by Ramirez and colleagues [2] and Hata and colleagues [31], who independently reported diverse resistance-conferring mutations arising in both untreated and drug-treated PC9-MGH clones. Shaffer and colleagues [67] described a transient, transcriptionally encoded preresistance state in 2 BRAF mutant melanoma cell lines that cells can transition into and out of in the absence of drug. We hypothesize that this preresistant state may constitute a basin within a BRAF mutant melanoma epigenetic landscape, similar to our single cell-derived sublines (DS3, DS6, DS7, and DS9) that we contend occupy the PC9-VU epigenetic landscape. The veracity of this hypothesis depends on how long cells reside in the preresistant state (its stability) and, correspondingly, whether the state is heritable by progeny cells. These are interesting questions, about which there remains much debate [68,69], and potential avenues of future investigation.

It is now abundantly clear that a focus on tumor genetics alone cannot solve the complex problems of tumor progression, metastasis, and treatment failure that continue to plague clinical oncology [70,71]. The view of tumor heterogeneity advocated in this work offers an alternative to the traditional gene-centric view and may transform how we understand and treat the disease. For example, that each genetic state has an associated epigenetic landscape with potentially numerous accessible phenotypic states may explain why targeted drug treatments eventually fail in almost all cases [72]: Some of these phenotypes may have molecular compositions that enable their survival under drug treatment. Cells preexisting in these states (e.g., the preresistance state of Shaffer and colleagues [67]), and those that escape into them upon drug addition, may act as a refuge from which genetic resistance mutations can arise [33,54]. Alternative treatments based on targeting cancer stem cells (CSCs) [73] have also been proposed but have so far proven unsuccessful [74]. This could be because CSCs correspond to shallow basins within an epigenetic landscape; killing cells in this basin does not eradicate the basin, hence leaving it available to be repopulated by cells from “adjacent” basins. It has been suggested that a better approach, termed “targeted landscaping” [33,50], is to use drugs in combination or in sequence to alter the topography of a landscape to favor drug-sensitive states over drug-tolerant states [75]. The feasibility of such an approach is supported by multiple studies showing that resistance to one drug can confer sensitivity to another, known as “collateral sensitivity” [76–78], and that sequential drug applications can often be more effective than up-front drug combination treatments [79,80]. 
Looking forward, one can envision future cancer treatment regimens involving genetic profiling of a tumor to identify dominant genetic states, followed by characterization of the associated epigenetic landscapes using single-cell experimentation and computational modeling [46,81–89]. Large-scale in vitro and in silico drug screens could then be performed to devise personalized treatments for patients that can be tested in vivo before being administered clinically. By leveraging state-of-the-art technologies and currently available drugs to tackle tumor heterogeneity at the genetic, epigenetic, and stochastic levels, this approach may finally give researchers a leg up in the long-standing War on Cancer.

Materials and methods

Cell culture and reagents

PC9-VU and PC9-BR1 were obtained as a gift from Dr. William Pao (Roche, formerly of Vanderbilt University Medical Center). PC9-MGH was obtained as a gift from Dr. Jeffrey Settleman (Massachusetts General Hospital). PC9-VU, PC9-MGH, and PC9-BR1 were individually fluorescently labeled with histone H2B conjugated to monomeric red fluorescent protein (H2BmRFP), as previously described [33,55,56,90,91]. The PC9 cell line versions and derivatives were cultured in Roswell Park Memorial Institute (RPMI) 1640 Medium (Corning Inc., Corning, NY, USA) with 10% fetal bovine serum (Thermo Fisher Scientific, Waltham, MA, USA). Cells were incubated at 37°C, 5% CO2, and passaged twice a week using TrypLE (Thermo Fisher Scientific, Waltham, MA, USA). Cell lines and sublines were tested for mycoplasma contamination using the MycoAlert mycoplasma detection kit (Lonza, Basel, Switzerland), according to manufacturer’s instructions, and confirmed to be mycoplasma free. Cell line identity was confirmed using mutational signatures in WES. Approximately 99% of the PC9-VU mutations were also seen in PC9-MGH, but PC9-MGH exhibited a large number of unique and total mutations independently (S3C Fig). Additionally, using the SNPRelate R package (version 1.18.1), sample genotypes were projected into low dimensional space and clustered based on similarity (S13 Fig). PC9-VU and PC9-MGH clustered together and were represented closely in reduced dimensional space, further suggesting that they arose from the same parent cell population. Erlotinib was obtained from Selleck Chemicals (Houston, Texas) and solubilized in dimethyl sulfoxide (DMSO) at a stock concentration of 10 mM and stored at −20°C. Cell lines were originally stored at −80°C, then moved into liquid nitrogen.

Derivation of single-cell sublines

The PC9-VU sublines were generated by limiting dilution of the parental cell line in 96-well plates. Wells with single cells were expanded for multiple weeks until large enough to be saved as frozen cell stocks. A single stock of each subline was brought back into culture, passaged for 2 weeks, and used for drug response experiments. After 3 to 4 weeks of continued passaging, sublines were used for WES, bulk RNA sequencing, and scRNA-seq experiments. Since sublines were isolated from PC9-VU, they retained the same H2BmRFP nuclear label as cell line versions. An overlap analysis of PC9-VU and its derived sublines (S3E Fig) shows that the parental PC9-VU has the fewest unique mutations, indicating strong genetic overlap between it and the sublines.

Population-level DIP rate assay

Cells were seeded in black, clear-bottom 96-well plates (Corning Inc., Corning, NY, USA) at a density of 2,500 cells per well with 6 replicates for each sample. Plates were incubated at 37°C and 5% CO2. After cell seeding, drug was added the following morning and changed every 3 days until the end of the experiment or confluency. Untreated samples were allowed to grow in DMSO-containing media until confluency, with media changes every 3 days. Plates were imaged using automated fluorescence microscopy (Cellavista Instrument, Synentec, Elmshorn, Germany). Twenty-five nonoverlapping fluorescent images (20X objective, 5 × 5 montage) were taken twice daily for a total of 500 hours or until confluency. Cellavista image segmentation software (Synentec) was utilized to calculate nuclear count (i.e., cell count) per well at each time point (Source = Cy3, Dichro = Cy3, Filter = Texas Red, Emission Time = 800 μs, Gain = 20×, Quality = High, Binning = 2 × 2). Cell nucleus count across wells was used to calculate mean and 95% confidence intervals and normalized to time of drug treatment. Data were visualized using the ggplot2 R package (version 3.2.0).

Clonal fractional proliferation assay

We modified the original cFP assay, which tracks multiple colonies in a single well of a plate [57]. Instead, here we flow sorted single cells into a black, clear-bottom 384-well plate (Greiner Bio-One, Kremsmünster, Austria) using fluorescence-activated cell sorting (FACS Aria III, RFP+). Plates were incubated at 37°C, 5% CO2, and cells were allowed to grow into small colonies over 8 days in complete media (no media change). Drug was then added and changed every 3 days. Plates were imaged using the Cellavista Instrument (Synentec). Nine nonoverlapping fluorescent images (3 × 3 montage of the whole well at 10X magnification) were taken once daily for a total of 7 days. Cellavista image segmentation software (Synentec) was utilized to calculate nuclear count (i.e., cell count) per well at each time point (Source = Cy3, Dichro = Cy3, Filter = Texas Red, Emission Time = 800 μs, Gain = 20×, Quality = High, Binning = 2 × 2). Depending on the number of wells that passed quality control thresholding (at least 50 cells per colony at the time of treatment), 160 to 280 replicates were included for each sample. DIP rates were calculated from 48 h posttreatment to the end of the experiment using the lm function in R. DIP rates for each sample were combined and plotted as a kernel density estimate. Mood’s median test [92] was performed to determine statistical difference between subline DIP rate distributions using the RVAideMemoire R package. Data were visualized using the ggplot2 package.
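The DIP rate calculation described above can be sketched as follows. This is a hedged Python analogue of the per-colony linear fit (the analysis itself used the lm function in R). It assumes cell counts are arranged as a wells-by-time-points matrix, that time is measured in hours relative to drug addition, and that counts are log2-transformed before fitting. The function name and array layout are hypothetical.

```python
import numpy as np

def dip_rates(times_h, counts, min_cells=50, fit_start_h=48.0):
    """Per-well DIP rate: slope of log2(cell count) vs. time from fit_start_h on.

    times_h: hours relative to drug addition; counts: (n_wells, n_times) array.
    Returns (slopes for wells passing QC, boolean mask of wells kept).
    """
    times_h = np.asarray(times_h, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Quality control: at least min_cells per colony at the time of treatment
    t0 = np.argmax(times_h >= 0)
    keep = counts[:, t0] >= min_cells
    fit = times_h >= fit_start_h
    # Wells that reach zero cells in the fit window cannot be log-transformed
    keep &= (counts[:, fit] > 0).all(axis=1)
    y = np.log2(counts[keep][:, fit])
    # np.polyfit fits every column of a 2D y at once; row 0 holds the slopes
    slopes = np.polyfit(times_h[fit], y.T, 1)[0]
    return slopes, keep
```

The resulting slopes for one sample would then be pooled into a distribution, for example as a kernel density estimate, before comparing sublines.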

DNA bulk whole exome sequencing

Data acquisition

Genomic DNA was extracted using the DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany), according to the manufacturer’s protocol. Libraries were prepared using 150 ng of genomic DNA by first shearing the samples to a target insert size of 200 bp. Illumina’s TruSeq Exome kit (Illumina, Cat: 20020615) was utilized per manufacturer’s instructions. The samples were then captured using the Integrated DNA Technologies (Coralville, IA, USA) xGen Exome Research Panel v1.0 (IDT, Cat: 1056115). The resulting libraries were quantified using a Qubit fluorometer (ThermoFisher Scientific, Waltham, MA, USA), Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA) for library profile assessment, and qPCR (Kapa Biosystems, Wilmington, MA, USA, Cat: KK4622) to validate ligated material, according to the manufacturer’s instructions. The libraries were sequenced using the NovaSeq 6000 with 150 bp paired-end reads. RTA (version 2.4.11; Illumina, San Diego, CA, USA) was used for base calling and sequence-specific quality control analysis was completed using MultiQC v1.7. Reads were aligned to the University of California, Santa Cruz (UCSC) hg38 reference genome using BWA-MEM [93] (version 0.7.17) with default parameters.

Genomic mutational analysis

Mutation analysis for SNPs and InDels was performed using an in-house variant calling pipeline based on the Genome Analysis Toolkit (GATK, Broad Institute) recommendations. Duplicate reads were marked and replaced using PICARD (Broad Institute). Base recalibration and variant calling were performed using GATK version 3.8 (Broad Institute). Variants were selected and filtered based on gold standard SNPs and InDels, as well as a hard filtration according to GATK recommendations (SNPs: QD<2, QUAL<30, SOR>3, FS>60, MQ<40, MQRankSum<−12.5, ReadPosRankSum<−8; InDels: QD<2, QUAL<30, FS>200, ReadPosRankSum<−20). Total variants were counted using VCFtools [94] (version 0.1.15). Sequencing metrics were calculated using vcfR [95] (version 1.8.0). These metrics included read depth, mapping quality, and a Phred quality score [96,97]. Variants were annotated by VEP [58] (Ensembl Genome Browser version 95, available at ensembl.org) with multiple indicators, including chromosome name, gene symbol, mutation class, mutation type, and IMPACT rating. Mutation class corresponds to whether a variant is a substitution (single nucleotide variant (SNV), sequence alteration) or InDel (insertion, deletion, or both). Type corresponds to the result of a variant on the amino acid sequence: synonymous (no effect), missense (codon change), nonsense (codon stop or start), splice site (boundary of intron and exon), or a shift in frame (inframe or frameshift). IMPACT rating is a subjective classification of the severity of variant consequence, as defined by Ensembl and based on genetic variant annotation and predicted effects (e.g., amino acid change and protein structure modification). The IMPACT rating categories are as follows: “modifier” (no evidence of impact), “low” (unlikely to have disruptive impact), “moderate” (nondisruptive but might have effect), and “high” (assumed to have disruptive impact). 
“Modifier” variants were not plotted but constituted a majority of variants in all samples. Variants categorized into “low,” “moderate,” or “high” are referred to as IMPACT mutations in the text. The variant count distribution was organized as a mean-centered mutation count per chromosome for samples within each comparison set (cell line versions and sublines). Variant overlap analysis was conducted using VCFtools and visualized using the UpSet (version 1.4.0) R package. Variants unique to each sample were plotted as proportions of the total mutations in that sample.
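The hard-filter thresholds listed above can be expressed as a small rule table. The Python sketch below is illustrative only and is not part of the variant calling pipeline: it assumes each listed expression marks a variant as failing when true, that a variant's annotations are available as a dictionary, and that a missing annotation is not counted as a failure. The function and variable names are hypothetical.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt}

# A variant fails if any listed condition is true (thresholds from the text)
SNP_FILTERS = {
    "QD": ("<", 2.0), "QUAL": ("<", 30.0), "SOR": (">", 3.0), "FS": (">", 60.0),
    "MQ": ("<", 40.0), "MQRankSum": ("<", -12.5), "ReadPosRankSum": ("<", -8.0),
}
INDEL_FILTERS = {
    "QD": ("<", 2.0), "QUAL": ("<", 30.0), "FS": (">", 200.0),
    "ReadPosRankSum": ("<", -20.0),
}

def failed_filters(annotations, is_indel):
    """Return the names of hard filters that a variant fails.

    annotations: dict of annotation name -> numeric value (assumed layout).
    Annotations absent from the dict are skipped rather than treated as failing.
    """
    rules = INDEL_FILTERS if is_indel else SNP_FILTERS
    return [name for name, (op, threshold) in rules.items()
            if name in annotations and OPS[op](annotations[name], threshold)]
```

A variant with an empty result passes the hard filter. The same FS value can fail as a SNP yet pass as an InDel because the two thresholds differ.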

Automatic generation of a genetic mutational signature

Unique cell line version mutations were input into dNdScv [98] to generate a mutational signature that could define the genetic heterogeneity within the PC9 cell line family (all cell line versions and sublines). In this analysis, genes are first annotated by type. dNdScv then uses a maximum-likelihood model to detect genes under positive selection (i.e., potential driver mutations). For each gene, a variety of models were utilized to identify genes that have substantially more nonsynonymous mutations than expected in each of the nonsynonymous mutation types, as compared to synonymous mutational load. These metrics were combined together to calculate a global p-value (see original publication [98] for details). We used this approach, with a maximum number of mutations per gene per sample = 5 (tool recommendation; limits a hypermutator phenotype), to determine genes with a global p-value <0.05 in all 8 members of the PC9 cell line family. The resulting gene signature set the baseline for the genetic heterogeneity for all cell populations (see gene list in Table B in S1 Text). We visualized the mutation data as a heatmap of these significant genes and cell line versions or sublines, colored by annotated mutation type (number of variants in each gene-population pair are not annotated in heatmap but are shown in adjacent bar charts for number of variants per gene and sample).

Literature-curated, cancer-associated mutational signature

A cancer-associated gene list was established to supplement the predicted genetic heterogeneity signature from dNdScv. The gene list was created from the NIH Genetics Home Reference (GHR) key lung cancer genes (ghr.nlm.nih.gov/condition/lung-cancer#genes) and 2 additional publications of key mutations in lung cancer [2,59] (see S4A Fig for the gene names). Associated heatmaps were generated of this gene list for cell line versions and sublines, colored by annotated mutation type.

RNA single-cell transcriptome sequencing

Data acquisition

scRNA-seq libraries were prepared using the 10X Genomics gene expression kit (version 2, 3′ counting [99]; S5 Fig) and cell hashing [100]. Cells were prepared according to recommendations from the cell hashing protocol on the CITE-seq website (cite-seq.com/protocol). After cell preparation, 1 ng of 8 different cell hashing antibodies (TotalSeq-A025(1–8) anti-human Hashtag, BioLegend, San Diego, CA, USA) were used to label each of the 8 samples (labeling efficiency in S14 Fig). Hashed single-cell samples were combined in approximately similar proportions and “super-loaded” (aiming for approximately 20,000 cells, approximately 15,400 cells were obtained) onto the Chromium instrument. Cells were encapsulated according to manufacturer guidelines. Single-cell mRNA expression libraries were prepared according to manufacturer instructions. The leftover eluent of the mRNA expression library, containing the hashtag oligonucleotides (HTOs), was utilized to further size select the HTO library. The size-selected HTO library was PCR amplified to obtain higher-quality reads. Libraries were cleaned using SPRI beads (Beckman Coulter, Brea, CA, USA) and quantified using a Bioanalyzer 2100 (Agilent). The libraries were sequenced using the NovaSeq 6000 with 150 bp paired-end reads targeting 50 M reads per hashed sample for the mRNA library and a spike-in fraction for the HTO library. RTA (version 2.4.11; Illumina) was used for base calling and MultiQC (version 1.7) for quality control. Gene counting, including alignment, filtering, barcode counting, and unique molecular identifier (UMI) counting, was performed using the count function in the 10X Genomics software Cell Ranger (version 3.0.2) with the GRCh38 (hg38) reference transcriptome (S5A Fig). We utilized CITE-seq-Count (github.com/Hoohm/CITE-seq-Count) to count HTOs from the HTO library. 
We then used the Demux function in the R package Seurat [101] (satijalab.org/Seurat, version 3.2.2) to demultiplex the HTO and mRNA libraries and pair cells to their associated hashtag. Data were integrated into a count matrix with genes and cells, with HTO identity as a metadata tag.

Data analysis

After creating the demultiplexed, single-cell gene expression matrix, we removed cell multiplets using both cell hashtags (S15A Fig; at least 2 different hashtags detected with a single cell barcode) and automated doublet detection (DoubletFinder [102] version 3 with default parameters; S15B Fig). Additionally, poor-quality cells were removed based on a minimum cutoff of features (number of genes detected in each cell = 3,000) and count (number of RNA molecules detected within each cell = 15,000). These numbers were chosen subjectively but with respect to the overall sequencing depth in order to remove droplets with ambient RNA. Cells below these thresholds fell in isolated regions of the UMAP space and had a large degree of overlap with all other samples (S15C Fig). Feature selection was performed according to Seurat guidelines (0.1 < average gene expression < 8, log variance-to-mean ratio > 1; 574 genes met criteria). Data were visualized using the UMAP dimensionality reduction algorithm, as implemented in the Seurat [101] R package. To facilitate comparisons across cell line versions and sublines, we performed the UMAP projections in the space of all 8 cell line versions and PC9-VU sublines (PC9-VU, PC9-MGH, PC9-BR1, DS3, DS6, DS7, DS8, and DS9). To quantify overlap of cells between transcriptomic features, we performed k-means clustering of the cell line versions (k = 3) and the sublines (k = 2) using the NbClust package, which also identified the optimal number of clusters based on 30 different methods. Differential expression analysis was performed between a single sample (cell line version or subline) and the rest of the PC9 cell line family members using the Wilcoxon rank sum test [103] (as implemented in the FindMarkers function in Seurat, default settings). DEGs (adjusted p < 0.05) were used for downstream analyses (see “Gene Ontology analysis” below). 
In addition to UMAP, t-distributed Stochastic Neighbor Embedding (t-SNE) and principal component analysis (PCA) were also performed, using the Seurat implementation.
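The cell quality thresholds described above (a minimum of 3,000 detected genes and 15,000 RNA molecules per cell) amount to a simple mask over the count matrix. The Python sketch below is an illustration, not the Seurat workflow used in the analysis. It assumes a genes-by-cells matrix of UMI counts and that the cutoffs are inclusive; the function name is hypothetical.

```python
import numpy as np

def qc_keep_mask(counts, min_features=3000, min_counts=15000):
    """Boolean mask of cells passing the minimum feature and count cutoffs.

    counts: genes x cells matrix of UMI counts (assumed orientation).
    Whether the cutoffs are inclusive is an assumption of this sketch.
    """
    counts = np.asarray(counts)
    n_features = (counts > 0).sum(axis=0)   # genes detected in each cell
    n_counts = counts.sum(axis=0)           # RNA molecules detected in each cell
    return (n_features >= min_features) & (n_counts >= min_counts)
```

For a dense matrix, `counts[:, qc_keep_mask(counts)]` would retain only the cells that meet both cutoffs.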

Copy number variation analysis

CNVs were inferred from scRNA-seq data using inferCNV (Trinity CTAT Project, github.com/broadinstitute/inferCNV). The single-cell transcriptome count matrix (see “RNA single-cell transcriptome sequencing: Data acquisition” above), an annotation file (pairing each cell to its corresponding PC9 cell line family member), and a gene order file (derived from the GRCh38/hg38 gtf file) were used to create an inferCNV object. Separate objects were created for cell line versions (no reference group; an average of the 3 versions was used, default setting) and sublines (PC9-VU reference group; S10 Fig). The inferCNV analysis was run with a cutoff of 0.1 (default for droplet-based experimental methods). Clustering was performed based on annotation file groups (i.e., cell line versions and sublines). Analysis settings were to denoise the dataset and use a hidden Markov model for CNV predictions. Heatmaps of relative expression values (to the reference group) were output by chromosome for all cells in the analysis. Red corresponds to increased expression (amplification) and blue to decreased expression (deletion).

scRNA-seq functional interpretation analysis

The single-cell transcriptome count matrix (see “RNA single-cell transcriptome sequencing: Data acquisition” above) was scaled by multiplying counts by the median RNA molecules across all cells and dividing that number by the number of RNA molecules in each cell (as recommended). Gene signature files were obtained from the molecular signatures database (MSigDB). Hallmark gene sets (50 in total) were downloaded from MSigDB (gsea-msigdb.org/gsea/msigdb/genesets.jsp?collection=H). Both the scaled counts matrix and each of the hallmark gene sets were input into VISION [104] to identify gene signature scores for each cell-signature pair. Four hallmark gene sets (KRAS_SIGNALING_UP, KRAS_SIGNALING_DOWN, UV_RESPONSE_UP, and UV_RESPONSE_DOWN) were condensed into 2 (KRAS_SIGNALING and UV_RESPONSE) by VISION to leave 48 total gene signatures. Scores were compiled into a distribution and plotted by PC9 cell line family member for each gene set.
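As a rough illustration of the scaling step described above (counts multiplied by the median number of RNA molecules across cells, then divided by each cell's own total), the following Python sketch operates on a small list-of-lists matrix. The function name and data layout are hypothetical, and it assumes every cell has a nonzero total, which the quality-control filter would guarantee.

```python
from statistics import median


def scale_to_median_depth(counts):
    """counts: list of cells, each a list of raw gene counts.

    Returns counts rescaled so that every cell sums to the median
    per-cell total. Assumes no cell has a total of zero.
    """
    totals = [sum(cell) for cell in counts]
    med = median(totals)
    return [[c * med / t for c in cell] for cell, t in zip(counts, totals)]
```

After scaling, all cells share the same total, so differences between cells reflect relative rather than absolute expression.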

RNA bulk transcriptome sequencing

Data acquisition

Total RNA was extracted using a TRIzol extraction (ThermoFisher), according to the manufacturer’s protocol. RNA-seq libraries were prepared using 200 ng of total RNA and the NEBNext rRNA Depletion Kit (New England Biolabs, Ipswich, MA, USA, Cat: E6310X), per manufacturer’s instructions. The kit employs an RNaseH-based method to deplete both cytoplasmic (5S rRNA, 5.8S rRNA, 18S rRNA, and 28S rRNA) and mitochondrial ribosomal RNA (12S rRNA and 16S rRNA). The mRNA was enriched via poly-A-selection using oligo(dT) beads and then the RNA was thermally fragmented and converted to cDNA. The cDNA was adenylated for adaptor ligation and PCR amplified. The resulting libraries were quantified using a Qubit fluorometer (ThermoFisher), Bioanalyzer 2100 (Agilent) for library profile assessment, and qPCR (Kapa Biosciences, Cat: KK4622) to validate ligated material, according to the manufacturer’s instructions. The libraries were sequenced using the NovaSeq 6000 with 150 bp paired-end reads. RTA (version 2.4.11, Illumina) was used for base calling and MultiQC (version 1.7) for quality control. Reads were aligned using STAR [105] (version 2.5.2b) with default parameters to the STAR hg38 reference genome. Gene counts were obtained using the featureCounts [106] package (version 1.6.4) within the Subread package. The gene transfer format (GTF) file for the genes analyzed in the scRNA-seq data (provided by 10X Genomics and used in the Cell Ranger pipeline, generated from the hg38 reference transcriptome) was used to better facilitate internal comparison between scRNA-seq and bulk RNA-seq datasets.

Data analysis

RNA-seq data were analyzed using the DESeq2 [107] R package. Counts were transformed using the regularized logarithm normalization algorithm, as implemented in the rlog function of DESeq2. PCA was performed using the prcomp function in R, and hierarchical clustering was performed using the hclust R function with Ward’s minimum variance method. Data were visualized using the ggplot2 R package.

Gene ontology analysis

Setup

Genes associated with unique IMPACT mutations (classified as “low,” “moderate,” or “high” IMPACT scores; see “DNA bulk whole exome sequencing: Genomic mutational analysis” above) were identified for each comparison set (i.e., cell line versions or sublines). Additionally, DEGs from the scRNA-seq statistical comparisons for each comparison set were determined (see “RNA single-cell transcriptome sequencing: Data analysis” above). The 2 gene lists were independently subjected to a GO enrichment analysis using EnrichR [108] (version 2.1). Genes were compared to the ontology databases GO Biological Process 2018 (BP), GO Molecular Function 2018 (MF), and GO Cellular Component 2018 (CC), which we refer to as GO “type” in the text.

Correlation analysis

GO terms significantly enriched in the IMPACT mutations (p < 0.05) and in DEGs (p < 0.05) were identified and stored independently as separate GO term lists for each PC9 cell line family member. For terms shared between the lists, we calculated -log10(p) to rank terms based on statistical significance. Spearman correlation was calculated between the significant GO terms using the ggpubr R package (version 0.4.0), as long as a minimum of 5 significant terms were shared between the IMPACT and DEG GO term lists. Outlier GO terms (−log10(p) > 10) were excluded from the analysis (2 terms in PC9-BR1, 1 in PC9-MGH), in order to not unfairly skew correlation calculations.
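The correlation step can be sketched in plain Python as follows. This is an illustrative stand-in for the ggpubr-based calculation, not the code used in the study. The function names are hypothetical, and two details are assumptions because the text does not specify them: outliers are removed before the minimum-shared-terms check, and a term counts as an outlier if −log10(p) exceeds 10 in either list. All p-values are assumed to be greater than zero.

```python
from math import log10


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def go_term_spearman(impact_p, deg_p, min_shared=5, outlier_cutoff=10.0):
    """impact_p, deg_p: dicts mapping GO term -> enrichment p-value.

    Returns Spearman's rho over shared significant terms, or None when
    fewer than min_shared terms remain.
    """
    shared = [t for t in impact_p
              if t in deg_p and impact_p[t] < 0.05 and deg_p[t] < 0.05]
    x = [-log10(impact_p[t]) for t in shared]
    y = [-log10(deg_p[t]) for t in shared]
    keep = [i for i in range(len(shared))
            if x[i] <= outlier_cutoff and y[i] <= outlier_cutoff]
    if len(keep) < min_shared:
        return None
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(ranks([x[i] for i in keep]), ranks([y[i] for i in keep]))
```

Because Spearman correlation depends only on ranks, it is insensitive to the absolute scale of −log10(p), although extreme terms can still dominate the ranking when few terms are shared.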

Semantic similarity analysis

GO term lists for each PC9 cell line family member were further separated into GO types, which created GO term lists unique for each combination of sample (cell line version or subline), data modality (IMPACT mutations or DEGs), and GO type (“BP,” “MF,” or “CC”). For each sample, the mutation and DEG GO term lists associated with each GO type were compared using the semantic similarity metric from Wang and colleagues [64], as calculated in the GOSemSim package [109] (version 2.12.1) using the goSim function. This approach compares 2 individual GO terms using the underlying GO term network structure. Pairwise similarities were calculated on the lists of terms to generate similarity matrices for each sample. In order to avoid the dismissal of terms near any proposed statistical cutoff and ensure lists were of a minimum length, mutation and DEG GO term lists associated with each GO type for each sample were chosen randomly based on a modified p-value metric from the GO enrichment analysis. Specifically, terms were chosen from each list if a random number (between 0 and 1) was greater than the GO enrichment p-value. Semantic similarity distributions had a large skew, biased heavily toward lower values, primarily due to the size of the GO type graph network structure and, therefore, the “distance” between terms in the graph. To combat this issue, a maximum range of semantic similarity scores was chosen for each comparison (similar to but more robust than the “best max average” option provided in GOSemSim). The median of the top 1,000 scores and a 95% confidence interval were calculated for each sample-GO type comparison.
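The two sampling steps in this paragraph (probabilistic term selection and summarizing the upper tail of the similarity scores) can be sketched as below. The function names are hypothetical, the pairwise similarity scores themselves are assumed to come from GOSemSim, and the confidence interval is omitted because the text does not state how it was computed.

```python
import random
from statistics import median


def sample_terms(term_pvalues, rng):
    """Keep a GO term when a uniform random draw in [0, 1) exceeds its
    enrichment p-value, so low-p terms are almost always retained and
    high-p terms are retained only occasionally."""
    return [t for t, p in term_pvalues.items() if rng.random() > p]


def top_n_median(scores, n=1000):
    """Median of the n largest pairwise semantic similarity scores."""
    top = sorted(scores, reverse=True)[:n]
    return median(top)
```

For example, `sample_terms(pvals, random.Random(0))` gives a reproducible selection for a fixed seed, and `top_n_median(scores)` summarizes only the upper tail of a heavily right-skewed distribution.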

Semantic similarity scores were also correlated with the number of genes input into the GO enrichment analysis. To address this problem, random gene lists of the same lengths as IMPACT and DEG gene lists were chosen and input into the same process as experiment-derived gene lists in order to generate a simulated semantic similarity distribution. Depending on the length of the lists, these simulated distributions varied in both the median and variance. Therefore, instead of comparing experimental and simulated distributions, experimental semantic similarity scores were normalized to the median + one standard deviation of the simulated score, for each sample. These relative GO semantic similarity score distributions are represented in plots. Importantly, the number of GO terms varied across GO types (according to GOSemSim; BP: 12,272, MF: 4,165, and CC: 1,740). We assume that GO types with more terms are more biologically significant, i.e., BP is more predictive than MF, followed by CC. Data were visualized using the ggplot2 R package.
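A minimal sketch of the baseline normalization follows. The function name is hypothetical, and two choices are assumptions that the text leaves open: normalization is taken to mean division by the baseline (rather than subtraction), and the sample standard deviation is used.

```python
from statistics import median, stdev


def relative_scores(experimental, simulated):
    """Express experimental similarity scores relative to a baseline of
    median + one standard deviation of the simulated (random gene list)
    scores. Division by the baseline is an assumption."""
    baseline = median(simulated) + stdev(simulated)
    return [s / baseline for s in experimental]
```

Under this convention, a relative score above 1 indicates similarity greater than would be expected from random gene lists of the same lengths.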

In silico modeling of clonal fractional proliferation

Birth-death population growth models

Mathematical models of population growth dynamics were constructed using PySB [110], a Python-based kinetic modeling and simulation framework. We modeled cell proliferation as a simple birth-death process,

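$$\mathrm{Cell}_i \xrightarrow{k_{\mathrm{div},i}} \mathrm{Cell}_i + \mathrm{Cell}_i \qquad (1)$$
$$\mathrm{Cell}_i \xrightarrow{k_{\mathrm{dth},i}} \varnothing \qquad (2)$$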

where i is an integer index, kdiv,i and kdth,i are division and death rate constants, respectively, for cell type i, and ∅ denotes cell death. Note that there is no state switching included in the model. Models with one cell type were used to compare against experimental cFP data for the majority of the PC9-VU sublines (DS1, DS3, DS4, DS6, DS7, and DS9), while a two-cell-type model was used in one case (DS8).
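For orientation, the expected behavior of this birth-death process for a single cell type can be written in closed form. The following is a standard result for linear birth-death processes rather than a derivation taken from this study; here ⟨N_i⟩ denotes the mean number of cells of type i.

```latex
\frac{d\langle N_i\rangle}{dt}
  = \left(k_{\mathrm{div},i} - k_{\mathrm{dth},i}\right)\langle N_i\rangle
\quad\Longrightarrow\quad
\langle N_i(t)\rangle
  = N_i(0)\, e^{\left(k_{\mathrm{div},i} - k_{\mathrm{dth},i}\right) t},
\qquad
\frac{d\,\log_2\langle N_i\rangle}{dt}
  = \frac{k_{\mathrm{div},i} - k_{\mathrm{dth},i}}{\ln 2}
```

The last expression is the slope of the mean population size on a log2 scale, in doublings per unit time. Individual stochastic trajectories scatter around this slope, and that scatter is what the simulated colony-to-colony distributions capture.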

Stochastic simulations and in silico DIP rate distributions

All model simulations were run using the stochastic simulation algorithm (SSA) [65,88], as implemented in BioNetGen [111] (invoked from within PySB), to capture the effects of random fluctuations in division and death on cell population proliferation. We performed in silico cFP assays, where numerous single cells (run as independent simulations) were grown into colonies of variable size over 8 days of simulated time using the SSA and fixed rate constants for division and death (kdiv = 0.04⋅ln(2) h−1, kdth = 0.005⋅ln(2) h−1; Table C in S1 Text), based on vehicle-control proliferation data (Fig 1C, dashed lines; in the case of 2 cell types, both types were assumed to grow at the same rate outside drug). We ran as many simulations as there were experimental cFP trajectories for the PC9-VU subline being compared against (Fig 5A and 5E and S12A Fig). Drug treatment was then modeled by changing the rate constants for division and death (for 2 cell types, each was assumed to proliferate at different rates in drug; Table C in S1 Text) and running for the additional days of simulated time corresponding to each subline experiment. Simulated time courses were plotted at the same time points as in the corresponding experimental cFP assays for direct comparison. In silico DIP rates were obtained by taking log2 of the total cell counts and calculating the slope of a linear fit to the time course from the time of drug addition to the end of the simulation using the SciPy [112] linregress function. DIP rates for all in silico colonies were compiled into a distribution and compared to the corresponding experimental cFP distribution using the AD test [66]. The p-value for each simulation result was bootstrapped based on 100 resamples of the experimental distribution. For DS8, the two-cell-type model was used; all other aspects of the simulation were the same, except that drug treatment was modeled with 2 sets of division and death rate constants, one per cell type.
These simulations were constrained by the approximate DIP rate ranges for both DIP rate distribution modes (Fig 1D).
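The simulation and DIP rate calculation described above can be illustrated with a self-contained Python sketch. It is a pure-Python stand-in for the BioNetGen/PySB SSA and the SciPy linregress fit used in the study, not the authors' code. The function names are hypothetical, the rate constants are assumed to be positive, and skipping zero counts in the fit (because log2 of zero is undefined) is an assumption.

```python
import math
import random


def simulate_colony(n0, k_div, k_dth, sample_times, rng):
    """Gillespie SSA for Cell -> Cell + Cell (k_div) and Cell -> 0 (k_dth).

    sample_times must be sorted in ascending order (hours).
    Returns the cell count at each sample time.
    """
    t, n, counts = 0.0, n0, []
    for t_sample in sample_times:
        while n > 0:
            dt = rng.expovariate((k_div + k_dth) * n)   # time to next event
            if t + dt > t_sample:
                break
            t += dt
            # Choose division or death in proportion to the rate constants.
            n += 1 if rng.random() < k_div / (k_div + k_dth) else -1
        # Exponential waiting times are memoryless, so the clock can be
        # restarted at the sample time without biasing the trajectory.
        t = t_sample
        counts.append(n)
    return counts


def dip_rate(times, counts):
    """Least-squares slope of log2(count) versus time (doublings per hour)."""
    pts = [(t, math.log2(c)) for t, c in zip(times, counts) if c > 0]
    m = len(pts)
    mt = sum(t for t, _ in pts) / m
    my = sum(y for _, y in pts) / m
    sxy = sum((t - mt) * (y - my) for t, y in pts)
    sxx = sum((t - mt) ** 2 for t, _ in pts)
    return sxy / sxx
```

With the vehicle-control rate constants quoted above (kdiv = 0.04⋅ln(2) h−1, kdth = 0.005⋅ln(2) h−1), the fitted slope for a reasonably large colony should scatter around 0.035 doublings per hour. Repeating the simulation over many colonies yields the in silico DIP rate distribution.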

Parameter scans

We repeated the simulation procedure above over ranges of physiologically plausible division and death rate constant values (Table C in S1 Text). For each parameter set (either 1 or 2 pairs of postdrug division/death rate constants, depending on subline), AD tests for simulated versus experimental DIP rate distributions were performed. To account for variability among individual comparisons, we retained a parameter set only if its bootstrapped mean p-value minus one standard deviation exceeded 0.05, i.e., ⟨p⟩ − sdev(p) > 0.05 (so that we cannot reject the null hypothesis that the DIP rates are drawn from the same distribution). Retained parameter sets were plotted in a heatmap using the ggplot2 R package. Thus, all colored elements in the heatmap represent sets of division/death rate constants that produce DIP rate distributions statistically indistinguishable from the experimental distributions obtained from cFP assays. Note that the scan for the two-cell-type model (DS8) was performed in 4 dimensions (2 division and 2 death rate constants), but results were plotted in 2 dimensions for visual simplicity.
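The retention rule for the parameter scan can be expressed compactly. In this sketch the function names and the dictionary layout are hypothetical, the bootstrapped AD-test p-values are assumed to have been computed elsewhere, and the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def keep_parameter_set(bootstrap_pvalues, alpha=0.05):
    """True when mean(p) minus one standard deviation exceeds alpha."""
    return mean(bootstrap_pvalues) - stdev(bootstrap_pvalues) > alpha


def scan(grid_results, alpha=0.05):
    """grid_results: dict mapping (k_div, k_dth) -> list of bootstrapped
    p-values. Returns the retained parameter sets with their mean p-value."""
    return {params: mean(p) for params, p in grid_results.items()
            if keep_parameter_set(p, alpha)}
```

Subtracting one standard deviation makes the criterion conservative: a parameter set with a favorable mean p-value but widely scattered bootstrap results is still rejected.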

Supporting information

S1 Text. Extended discussions regarding genetic, epigenetic, and stochastic sources of tumor heterogeneity.

Four subsections are included, discussing (i) descriptions of the genetic, epigenetic, and stochastic levels of heterogeneity believed to coexist within tumors, (ii) the fundamental relationship between transcriptomics and epigenetics, (iii) the rationale behind the genetic-to-epigenetic correlation metric utilized in this work, and (iv) 2 hypotheses regarding the origins of the DS8 genetic mutant subline. Three tables are also included containing (A) a glossary of terms, (B) a list of genes associated with mutation heatmaps in Figs 3D and 4D, and (C) rate parameters used for stochastic simulations.

(PDF)

S1 Fig. The PC9 cell line family.

(A) Identification of canonical EGFR-ex19del in PC9 cell line family members. A screenshot from the IGV is shown. Red corresponds to potential deletions and blue to potential insertions. The data underlying this image can be found in the Sequence Read Archive (ncbi.nlm.nih.gov/sra) at accession #PRJNA632351. (B) PC9 cell line family tree. Two versions of the PC9 cell line were maintained separately in culture at 2 different institutions (VU and MGH). A resistant cell line (PC9-BR1) was derived from PC9-VU by dose escalation in the EGFRi afatinib. Several DS were also single-cell isolated from PC9-VU. Colors are consistent with data visualizations in main and supplementary figures. DS, discrete subline; EGFR, epidermal growth factor receptor; EGFRi, EGFR inhibitor; IGV, Integrative Genomics Viewer; MGH, Massachusetts General Hospital; VU, Vanderbilt University.

(TIF)

S2 Fig. cFP assays for PC9 cell line versions.

(A) PC9-MGH treated with erlotinib. (B) PC9-BR1 treated with erlotinib. All trajectories in A and B are normalized to approximately 72 h postdrug treatment. (C) PC9-VU treated with erlotinib. Trajectories for the parental line (gray) and the discrete sublines (colors) are plotted together for comparison. All subline trajectories, except DS8, are normalized to approximately 125 h post-erlotinib treatment; DS8 was normalized to the time of treatment because it was resistant and reached confluency during the course of the experiment. For the sublines, means of time point replicates are plotted. In all cases, number of colonies (n) are noted within the plots. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. cFP, clonal fractional proliferation.

(TIF)

S3 Fig. Metrics of WES data.

(A) Total number of mutations identified through variant calling compared to hg38 reference genome. Mutations are separated into substitutions, specifically SNPs and InDels. (B) Sequencing quality metrics for the PC9 cell line family (considered together as one group). DP is a measure of sequence coverage; MQ details how well the sequencing reads are mapped to the reference genome; QUAL is a score developed for Phred base calling that measures the confidence in called variants based on sequencing error probabilities; variant count is a reflection of the variants per site identified over small sections (windows) of the reference genome. (C–E) Quantified Venn diagram (i.e., UpSet plot) of unique, and intersections of, mutations in (C) cell line versions, (D) PC9-VU sublines, and (E) PC9-VU sublines and parental. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. InDel, insertion/deletion; SNP, single nucleotide polymorphism; WES, whole exome sequencing.

(TIF)

S4 Fig. Additional mutation analyses across cell line versions and sublines.

(A) Mutational differences between PC9 cell line family members for a literature-curated set of cancer-associated genes implicated in lung cancer (see Materials and methods). Heatmap elements are colored based on type of mutation. (B) Mutation class pie charts. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. Indel, insertion/deletion; SNV, single nucleotide variant.

(TIF)

S5 Fig. Metrics for scRNA-seq comparisons.

(A) Cell Ranger (support.10xgenomics.com/single-cell-gene-expression/software) output file detailing metrics of sequencing run (quality, mapping, barcode identification, etc.). The data underlying this image can be found in the Gene Expression Omnibus (ncbi.nlm.nih.gov/geo) data repository at accession #GSE150084. (B) Feature identification for genes that transcriptomically differentiate PC9 cell line family members. Variable genes are projected on a plot of dispersion vs. average gene expression and genes that pass a feature selection threshold are shown in red (0.1<average gene expression<8, log variance-to-mean ratio>1; 574 genes). The data underlying this plot can be found in github.com/QuLab-VU/GES_2021. scRNA-seq, single-cell RNA sequencing; UMI, unique molecular identifier.

(TIF)

S6 Fig. Clustering and alternative visualizations of scRNA-seq data.

(A) Clustering of cell line versions. Number of clusters (3) was defined based on majority rule from a consensus of 30 indices. Ward’s minimum variance method was used. (B) Quantification of cluster fraction by cell line version. (C) Same as A but for sublines. Two clusters were found to be the consensus. (D) Same as B but for sublines. (E) PCA visualization of single-cell transcriptomes. (F) t-SNE visualization of single-cell transcriptomes. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. PCA, principal component analysis; scRNA-seq, single-cell RNA sequencing; t-SNE, t-distributed Stochastic Neighbor Embedding.

(TIF)

S7 Fig. Bulk RNA-seq data.

(A) PCA of single-replicate normalized RNA-seq count data. (B) Hierarchical clustering of RNA-seq normalized count data. Clustering was performed on the pairwise Euclidean distance matrix created from the regularized log-transformed gene counts using Ward’s minimum variance method. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. PCA, principal component analysis; RNA-seq, RNA sequencing.

(TIF)

S8 Fig. VISION transcriptome functional interpretation analysis.

Single-cell gene expression matrix and MSigDB hallmark gene signatures were input to create a signature score for each cell. Scores were totaled for each population across each hallmark and plotted as a density distribution. All 50 hallmark signatures were sampled. Note that “KRAS signaling” and “UV response” had hallmark signatures for both up- and down-regulated. We condensed these 4 signatures into 2, leaving 48 hallmark signatures total. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. MSigDB, molecular signatures database.

(TIF)

S9 Fig. GO semantic similarity scores for each GO ontology type.

Significantly enriched GO terms for each data modality pair (IMPACT mutations and DEGs) were compared for each cell line family member for each GO type (BP, MF, and CC). The top 1,000 similarity scores within each pair were compiled into a distribution to calculate a median (white circle) and 95% confidence interval (error bars). Scores are plotted relative to a baseline, defined as the median + one standard deviation of simulated distributions (dashed lines). Simulated score distributions were calculated based on random gene lists of identical lengths to the experimental gene lists (see Materials and methods). The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. BP, Biological Process; CC, Cellular Component; DEG, differentially expressed gene; GO, Gene Ontology; MF, Molecular Function.

(TIF)

S10 Fig. Baseline expression values across chromosomes for PC9-VU parental.

Values are plotted as a heatmap. All PC9-VU sublines were compared against this baseline in the CNV analysis. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. CNV, copy number variant.

(TIF)

S11 Fig. Potential explanations for cell state heterogeneity in DS8.

(A) Multiple genetic states hypothesis. In this scenario, a genetic resistance mutation was acquired after the DS8 subline was established. Assuming the mutant state does not outgrow the original genetic state (i.e., a “selective sweep”), both genetic states should coexist within the subline. (B) Single genetic state hypothesis. In this scenario, a genetic resistance mutation emerged within the PC9-VU parental population and a cell containing that mutation was isolated to establish the DS8 subline. To explain our single-cell transcriptomics data, we hypothesize that cell–cell interactions between mutant and PC9-VU cells increase the death rate for mutant cells, making them a small proportion (<2%) of the total PC9-VU population.

(TIF)

S12 Fig. Comparison of experimental and simulated cFP assays for PC9-VU sublines.

(A) Experimental cFP time courses for 4 PC9-VU sublines (DS1, DS6, DS7, and DS9) in response to 3 μM erlotinib (same data used to generate DIP rate distributions in Fig 2D of the main text). Each trace corresponds to a single colony, normalized to 72 h postdrug treatment. Only colonies with cell counts greater than 50 at the time of treatment were kept; n represents the number of colony traces for each subline. (B) Simulated cFP time courses generated using division and death rate constants that closely reproduce the experimental time courses in A. Trajectories are normalized to the time at which the simulated drug treatment was initiated and simulated cell counts are plotted only at experimental time points. Although the same number of simulations were initiated as the number of colonies (n) in the corresponding experiment (see panel A), only simulated colonies with cell counts >50 at the time of simulated drug treatment are shown. (C) Comparison of experimental and simulated DIP rate distributions calculated from time courses in A and B. Distributions are compared statistically using the AD test (see Materials and methods). Bootstrapped p-values are shown (mean and standard deviation). Dashed black line signifies zero DIP rate, for visual orientation. (D) Parameter scan of division and death rate constants for the 4 sublines in A–C. For each pair of rate constants, we ran model simulations (same number as corresponding subline), calculated DIP rates, compiled them into a distribution, and then statistically compared against the corresponding experimental DIP rate distribution using the AD test (bootstrapped). Elements with p < 0.05 are colored white, indicating lack of statistical correspondence to experiment. × denotes the pair of division and death rate constants used in B. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. AD, Anderson–Darling; cFP, clonal fractional proliferation; DIP, drug-induced proliferation.

(TIF)

S13 Fig. Genomic relatedness between PC9 cell line family members.

(A) PCA of PC9 genotypes. Using a subset of SNPs in approximate linkage equilibrium, a genetic covariance matrix was calculated. The covariance matrix was converted to a correlation matrix to achieve appropriate scaling and PCA was run to identify SNP eigenvectors (loadings of the principal components). PC9 cell line family members are plotted along the principal component axes. (B) Hierarchical clustering of PC9 genotypes. Using an identity-by-state analysis, a matrix of genome-wide pairwise identities was calculated. Hierarchical clustering was performed on these identities to determine sample relatedness. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. PCA, principal component analysis; SNP, single nucleotide polymorphism.

(TIF)

S14 Fig. Sample identification in “hashed” PC9 cell line family members.

Proportional representation of cell populations with each of 8 specific “hashtag” antibodies, based on the HTO expression level. Each sample has a single corresponding HTO, while a minority of the HTO reads were unmapped. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. HTO, hashtag oligonucleotide.

(TIF)

S15 Fig. scRNA-seq quality control analyses.

(A) Cell hashing allowed for detection of cell multiplets (1 droplet with more than 1 cell) because multiple HTOs would be detected for a single cell barcode (i.e., droplet). Cells were segregated into singlets and doublets (i.e., multiplets). All detected cell transcriptomes were visualized using UMAP, noting singlets and doublets. (B) Automated DD was performed on the detected cell transcriptomes. Doublets were predicted and noted on the same UMAP visualization as in A. (C) Cells were scored based on number of features (e.g., genes) and count of detected RNA molecules. Cells with scores below a specified threshold (see Materials and methods) were classified as “poor” quality. Quality of each cell was noted on the same UMAP visualizations as in A and B. A total of 7,892 cells passed singlet and quality control thresholding. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. DD, doublet detection; HTO, hashtag oligonucleotide; scRNA-seq, single-cell RNA sequencing; UMAP, Uniform Manifold Approximation and Projection.

(TIF)

Acknowledgments

This work is dedicated to the memory of our friend and colleague Melaine N. Sebastian. We thank Jing Hao for reagent acquisition and Tony Capra, Bishal Paudel, Christian Meyer, Sarah Maddox Groves, Carlos Lopez, Alissa Weaver, John McLean, Maizie Zhou, and Ken Lau for useful discussions.

Sequencing studies were performed by the Vanderbilt Technologies for Advanced Genomics (VANTAGE, Vanderbilt University Medical Center) core, an institutionally supported core facility with help from Angela Jones, Karen Beeri, Jamie Roberson, Latha Raju, and Matthew Scholz. Sorting of labeled cells and single-cell seeding for cFP assays were performed with oversight from the Flow Cytometry Shared Resource (Vanderbilt University Medical Center). Drug changes on 384-well plates for cFP assays were performed at the Vanderbilt High-Throughput Screening (HTS) facility. Data processing and model simulations were performed using the computational resources available at the Advanced Computing Center for Research and Education (ACCRE) at Vanderbilt University.

Abbreviations

BP, Biological Process
CC, Cellular Component
cFP, clonal fractional proliferation
CNV, copy number variant
CSC, cancer stem cell
CV, coefficient of variation
DEG, differentially expressed gene
DIP, drug-induced proliferation
DMSO, dimethyl sulfoxide
EGFR, epidermal growth factor receptor
EGFRi, EGFR inhibitor
FACS, fluorescence-activated cell sorting
GATK, Genome Analysis Toolkit
GHR, Genetics Home Reference
GO, Gene Ontology
GTF, gene transfer format
HTO, hashtag oligonucleotide
InDel, insertion/deletion
MF, Molecular Function
MSigDB, molecular signatures database
NSCLC, nonsmall cell lung cancer
PCA, principal component analysis
scRNA-seq, single-cell RNA sequencing
SNP, single nucleotide polymorphism
SSA, stochastic simulation algorithm
t-SNE, t-distributed Stochastic Neighbor Embedding
UMAP, Uniform Manifold Approximation and Projection
UMI, unique molecular identifier
VEP, Variant Effect Predictor
WES, whole exome sequencing

Data Availability

The sequencing datasets generated in this study can be found in the Gene Expression Omnibus (GEO; GSE150084) and Sequence Read Archive (SRA; PRJNA631050 and PRJNA632351). Additional experimental data, together with the code used to generate model simulations and analyze experimental data, are publicly available on GitHub (github.com/QuLab-VU/GES_2021).

Funding Statement

This work was supported by the following funding sources: C.E.H., National Institutes of Health (NIH) Ruth L. Kirschstein National Research Service Award (NRSA, F31-CA221147) and Chemical-Biology Interface Training Grant (T32-GM0650); L.A.H., Vanderbilt Biomedical Informatics Training Program (NLM 5T15-LM007450-14), Quantitative Systems Biology Center at Vanderbilt, and National Cancer Institute (NCI) Transition Career Development Award to Promote Diversity (K22-CA237857-01A1); D.R.T., Lung Cancer Research Foundation (LCRF, UALC 13020513) and NIH Research Specialist Award (1R50CA243783); P.L.F., NIH NRSA (F31-CA165840); C.J.R., Vanderbilt Trans-Institutional Programs Grant: Understanding the Complexity of Life One Cell at a Time; V.Q., NIH Clinical and Translational Science Award (U54-CA113007); Sequencing studies were supported by the Vanderbilt Institute for Clinical and Translational Research (VICTR, Voucher VR52385). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Zhao B, Hemann MT, Lauffenburger DA. Intratumor heterogeneity alters most effective drugs in designed combinations. Proc Natl Acad Sci U S A. 2014;111:10773–8. doi: 10.1073/pnas.1323934111
2. Ramirez M, Rajaram S, Steininger RJ, Osipchuk D, Roth MA, Morinishi LS, et al. Diverse drug-resistance mechanisms can emerge from drug-tolerant cancer persister cells. Nat Commun. 2016;7:10690. doi: 10.1038/ncomms10690
3. Lawrence MS, Sougnez C, Lichtenstein L, Cibulskis K, Lander E, Gabriel SB, et al. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature. 2015;517:576–582. doi: 10.1038/nature14129
4. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SAJR, Behjati S, Biankin AV, et al. Signatures of mutational processes in human cancer. Nature. 2013;500:415–421. doi: 10.1038/nature12477
5. Greaves M, Maley CC. Clonal evolution in cancer. Nature. 2012;481:306–313. doi: 10.1038/nature10762
6. Burrell RA, McGranahan N, Bartek J, Swanton C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature. 2013;501:338–45. doi: 10.1038/nature12625
7. Andor N, Graham TA, Jansen M, Xia LC, Aktipis CA, Petritsch C, et al. Pan-cancer analysis of the extent and consequences of intratumor heterogeneity. Nat Med. 2015;22:105–113. doi: 10.1038/nm.3984
8. Marusyk A, Polyak K. Tumor heterogeneity: Causes and consequences. Biochim Biophys Acta Rev Cancer. 2010;1805:105–117. doi: 10.1016/j.bbcan.2009.11.002
9. Hanahan D, Weinberg RA. The hallmarks of cancer. Cell. 2000;100:57–70. doi: 10.1016/s0092-8674(00)81683-9
10. Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell. 2011;144:646–74. doi: 10.1016/j.cell.2011.02.013
11. De Bruin EC, McGranahan N, Mitter R, Salm M, Wedge DC, Yates L, et al. Spatial and temporal diversity in genomic instability processes defines lung cancer evolution. Science. 2014;346:251–256. doi: 10.1126/science.1253462
12. Campbell PJ, Yachida S, Mudie LJ, Stephens PJ, Pleasance ED, Stebbings LA, et al. The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature. 2010;467:1109–1113. doi: 10.1038/nature09460
13. Brock A, Chang H, Huang S. Non-genetic heterogeneity—a mutation-independent driving force for the somatic evolution of tumours. Nat Rev Genet. 2009;10:336–42. doi: 10.1038/nrg2556
14. Niepel M, Spencer SL, Sorger PK. Non-genetic cell-to-cell variability and the consequences for pharmacology. Curr Opin Chem Biol. 2009;13:556–561. doi: 10.1016/j.cbpa.2009.09.015
15. Russo A, Franchina T, Rita Ricciardi GR, Picone A, Ferraro G, Zanghì M, et al. A decade of EGFR inhibition in EGFR-mutated non small cell lung cancer (NSCLC): Old successes and future perspectives. Oncotarget. 2015;6:26814–26825. doi: 10.18632/oncotarget.4254
16. Xu J, Wang J, Zhang S. Mechanisms of resistance to irreversible epidermal growth factor receptor tyrosine kinase inhibitors and therapeutic strategies in non-small cell lung cancer. Oncotarget. 2017;8:90557–90578. doi: 10.18632/oncotarget.21164
17. Harris LA, Beik S, Ozawa PMM, Jimenez L, Weaver AM. Modeling heterogeneous tumor growth dynamics and cell–cell interactions at single-cell and cell-population resolution. Curr Opin Syst Biol. 2019;17:24–34. doi: 10.1016/j.coisb.2019.09.005
18. Pisco AO, Huang S. Non-genetic cancer cell plasticity and therapy-induced stemness in tumour relapse: “What does not kill me strengthens me”. Br J Cancer. 2015;112:1725–1732. doi: 10.1038/bjc.2015.146
19. Hinohara K, Polyak K. Intratumoral heterogeneity: more than just mutations. Trends Cell Biol. 2019. [cited 18 May 2019]. doi: 10.1016/j.tcb.2019.03.003
20. Marusyk A, Almendro V, Polyak K. Intra-tumour heterogeneity: a looking glass for cancer? Nat Rev Cancer. 2012;12:323–334. doi: 10.1038/nrc3261
21. Letai A. Functional precision cancer medicine-moving beyond pure genomics. Nat Med. 2017;23:1028–1035. doi: 10.1038/nm.4389
22. Huang S. Non-genetic heterogeneity of cells in development: more than just noise. Development. 2009;136:3853–62. doi: 10.1242/dev.035139
23. Raj A, van Oudenaarden A. Nature, nurture, or chance: stochastic gene expression and its consequences. Cell. 2008;135:216–226. doi: 10.1016/j.cell.2008.09.050
24. Samoilov MS, Price G, Arkin AP. From fluctuations to phenotypes: the physiology of noise. Sci STKE. 2006;2006:re17. doi: 10.1126/stke.3662006re17
25. Raser JM, O’Shea EK. Noise in gene expression: origins, consequences, and control. Science. 2005;309:2010–2013. doi: 10.1126/science.1105891
26. Rao CV, Wolf DM, Arkin AP. Control, exploitation and tolerance of intracellular noise. Nature. 2002;420:231–237. doi: 10.1038/nature01258
27. Sanchez A, Choubey S, Kondev J. Regulation of noise in gene expression. Annu Rev Biophys. 2013;42:469–91. doi: 10.1146/annurev-biophys-083012-130401
28. Thomas P, Terradot G, Danos V, Weiße AY. Sources, propagation and consequences of stochasticity in cellular growth. Nat Commun. 2018;9:1–11. doi: 10.1038/s41467-017-02088-w
  • 29.Huh D, Paulsson J. Non-genetic heterogeneity from stochastic partitioning at cell division. Nat Genet. 2011;43:95–100. doi: 10.1038/ng.729 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Risom T, Langer EM, Chapman MP, Rantala J, Fields AJ, Boniface C, et al. Differentiation-state plasticity is a targetable resistance mechanism in basal-like breast cancer. Nat Commun. 2018;9:3815. doi: 10.1038/s41467-018-05729-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Hata AN, Niederst MJ, Archibald HL, Gomez-Caraballo M, Siddiqui FM, Mulvey HE, et al. Tumor cells can follow distinct evolutionary paths to become resistant to epidermal growth factor receptor inhibition. Nat Med. 2016;22:262–269. doi: 10.1038/nm.4040 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Hangauer MJ, Viswanathan VS, Ryan MJ, Bole D, Eaton JK, Matov A, et al. Drug-tolerant persister cancer cells are vulnerable to GPX4 inhibition. Nature. 2017;551:247–250. doi: 10.1038/nature24297 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Paudel BB, Harris LA, Hardeman KN, Abugable AA, Hayford CE, Tyson DR, et al. A nonquiescent “idling” population state in drug-treated, BRAF-mutated melanoma. Biophys J. 2018;114:1499–1511. doi: 10.1016/j.bpj.2018.01.016 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Das Thakur M, Salangsang F, Landman AS, Sellers WR, Pryer NK, Levesque MP, et al. Modelling vemurafenib resistance in melanoma reveals a strategy to forestall drug resistance. Nature. 2013;494:251–255. doi: 10.1038/nature11814 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Hugo W, Shi H, Sun L, Piva M, Song C, Kong X, et al. Non-genomic and immune evolution of melanoma acquiring MAPKi resistance. Cell. 2015;162:1271–1285. doi: 10.1016/j.cell.2015.07.061 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Song C, Piva M, Sun L, Hong A, Moriceau G, Kong X, et al. Recurrent tumor cell–intrinsic and–extrinsic alterations during mapki-induced melanoma regression and early adaptation. Cancer Discov. 2017;7:1248–1265. doi: 10.1158/2159-8290.CD-17-0401 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Waddington CH. Canalization of development and the inheritance of acquired characters. Nature. 1942;150:563–565. doi: 10.1038/150563a0 [DOI] [PubMed] [Google Scholar]
  • 38.Huang S, Ernberg I, Kauffman S. Cancer attractors: A systems view of tumors from a gene network dynamics and developmental perspective. Semin Cell Dev Biol. 2009;20:869–876. doi: 10.1016/j.semcdb.2009.07.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Huang S. Genetic and non-genetic instability in tumor progression: link between the fitness landscape and the epigenetic landscape of cancer cells. Cancer Metastasis Rev. 2013;32:423–448. doi: 10.1007/s10555-013-9435-7 [DOI] [PubMed] [Google Scholar]
  • 40.Wales DJ, Bogdan T V. Potential energy and free energy landscapes. J Phys Chem B. 2006;110:20765–20776. doi: 10.1021/jp0680544 [DOI] [PubMed] [Google Scholar]
  • 41.Singh A, Soltani M. Quantifying intrinsic and extrinsic variability in stochastic gene expression models. PLoS ONE. 2013;8. doi: 10.1371/journal.pone.0084301 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Hilfinger A, Paulsson J. Separating intrinsic from extrinsic fluctuations in dynamic biological systems. Proc Natl Acad Sci U S A. 2011;108:12167–12172. doi: 10.1073/pnas.1018832108 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Von Dassow G, Meir E, Munro EM, Odell GM. The segment polarity network is a robust developmental module. Nature. 2000;406:188–192. doi: 10.1038/35018085 [DOI] [PubMed] [Google Scholar]
  • 44.Albert R, Othmer HG. The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. J Theor Biol. 2003;223:1–18. doi: 10.1016/s0022-5193(03)00035-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Tse MJ, Chu BK, Roy M, Read EL. DNA-binding kinetics determines the mechanism of noise-induced switching in gene networks. Biophys J. 2015;109:1746–1757. doi: 10.1016/j.bpj.2015.08.035 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Wang J, Zhang K, Xu L, Wang E. Quantifying the Waddington landscape and biological paths for development and differentiation. Proc Natl Acad Sci U S A. 2011;108:8257–8262. doi: 10.1073/pnas.1017017108 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Lynch TJ, Bell DW, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, et al. Activating mutations in the epidermal growth factor receptor underlying responsiveness of non–small-cell lung cancer to gefitinib. N Engl J Med. 2004;350:2129–2139. doi: 10.1056/NEJMoa040938 [DOI] [PubMed] [Google Scholar]
  • 48.Ben-David U, Siranosian B, Ha G, Tang H, Oren Y, Hinohara K, et al. Genetic and transcriptional evolution alters cancer cell line drug response. Nature. 2018;560:325–330. doi: 10.1038/s41586-018-0409-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Gupta PB, Fillmore CM, Jiang G, Shapira SD, Tao K, Kuperwasser C, et al. Stochastic state transitions give rise to phenotypic equilibrium in populations of cancer cells. Cell. 2011;146:633–644. doi: 10.1016/j.cell.2011.07.026 [DOI] [PubMed] [Google Scholar]
  • 50.Aranda-Anzaldo A, Dent MAR. Landscaping the epigenetic landscape of cancer. Prog Biophys Mol Biol. 2018;140:155–174. doi: 10.1016/j.pbiomolbio.2018.06.005 [DOI] [PubMed] [Google Scholar]
  • 51.Koizumi F, Shimoyama T, Taguchi F, Saijo N, Nishio K. Establishment of a human non-small cell lung cancer cell line resistant to gefitinib. Int J Cancer. 2005;116:36–44. doi: 10.1002/ijc.20985 [DOI] [PubMed] [Google Scholar]
  • 52.Jia P, Jin H, Meador CB, Xia J, Ohashi K, Liu L, et al. Next-generation sequencing of paired tyrosine kinase inhibitor-sensitive and -resistant EGFR mutant lung cancer cell lines identifies spectrum of DNA changes associated with drug resistance. Genome Res. 2013;23:1434–1445. doi: 10.1101/gr.152322.112 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Tan MC, Quinlan MP, Singh A, Sequist L V., Lynch TJ, Haber DA, et al. Reduced erlotinib sensitivity of epidermal growth factor receptor-mutant non-small cell lung cancer following cisplatin exposure: A cell culture model of second-line erlotinib treatment. Clin Cancer Res. 2008;14:6867–6876. doi: 10.1158/1078-0432.CCR-08-0093 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Sharma S V, Lee DY, Li B, Quinlan MP, Takahashi F, Maheswaran S, et al. A chromatin-mediated reversible drug-tolerant state in cancer cell subpopulations. Cell. 2010;141:69–80. doi: 10.1016/j.cell.2010.02.027 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Harris LA, Frick PL, Garbett SP, Hardeman KN, Paudel BB, Lopez CF, et al. An unbiased metric of antiproliferative drug effect in vitro. Nat Methods. 2016;13:497–500. doi: 10.1038/nmeth.3852 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Tyson DR, Garbett SP, Frick PL, Quaranta V. Fractional proliferation: a method to deconvolve cell population dynamics from single-cell data. Nat Methods. 2012;9:923–8. doi: 10.1038/nmeth.2138 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Frick PL, Paudel BB, Tyson DR, Quaranta V. Quantifying heterogeneity and dynamics of clonal fitness in response to perturbation. J Cell Physiol. 2015;230:1403–1412. doi: 10.1002/jcp.24888 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:1–14. doi: 10.1186/s13059-015-0866-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Govindan R, Ding L, Griffith M, Subramanian J, Dees ND, Kanchi KL, et al. Genomic landscape of non-small cell lung cancer in smokers and never-smokers. Cell. 2012;150:1121–1134. doi: 10.1016/j.cell.2012.08.024 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv. 2018. Available from: https://arxiv.org/abs/1802.03426. [Google Scholar]
  • 61.Becht E, McInnes L, Healy J, Dutertre CA, Kwok IWH, Ng LG, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 2019;37:38–47. doi: 10.1038/nbt.4314 [DOI] [PubMed] [Google Scholar]
  • 62.Carbon S, Douglass E, Dunn N, Good B, Harris NL, Lewis SE, et al. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 2019;47:D330–D338. doi: 10.1093/nar/gky1055 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25:25–29. doi: 10.1038/75556 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23:1274–1281. doi: 10.1093/bioinformatics/btm087 [DOI] [PubMed] [Google Scholar]
  • 65.Gillespie DT. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J Comput Phys. 1976;22:403–434. doi: 10.1016/0021-9991(76)90041-3 [DOI] [Google Scholar]
  • 66.Massey FJ Jr. The Kolmogorov-Smirnov test for goodness of fit. J Am Stat Assoc. 1951;46:68–78. doi: 10.1080/01621459.1951.10500769 [DOI] [Google Scholar]
  • 67.Shaffer SM, Dunagin MC, Torborg SR, Torre EA, Emert B, Krepler C, et al. Rare cell variability and drug-induced reprogramming as a mode of cancer drug resistance. Nature. 2017;546:431–435. doi: 10.1038/nature22794 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Shaffer SM, Emert BL, Reyes Hueros RA, Cote C, Harmange G, Schaff DL, et al. Memory sequencing reveals heritable single-cell gene expression programs associated with distinct cellular behaviors. Cell. 2020;182:947–959.e17. doi: 10.1016/j.cell.2020.07.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Schuh L, Saint-Antoine M, Sanford EM, Emert BL, Singh A, Marr C, et al. Gene networks with transcriptional bursting recapitulate rare transient coordinated high expression states in cancer. Cell Syst. 2020;10:363–378.e12. doi: 10.1016/j.cels.2020.03.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Yaffe MB. Why geneticists stole cancer research even though cancer is primarily a signaling disease. Sci Signal. 2019;12:eaaw3483. doi: 10.1126/scisignal.aaw3483 [DOI] [PubMed] [Google Scholar]
  • 71.Cagan R, Meyer P. Rethinking cancer: Current challenges and opportunities in cancer research. Dis Model Mech. Company of Biologists Ltd; 2017. pp. 349–352. doi: 10.1242/dmm.030007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Dagogo-Jack I, Shaw AT. Tumour heterogeneity and resistance to cancer therapies. Nat Rev Clin Oncol. 2018;15:81–94. doi: 10.1038/nrclinonc.2017.166 [DOI] [PubMed] [Google Scholar]
  • 73.Kaiser J. The cancer stem cell gamble. Science (80-). 2015;347:226–229. doi: 10.1126/science.347.6219.226 [DOI] [PubMed] [Google Scholar]
  • 74.Annett S, Robson T. Targeting cancer stem cells in the clinic: Current status and perspectives. Pharmacol Ther. 2018;187:13–30. doi: 10.1016/j.pharmthera.2018.02.001 [DOI] [PubMed] [Google Scholar]
  • 75.Huang S, Kauffman S. How to escape the cancer attractor: Rationale and limitations of multi-target drugs. Semin Cancer Biol. 2013;23:270–278. doi: 10.1016/j.semcancer.2013.06.003 [DOI] [PubMed] [Google Scholar]
  • 76.Zhao B, Sedlak JC, Srinivas R, Creixell P, Pritchard JR, Tidor B, et al. Exploiting temporal collateral sensitivity in tumor clonal evolution. Cell. 2016;165:234–246. doi: 10.1016/j.cell.2016.01.045 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Imamovic L, Ellabaan MMH, Dantas Machado AM, Citterio L, Wulff T, Molin S, et al. Drug-driven phenotypic convergence supports rational treatment strategies of chronic infections. Cell. 2018;172:121–134.e14. doi: 10.1016/j.cell.2017.12.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Imamovic L, Sommer MOA. Use of collateral sensitivity networks to design drug cycling protocols that avoid resistance development. Sci Transl Med. 2013;5:204ra132-204ra132. doi: 10.1126/scitranslmed.3006609 [DOI] [PubMed] [Google Scholar]
  • 79.Basanta D, Gatenby RA, Anderson ARA. Exploiting evolution to treat drug resistance: Combination therapy and the double bind. Mol Pharm. 2012;9:914–921. doi: 10.1021/mp200458e [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Lee MJ, Ye AS, Gardino AK, Heijink AM, Sorger PK, MacBeath G, et al. Sequential application of anticancer drugs enhances cell death by rewiring apoptotic signaling networks. Cell. 2012;149:780–94. doi: 10.1016/j.cell.2012.03.031 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Chylek LA, Harris LA, Tung C-S, Faeder JR, Lopez CF, Hlavacek WS. Rule-based modeling: a computational approach for studying biomolecular site dynamics in cell signaling systems. Wiley Interdiscip Rev Syst Biol Med. 2014;6:13–36. doi: 10.1002/wsbm.1245 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Chylek LA, Harris LA, Faeder JR, Hlavacek WS. Modeling for (physical) biologists: an introduction to the rule-based approach. Phys Biol. 2015;12:045007. doi: 10.1088/1478-3975/12/4/045007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Gillespie DT, Hellander A, Petzold LR. Perspective: Stochastic algorithms for chemical kinetics. J Chem Phys. 2013;138:170901. doi: 10.1063/1.4801941 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics. 2003;19:524–531. doi: 10.1093/bioinformatics/btg015 [DOI] [PubMed] [Google Scholar]
  • 85.Blinov ML, Schaff JC, Vasilescu D, Moraru II, Bloom JE, Loew LM. Compartmental and spatial rule-based modeling with Virtual Cell. Biophys J. 2017;113:1365–1372. doi: 10.1016/j.bpj.2017.08.022 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Wu F, Su R-Q, Lai Y-C, Wang X. Engineering of a synthetic quadrastable gene network to approach Waddington landscape and cell fate determination. Shou W, editor. Elife. 2017;6:e23702. doi: 10.7554/eLife.23702 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87.Huang S. The molecular and mathematical basis of Waddington’s epigenetic landscape: A framework for post-Darwinian biology? Bioessays. 2012;34:149–157. doi: 10.1002/bies.201100031 [DOI] [PubMed] [Google Scholar]
  • 88.Gillespie DT. Stochastic simulation of chemical kinetics. Annu Rev Phys Chem. 2007;58:35–55. doi: 10.1146/annurev.physchem.58.032806.104637 [DOI] [PubMed] [Google Scholar]
  • 89.Zhou JX, Aliyu MDS, Aurell E, Huang S. Quasi-potential landscape in complex multi-stable systems. J R Soc Interface. 2012;9:3539–3553. doi: 10.1098/rsif.2012.0434 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Meyer CT, Wooten DJ, Paudel BB, Bauer J, Hardeman KN, Westover D, et al. Quantifying drug combination synergy along potency and efficacy axes. Cell Syst. 2019;8:97–108.e16. doi: 10.1016/j.cels.2019.01.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91.Hardeman KN, Peng C, Paudel BB, Meyer CT, Luong T, Tyson DR, et al. Dependence on glycolysis sensitizes BRAF-mutated melanomas for increased response to targeted BRAF inhibition. Sci Rep. 2017;7:42604. doi: 10.1038/srep42604 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Mood AM. On the asymptotic efficiency of certain nonparametric two-sample tests. Ann Math Stat. 1954;25:514–522. doi: 10.1214/aoms/1177728719 [DOI] [Google Scholar]
  • 93.Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013. Available from: https://arxiv.org/abs/1303.3997. [Google Scholar]
  • 94.Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95.Knaus BJ, Grünwald NJ. vcfr: a package to manipulate and visualize variant call format data in R. Mol Ecol Resour. 2017;17:44–53. doi: 10.1111/1755-0998.12549 [DOI] [PubMed] [Google Scholar]
  • 96.Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I Accuracy assessment Genome Res. 1998;8:175–185. doi: 10.1101/gr.8.3.175 [DOI] [PubMed] [Google Scholar]
  • 97.Ewing B, Green P. Base-calling of automated sequencer traces using phred. II Error probabilities. Genome Res. 1998;8:186–194. doi: 10.1101/gr.8.3.186 [DOI] [PubMed] [Google Scholar]
  • 98.Martincorena I, Raine KM, Gerstung M, Dawson KJ, Haase K, Van Loop, et al. Universal patterns of selection in cancer and somatic tissues. Cell. 2017;171:1029–1041.e21. doi: 10.1016/j.cell.2017.09.042 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:1–12. doi: 10.1038/s41467-016-0009-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Stoeckius M, Zheng S, Houck-Loomis B, Hao S, Yeung BZ, Mauck WM, et al. Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. 2018;19:1–12. doi: 10.1186/s13059-017-1381-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies and species. Nat Biotechnol. 2018;36:411–420. doi: 10.1038/nbt.4096 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.McGinnis CS, Murrow LM, Gartner ZJ. DoubletFinder: doublet detection in single-cell RNA sequencing data using artificial nearest neighbors. Cell Syst. 2019;8:329–337.e4. doi: 10.1016/j.cels.2019.03.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103.Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019;15:e8746. doi: 10.15252/msb.20188746 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.DeTomaso D, Jones MG, Subramaniam M, Ashuach T, Ye CJ, Yosef N. Functional interpretation of single cell similarity maps. Nat Commun. 2019;10:4376. doi: 10.1038/s41467-019-12235-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105.Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 106.Liao Y, Smyth GK, Shi W. FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30:923–930. doi: 10.1093/bioinformatics/btt656 [DOI] [PubMed] [Google Scholar]
  • 107.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:1–21. doi: 10.1186/s13059-014-0550-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013;14:128. doi: 10.1186/1471-2105-14-128 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: An R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26:976–978. doi: 10.1093/bioinformatics/btq064 [DOI] [PubMed] [Google Scholar]
  • 110.Lopez CF, Muhlich JL, Bachman JA, Sorger PK. Programming biological models in Python using PySB. Mol Syst Biol. 2014;9:646–646. doi: 10.1038/msb.2013.1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 111.Harris LA, Hogg JS, Tapia J-J, Sekar JAP, Gupta S, Korsunsky I, et al. BioNetGen 2.2: Advances in rule-based modeling. Bioinformatics. 2016;32:3366–3368. doi: 10.1093/bioinformatics/btw469 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 112.Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Roland G Roberts

29 May 2020

Dear Dr Harris,

Thank you for submitting your (somewhat revised) manuscript entitled "A unifying framework disentangles genetic, epigenetic, and stochastic sources of drug-response variability in an in vitro model of tumor heterogeneity" for consideration as a Research Article by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, and I'm writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by Jun 02 2020 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Given the disruptions resulting from the ongoing COVID-19 pandemic, please expect delays in the editorial process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor

PLOS Biology

Decision Letter 1

Roland G Roberts

23 Aug 2020

Dear Dr Harris,

Thank you very much for submitting your manuscript "A unifying framework disentangles genetic, epigenetic, and stochastic sources of drug-response variability in an in vitro model of tumor heterogeneity" for consideration as a Research Article at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by three independent reviewers. Please accept my apologies for the unusual amount of time that the review process has taken during these challenging times.

The reviews of your manuscript are appended below. You will see that the reviewers find the work potentially interesting. However, based on their specific comments and following discussion with the academic editor, I regret that we cannot accept the current version of the manuscript for publication. We remain interested in your study and we would be willing to consider resubmission of a comprehensively revised version that thoroughly addresses all the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript would be sent for further evaluation by the reviewers.

IMPORTANT: Having discussed the reviews with the academic editor, it is clear that the reviewers' comments are strikingly concordant. In each case they are impressed by the scale of your study, and think that it is potentially appropriate for publication in our journal. HOWEVER, all of the reviewers request that you considerably improve the structure and presentation of your study, shortening the manuscript and improving the clarity and focus. There are also some requests for additional analyses that will enhance the utility and appeal of the paper for our readership. We encourage you to take these comments to heart and to carry out a thorough and conscientious re-working of the manuscript.

We appreciate that these requests represent a great deal of extra work, and we are willing to relax our standard revision time to allow you six months to revise your manuscript. We expect to receive your revised manuscript within 6 months.

Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point-by-point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Resubmission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this resubmission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor,

rroberts@plos.org,

PLOS Biology

*****************************************************

REVIEWERS' COMMENTS:

Reviewer #1:

In their manuscript, Hayford and coworkers apply a framework relating genetics, epigenetics, and stochasticity to derivatives of the non-small-cell lung cancer line, PC9. The framework is Waddington's epigenetic landscape, where the tent strings underneath the landscape are pulled to varying extents depending on mutations or other perturbations (PMID 29590606). Cell-to-cell heterogeneity rocks the landscape and opens the possibility of a cell escaping one attractor and entering another. The authors deeply describe the molecular and phenotypic characteristics of PC9 subclones along with polyclonal lines from different laboratories, performing exome sequencing, RNA-seq at the population and single-cell level, and fractional proliferation assays. The goal is to reconcile mutations, single-cell fluctuations in gene-expression state, and drug-induced proliferation rates by using the framework. It is difficult to say whether they accomplished that goal, but the multi-omics data provide a lot of observations that could be followed up on in the future.

The major issue I have with the manuscript is bloat. There are three types of experiments here—1) exome sequencing, 2) single-cell/bulk transcriptomics, 3) time-lapse recordings of inhibitor-treated cells—and associated analyses. Given the scope of the work and the findings, 87 pages is not impressive but agonizing. The Waddingtonian landscape does not require two pages to introduce, nor do we need a dozen examples of how noise can arise in biology. The experimental results in Figure 6 are a carbon copy of Figure 4 expanded by one PC9 subclone, artificially creating a density to the work that it does not deserve. The Supplementary Note is not a note at all but instead another dozen pages of text that editorializes on the model, gives an introductory primer on the different types of mutational events, and speculates on the role of stochasticity. The only text that really belongs there is the description of the growth model, as it is essential to understand the results shown in the main text. The authors should go back and look critically at every text passage, supplementary figure subpanel, and yet-another analysis toward justifying why each element is absolutely essential for their message.

Quite possibly, there are some interesting results here. If the UMAPs of Fig. 3G, 4G, and 6C are reassembled into a single graph, one sees an outlier transcriptional state in PC9-VU, which is recapitulated almost exactly in the DS9 clone—could this clone be thought of as a cancer stem cell-like founder for PC9s, with an epigenetic landscape that is shallow (or noisy) enough to populate the other basins? Reciprocally, the DS8 clone appears to be trapped in the outlier state—does the increased mutational burden mechanistically explain the altered landscape? Lip service to a mutated ABC transporter and a silent mutation in RELN is insufficient. Real biological connections that are plausibly mechanistic would allow the authors to look for related connections in other NSCLC datasets.

The quantity of comments relates directly to the volume of material provided.

Major points

1. Novelty and utility of the genetic-epigenetic-stochastic (G/E/S) framework. The manuscript pseudo-presents a set of ideas at the beginning as if they have never been described before. In fact, prior work by the authors follows almost the exact same organizing principles, with "genotype" substituted for "drug treatment" (PMID 29590606). It is fine to build off past examples—simply cite the best and most-informative examples and get on with the new results here. Second, while I do not doubt that the G/E/S framework provides a straightforward physics-based analogy, I question whether it provides more than a cartooned explanation of the findings. The framework has no predictive value, and the sketches are so flexible that it appears to be consistent with whatever data are at hand. Is there anything that could falsify the G/E/S framework? A theory that explains everything explains nothing.

2. Lack of clarity on the meaning of "Unknown" exonic mutations. Amidst all the text in the Supplementary Note defining SNPs, indels, and missense mutations, I can find no explanation for how exome sequencing can yield an "Unknown" mutation. The only way I can fathom is if the capture beads pulled down collateral bits of intronic and extragenic sequence, but then those mutations should be omitted from the data reporting as they are not in exons. Most of the mutations reported in Fig. 3D and S10D fall into the "Unknown" category, so it is critical to know what these are.

3. Lack of consistency in exome sequencing results. There are also inconsistencies in the exome data that were not acknowledged. For example, ZNF717 is reported to be mutated in three of the four clones in Fig. 4D. One would expect that frequency of mutation to be picked up in the PC9-VU bulk population, but no ZNF717 mutation is reported in Fig. 3D. Furthermore, only two ZNF717 mutations are reported among the five clones shown in Fig. S10B, which is supposed to be a one-clone expansion of the data in Fig. 4D.

4. Copy number alterations are ignored. The experimental implementation of the G/E/S framework focuses entirely on mutations and does not consider the impact of chromosome stability and copy-number alterations. In a larger, thematically-similar study of HeLa cells (PMID 30778230), copy-number alterations were a driving force for heterogeneity among derivatives of the line. The authors could address this concern without any further experiments by applying inferCNV to their bulk RNA-seq data on each PC9 derivative and clone. CCLE RNA-seq data on PC9s could serve as a "normal" starting reference, assuming that CCLE obtained their PC9s directly from the original source.

5. Equating transcriptional and epigenetic states. The authors acknowledge that epigenetics is more conventionally viewed in terms of histone and DNA modifications. My issue is less with the measurement type and more with the implied lifetime of the state if it is framed epigenetically. For a state to be epigenetic, there should be some degree of heritability before the state "relaxes" over several generations. But, some transcriptional states just randomly switch because of noise or other reasons that have nothing to do with epigenetics (see https://doi.org/10.1101/379016 for a nice description of the distinction between the two). The outlier state in PC9-VU and DS9 would be a prime candidate for assessing lifetime, if there were surface markers that could be sorted for (PMID: 28607484).

6. Proof-by-intimidation reporting of the omics data. Whereas the cFP and DIP experiments are generally clear, well described, and informative, the omics data is presented in a manner that overwhelms the reader more than it empowers them. Taking Fig. 3D-E as a case in point, the text is illegible at anything approaching a normal page, the shading that redundantly encodes the bar graph to the right is imperceptible even on a retina display, and the cancer genes list is effectively devoid of information (while being crammed with gridded lines and microscopic font). I would recommend a phased presentation, where the most-critical results are presented in main figures, and the gory details are presented ONCE in the supplementary figures (i.e., Figure S10A-D and removing Fig. 4A, D-F).

7. Excluding trivial clone-to-clone differences. As understood, the PC9-VU clones were isolated from a polyclonal population of PC9-VU cells ectopically expressing H2B-mCherry. The abundance of the H2B marker could readily change the epigenetic state of cells, as mCherry is much larger than H2B, and there are many unpublished anecdotes of H2B-mCherry cells behaving oddly, possibly the result of artifactually opening chromatin that should normally be closed. At the minimum, the authors should assess the relative abundance of H2B-mCherry in the different clones and confirm that the overall expression of the label is not dramatically different among clones.

Minor points

1. The "semantic similarity" is possibly a good way to bring together the mutational data with the transcriptomic data. However, there must be some type of interval estimate on the score and the polyclonal baseline used for comparison. Stability of the scores could be estimated by crossvalidation.

2. I can understand the motivation for "withholding" the DS8 clone in the presentation, but it really exacerbates the bloat of the manuscript. Also, in the presentation of the first four clones, it is not clear why DS9 and DS3 were selected, as they are very similar in their response to EGFR inhibition. Perhaps it is better to frame those two clones as a "within group" comparison to estimate how different the transcriptional states can be and yet yield similar drug-induced responses.

3. The formulation of the parameter inference of the stochastic DIP models is generally well done. However, I am not clear on why the division rates appear to flat-line at a division rate of ~0.045. I hope that was not simply the maximum division rate tested. Also, it is inappropriate to use large p values as a metric of goodness of fit. Much better would be to report the K-S statistic and shade to indicate the minimum value(s) that are deemed of interest.

4. This is not entirely the fault of the authors, but I find the V3 output of the GeneRanger software annoying for GEO uploads. The barcodes file precludes reviewers from quickly spot-checking the raw data against the figures. If the gene names could be exported along with the unique barcodes, that would be preferred.
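
As a rough illustration of minor point 3 above, the two-sample K-S statistic can be computed and reported directly rather than relying on a large p value. The sketch below is not the authors' code: the function name and the example DIP-rate values are hypothetical, chosen only to show the calculation.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical cumulative distribution functions (ECDFs)."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at a data point.
    for x in a + b:
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap

# Hypothetical DIP rates (illustrative values only, not data from the study).
experimental = [0.0010, 0.0018, 0.0025, 0.0031, 0.0040]
simulated = [0.0012, 0.0020, 0.0022, 0.0035, 0.0044]
print(ks_statistic(experimental, simulated))
```

Smaller values indicate closer agreement between simulated and measured distributions, so a parameter scan could shade each division/death rate pair by this statistic.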

Typographical points

1. Page 5: The reference to Harris et al. should only refer to 89.

2. Page 7: The statement "The result above illustrate…" is inaccurate. The only thing that can be said is that the mutations are associated with transcriptional differences.

3. Page 11: The statement "However, individual basins are not discernable.." is false. The outlier basin is clearly visible in Fig. 3G.

4. Figure S8: There is no need for a legend here. Just label the data points.

5. Supplemental Note, Page 12. "reduce" should be "reduced".

Reviewer #2:

[identifies himself as Aaron S Meyer]

Hayford et al. provide a framework to deconvolve drug response variability into genetic, epigenetic, and stochastic sources in an in vitro heterogeneous tumor model. They explore the genetic and epigenetic differences that explain drug response variability using mutational impact, single-cell differential gene expression, and semantic similarity of gene ontology analysis across 3 versions of the PC9 NSCLC cell line and 8 sublines of PC9-VU. The authors conclude that the cell line versions, as well as single cell-derived sublines differ at genetic and transcriptomic levels. They argue that their framework could be employed to account for all levels of heterogeneity when evaluating cancer treatment.

Overall, the manuscript presents a comprehensive analysis of heterogeneity within a cell line model. It presents a valuable analysis that would be of interest to the drug response and tumor heterogeneity communities. Before it is suitable for publication, however, there are some major concerns outlined below that must be resolved. I expect the authors can fully address these concerns with text changes and some additional computational analysis.

Major concerns:

* I appreciate that the authors have aimed to be comprehensive in their definitions and discussion. However, parts of the paper seem very wordy. As one example, genetic, epigenetic, and stochastic heterogeneity are defined within separate sections, and then again defined within a supplementary table. If there are suitable references the paper can use to explain these definitions, this could help to focus the paper on the new results presented here. It would also make the manuscript more accessible overall.

* I understand the comparison to the ideas proposed by Waddington but think (1) this needs to be clarified, and (2) its ultimate usefulness needs to be highlighted. The Waddington landscape model specifically describes a developmental process, and how barriers between states might increase as development progresses. Where is the analogous process here? In what way can one use the attractor model as a testable hypothesis, or is this a tautological idea?

* The authors must at least acknowledge that heterogeneity can arise from environmental and pharmacologic differences. There is extensive literature on the contribution of these factors, even within simple tissue culture systems.

* The tense throughout the manuscript is inconsistent which leads to confusion. The authors seem to start by referring to their experiments in the past, and observations in the present, which reads well. However, this changes a few times throughout the text.

* Semantic similarity is an interesting idea for demonstrating conserved molecular programs. Can the semantic similarity score be evaluated compared to a meaningful null model for significance? For example, randomized gene sets of the same size from those genes expressed in PC9? Maybe it can also be shown that the similarity between types of molecular measurements is higher within a cell line (e.g. BR8) as compared to between lines? I am concerned that GO enrichment might simply be a reflection of the genes expressed in this cell line, and that semantic similarity simply scales with the number of impacted genes.

* The authors' conclusions from their SSA model are overreaching and inconsistent. I agree that they can conclude their DIP rate measurements are consistent with the SSA model but do not agree that they can say much more. It is unclear to what extent their measurements are powered to reveal other forms of heterogeneity if present. Second, they are using one measure of phenotypic heterogeneity, when "epigenetic" heterogeneity is a much more encompassing term (e.g. drug response to other compounds). Third, their parameter uncertainty is simply that, and not evidence that the cells themselves fluctuate.
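
The null-model comparison requested in the semantic similarity comment above could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: `score_fn` stands in for whatever similarity score is used, and `overlap_score` is a deliberately simple hypothetical placeholder rather than a GO semantic similarity measure.

```python
import random

def empirical_p_value(observed_genes, expressed_genes, score_fn,
                      n_draws=1000, seed=0):
    """Compare an observed score against scores of random gene sets of the
    same size, drawn from the genes expressed in the cell line."""
    rng = random.Random(seed)
    observed = score_fn(observed_genes)
    null_scores = [
        score_fn(rng.sample(expressed_genes, len(observed_genes)))
        for _ in range(n_draws)
    ]
    at_least_as_large = sum(score >= observed for score in null_scores)
    # Add-one correction keeps the estimated p value strictly above zero.
    return observed, (at_least_as_large + 1) / (n_draws + 1)

def overlap_score(genes, reference):
    """Hypothetical stand-in score: fraction of genes found in a reference set."""
    return sum(gene in reference for gene in genes) / len(genes)
```

Because the random sets match the observed set in size and are drawn only from expressed genes, a score that merely scales with the number of impacted genes would not stand out from this null distribution.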

Minor comments:

* When comparing the SSA model to DIP measurements, the Anderson-Darling test should be more sensitive to differences as compared to the Kolmogorov-Smirnov test. Also, bootstrap resampling is necessary to prevent bias in the test statistic. This is summarized well here: https://asaip.psu.edu/articles/beware-the-kolmogorov-smirnov-test/

* I would not call PC9 a prototypical or archetypal example of lung cancer. It is a commonly used model, but like any cell line model (or any one tumor) presents a limited view of all lung cancer.

* There are a couple of places where there is the note: "(see below)". It is not clear what the authors are referring to.

* In figure 2B, the y-axis is density—is the magnitude of this quantity meaningful (e.g. in units of cell number)? If not, it would be better to normalize the distributions such that the y-axis would be 0-1.

* In figure 3, it is not clear to which plot "CV=12.84" belongs. Same in figure 4.
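
The first minor comment above recommends the Anderson-Darling test with resampling. A minimal sketch of that idea is given below, assuming two samples without tied values; it uses label permutation as the resampling scheme, which is an assumption made here for illustration and may differ from the bootstrap procedure the reviewer had in mind.

```python
import random

def anderson_darling_2samp(x, y):
    """Two-sample Anderson-Darling statistic (assumes no tied values)."""
    n, m = len(x), len(y)
    total = n + m
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    count_x = 0  # number of x-values among the i smallest pooled values
    stat = 0.0
    for i, (_, label) in enumerate(pooled[:-1], start=1):
        if label == 0:
            count_x += 1
        stat += (count_x * total - n * i) ** 2 / (i * (total - i))
    return stat / (n * m)

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Resampling-based p value: shuffle the pooled values and recompute."""
    rng = random.Random(seed)
    observed = anderson_darling_2samp(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if anderson_darling_2samp(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Unlike the K-S statistic, this statistic weights differences in the tails of the distributions more heavily, which is the sensitivity argument made in the comment.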

Reviewer #3:

Hayford et al tackle the issue of disentangling the genetic, epigenetic and stochastic sources of heterogeneity in cancer response to treatment. The authors tackle this problem with a combined approach of theoretical and mathematical modeling with genomics, single-cell transcriptomics, and experiments on drug response in three cancer cell lines and several monoclonal sublines derived from one of these three lines. This is a very important problem which has relevant repercussions on drug design and development of therapeutic strategies in cancer. The system and the approaches used are clever and effective and the results are interesting. However, I think the authors could have done a better job in analyzing the large amount of data they produced, in particular the transcriptomic data, which could provide a much more detailed and informative characterization of the epigenetic landscape of the system. Moreover, the manuscript is massive, highly redundant and difficult to read because the information is poorly organized and scattered through main text, methods and supplementary note (the pdf is also missing line numbers which makes it difficult to review). Most importantly, the results can be presented in a much more concise and effective way. As the most striking example, almost all the data reported in figure 4 is repeated in figure 6 and supplementary figure S10 (even with different color codes), with the addition of another subline, making figure 4 B C D E F G H and I panels completely redundant. Also, in another example, the authors separate the sublines in different subplots in figure 2C when Figure S4 shows that sublines can be put in the same plot, which is more concise and makes it easier to compare between them. A profound reorganization of the figures and of the results section is necessary before publication.

My major and minor points are detailed below:

Major points

1- Single cell RNA seq data is a tremendous resource to characterize the state of a cell but here we are left with very little information; the authors should extract more interpretable information from their data. For example, why don't they start from a UMAP including all their data, like the one they have in supplementary fig S7 (i.e. both lines and sublines), instead of splitting their analysis into three in the main text? Further, UMAP is a powerful visualization technique but it also lacks interpretability without integration or further analyses to characterize what the UMAP dimension 1 and 2 represent biologically. What are the molecular signatures and biological processes that separate the different clusters in the UMAP? What can we learn from these signatures about the epigenetic landscape? Which basin(s) of attraction, or cell state(s), is (are) associated with drug resistance? The authors could for example use the following tool to improve interpretability of their analysis: https://github.com/YosefLab/VISION.

2- The GO semantic similarity analysis between the genetic and epigenetic level is potentially interesting but we lack information about the results and interpretation of this analysis. For example, the authors provide just an example of the analysis for subline DS8 in supplementary table S2 but they only provide the GO identifiers and not the names, making any attempt to understand and interpret the analysis very hard. What are the biological processes and molecular functions that drive the semantic similarities between the genetic and epigenetic level of the cell lines? Are those processes or functions shared among some of the lines or sublines?

3 - In addition, if I understand correctly, the authors only compare each subline against the other sublines both at the genetic and RNA seq level, but wouldn't it be interesting to compare them also against the line from which they are derived?

4- Why haven't the authors performed single cell RNA seq of DS1?

5 - It would be nice if the authors could confidently conclude that the BR1 lines and the subline DS8, which are both resistant to the drug, achieve resistance through two distinct genetic and/or epigenetic cell states. However, from figure 4 G (or figure 6 C), it seems to me that a minority of the DS8 cells occupy the same epigenetic state of BR1. It is therefore difficult to conclude that the resistance in DS8 is not driven by those cells that are close to BR1, also considering that the bimodal distribution in the DIP rate would suggest that it is a minority of cells that drive the resistance. Maybe the authors could sort and profile mutations and RNA of the high DIP rate and low DIP rate cells separately and look at where those cells map in the genetic and epigenetic landscape?

6 - As explained above, authors should try to present the analyses of all lines and sublines combined. It doesn't make sense to repeat the analysis and presentation of the sublines data before and after adding the DS8, it's too redundant. Then, they can easily discuss the different nature of the DS8 with respect to the other sublines referring to the same plots.

Minor points

Main text

- First part of results, i.e. the lengthy description of the three levels, is actually more suitable for the introduction, and I'd suggest shortening it to improve readability, or alternatively moving it to the supplementary note.

- At page 7, when describing Fig 3 G i.e. the UMAP: 'we see a clear separation of the features for the cell line version in this space' this is confusing as features has been just used to indicate genes/transcript while describing the feature selection.

I guess what the authors meant is: 'we see a clear separation of the cell lines in the UMAP gene expression space' or something along these lines. The same applies to the similar sentence at page 9.

- At page 7: 'However most genetic mutations do not result in an altered expression of that gene'. Which gene?

- Why don't you describe IMPACT score just in the methods the first time you use it instead of just mentioning in the methods and then describe it in the supplementary note?

- Also check IMPACT is always used in upper case (I have seen at least one instance of lower case impact at page 16)

- You mention you didn't use the default covariates of dNdScv because they are not available for hg38, but can't you use your own covariates? There are lots of possible variables (i.e. chromatin state) that impact mutation rate and possibly confound the analyses.

- Cell cycle score: no details are provided on how this was computed.

- Feature selection: why filter out highly expressed genes?

- K means clustering: is it performed on the whole expression matrix or just on the UMAP projection? Also, why k=3? What happens with k=4? Did you try a silhouette analysis to find the optimal number of clusters?

- In silico modeling. The authors model all sublines with one set of division and death parameters, apart from DS8, for which they used two sets. Did they try some sort of model comparison to determine whether the sublines are better modeled with a single set or more than one set, or did they just decide by looking at the distribution of the DIP rate? From the RNA-seq epigenetic landscape it looks like the other lines might also be heterogeneous (i.e. a few cells of the DS9 subline occupy the same region of DS8) and so better explained by more than one set of parameters.
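
The silhouette analysis mentioned in the K-means comment above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it takes cluster labels as given (from any clustering run, for each candidate k), requires at least two clusters, and the function name is hypothetical.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette width for points (coordinate tuples) and their cluster
    labels. Requires at least two clusters; values near 1 indicate compact,
    well-separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # convention for single-member clusters
            continue
        # Mean distance to the other members of the point's own cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)
```

Repeating the clustering for several values of k and comparing the resulting mean silhouette widths gives one objective criterion for choosing between, say, k=3 and k=4.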

Figures

- Figure 2 A and C y axis 'Normalized log 2 cell count'

- Figure 3 A, Figure 4 A, these circular barplots are less readable than simpler linear ones

- Figure 3 D, E and Figure 4 D, E: Axes and legends are particularly small and hard to read; please make them uniform with the ones on the other panels.

Supplementary note

- Most of the supplementary note is long, detailed, and describes such basic knowledge (i.e. the description of the different types of mutations etc.) that it is more appropriate for an introduction to genetics or for a PhD dissertation than for this manuscript.

Decision Letter 2

Roland G Roberts

10 Feb 2021

Dear Dr Harris,

Thank you for submitting your revised Research Article entitled "Disentangling genetic, epigenetic, and stochastic sources of cell state variability in an in vitro model of tumor heterogeneity" for publication in PLOS Biology. I have now obtained advice from the original reviewers and have discussed their comments with the Academic Editor. 

Based on the reviews, we will probably accept this manuscript for publication, provided you satisfactorily address the remaining points raised by the reviewers. Please also make sure to address the following data and other policy-related requests.

IMPORTANT:

a) Please address the remaining concerns raised by reviewers #1 and #2.

b) Please change your title to a more declarative form. We suggest something like "An in vitro model of tumor heterogeneity resolves genetic, epigenetic, and stochastic sources of cell state variability"

c) Many thanks for depositing the code on Github. However, we will need you to make the data underlying your Figures available, either on Github or as supplementary data files (see "Data Policy" requests further down). In addition, your Data Availability Statement currently includes "Additional experimental data will be available from the corresponding author upon request" - this is not compliant with our policy, and we discourage reliance on a single named individual.

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within two weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

-  a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

-  a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

-  a track-changes file indicating any changes that you have made to the manuscript. 

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information  

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods 

Please do not hesitate to contact me should you have any questions.

Sincerely,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor,

rroberts@plos.org,

PLOS Biology

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797 

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication. 

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 2ABCD, 3ABCDEFG, 4ABCDEFG, 5ABCDEFGH, S3ABC, S4ABCDE, S5AB, S6B, S7ABCDEFGH, S8AB, S9, S10, S11, S12, S13ABCD, S15AB, S16, S17ABC. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

 ------------------------------------------------------------------------

BLOT AND GEL REPORTING REQUIREMENTS:

For manuscripts submitted on or after 1st July 2019, we require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare and upload them now. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements 

------------------------------------------------------------------------

REVIEWERS' COMMENTS:

Reviewer #1:

Hayford et al. PLoS Biol Manuscript #PBIOLOGY-D-20-01505R2

The revision of Hayford et al. has been considerably streamlined. The fog of readability has mostly lifted, but that has uncovered additional conceptual and technical concerns that must be addressed or clarified.

Major points

1. "Distance" on the genetic axis-epigenetic landscape (Fig. 1, 6, S14)—The authors wisely removed the evolutionary branching schematics in these cartoon figures. Now, more attention must be paid to the illustration of their conceptual paradigm. Fig. 1 schematizes the genetic axis as a linear path, one assumed to reflect the chronological accumulation of mutations. However, the vertical axis of Fig. 6 does not appear to encode any such phylogeny and instead is used as a simple means to spread out the PC9 derivatives that were studied. Likewise, for the epigenetic landscape, the authors make a big deal about transitioning to adjacent basins of attraction, but what does left-vs.-right on the axis imply? Could all the cartoons be flipped or translated horizontally without any loss of meaning? Are the red-green-blue basins in Fig. 1 supposed to line up vertically or are they NOT supposed to line up? In Fig. 6, what evidence is there that DS3 is a state that mutates into DS8? In Fig. S14, the implication is that DS8 derives from DS9. If the reader is not supposed to read into distance or location on the horizontal or vertical axes, what is the point of having the different PC9 derivatives share common axes at all?

2. Phylogenetic relationship between PC9-VU and PC9-MGH (Fig. S2B, S4C, S15)—The revised text gives the impression that PC9-VU and PC9-MGH had diverged from a common ancestor some unknown time ago (Fig. S2B). Lost in the weeds of the first submission was that "99% of the PC9-VU mutations were also seen in PC9-MGH" (Fig. S4C). Assuming the same genetic drift, does that not imply that PC9-VU is much closer to the ancestral line? Also, the main text should make clear that the identity of all the lines/clones was confirmed by the ~1e5 SNPs shared among PC9-MGH, PC9-VU, and PC9-BR1 (Fig. S4C); based on the language, the reader defaults to thinking that one of the lines has been misidentified. Last, it is unclear why the authors used SNPs (instead of somatic mutations not appearing as common variants in dbSNP) to evaluate the phylogenies in Fig. S15. Should not the PC9-BR1 and DS clones all derive from PC9-VU if these analyses are accurate?

3. Inappropriate use of inferCNV (Fig. 3E, 4E)—This reviewer appreciates the use of inferCNV to estimate copy number, but it is important to remember that it is only an estimate that depends on how the algorithm was deployed. Inferring gains-losses of the DS clones by using PC9-VU as the reference (Fig. 4E) is acceptable for estimating chromosomal changes relative to the parental population. Doing the same for PC9-MGH, PC9-VU, and PC9-BR1 using their average gene expression is just wrong, because all it does is map gene expression differences among the three lines to their position on chromosomes. Estimation of true gains-losses would require some type of normal reference, which would be difficult for NSCLC. A better substitute in this circumstance would be to estimate local copy number profiles directly from the copy-number estimates of the bulk whole-exome sequencing for the three.

4. Inappropriate use of UMAP (Fig. S7C, S7F)—UMAP is a fine way to reduce dimensions to preserve local and global relationships, but the dimensions should not be used to calculate subtle distances between centroids (the UMAP developer says as much: https://github.com/lmcinnes/umap/issues/92). These calculations should be removed.

5. Overinterpretation of the stochastic birth-death simulations (Fig. 5, S13)—The choice of bootstrapped Anderson-Darling recommended by Reviewer #2 is fine as a metric to compare data and stochastic simulations. However, the language of the main text is backwards relative to standard hypothesis testing in frequentist statistics. A p value greater than 0.05 does not provide "strong evidence that these sublines […] are monoclonal"; only a small p value could provide evidence that the sublines are NOT monoclonal. Instead, the results suggest that the data are consistent with a monoclonal population, and Occam's razor argues against invoking more complicated population structures. Compared to the results text, the methods text does a better job describing what these simulations do and do not support.

Minor points

1. Although the manuscript is much more streamlined, there remain instances of superfluous information that is unhelpful. For example, how does Fig. S1 enhance our appreciation that some genomic differences matter and others do not? Elsewhere, Fig. S12 classifies a UMAP plot based on inferred cell-cycle phase. This analysis can be done, but how does it provide "further evidence that DS3, DS6, and DS7/9 correspond to three distinct cell states"? What result from the cell-cycle categorization would have falsified that there are three distinct cell states?

2. There remains some confusing duplication in the revision. On first glance at Fig. 3D, for instance, the question arises why so many gene rows appear to have no mutations. One presumes that these are the same gene rows as shown in Fig. 4D, but the figure captions do not clarify this point. Also, although the figure is too small for row labels, there could be a handful of spaced row numbers that could be used to interpret the figure alongside the row numbers of Table S2.

Reviewer #2:

[identifies himself as Aaron S Meyer]

Hayford et al have substantially improved their manuscript by streamlining the text and figures. I appreciate the analysis provided to address my and the other reviewers' concerns. I have a couple of addressable concerns below and am supportive of publication after these minor adjustments.

Line 101: "…emerged in the absence of selective pressures." This statement could be a bit more exact. While no drug selection was applied, the cell culturing process itself or features of that environment could be providing selective pressures for continued cell line adaptation.

Lines 324-328: Since the semantic similarity relies on input about the difference between sublines, is it not differently powered to find links when there is a greater or lesser difference? Perhaps it would be better to say there is a strong genomic-transcriptomic link in DS8, but then other lines either have a weak link or insufficient changes to establish a link?

Reviewer #3:

The authors have addressed the majority of my concerns. The presentation of the results is now clearer and the manuscript is more concise, therefore I recommend its publication in Plos Biology.

Decision Letter 3

Roland G Roberts

4 Mar 2021

Dear Leonard,

Thank you for submitting your revised Research Article entitled "An in vitro model of tumor heterogeneity resolves genetic, epigenetic, and stochastic sources of cell state variability" for publication in PLOS Biology.

IMPORTANT:

Sorry, we're nearly there, and we just need to finalise the data provision. Please see the Data Policy section further down. The issue is that while you point out that your experimental data are available in the GitHub deposition (github.com/QuLab-VU/GES_2021), this seems to contain the raw (-ish) data and code, rather than the numerical values underlying the Figs. So...

a) Please could you add these underlying data to the GitHub repository or supply them as supplementary spreadsheets (for Figs 2ABCD, 3ABCDEFG, 4ABCDEFG, 5ABCDEFGH, S3ABC, S4ABCDE, S5AB, S6B, S7ABCDEFGH, S8AB, S9, S10, S11, S12, S13ABCD, S15AB, S16, S17ABC). Or if I'm wrong and have misunderstood the files in GitHub, please clarify.

b) Either way, please cite the location of the underlying data in each relevant main and supplementary Figure legend, e.g. "The data underlying this Figure can be found in github.com/QuLab-VU/GES_2021" or "The data underlying this Figure can be found in S1 Data"

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within two weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

-  a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

-  a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

-  a track-changes file indicating any changes that you have made to the manuscript. 

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information  

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods 

Please do not hesitate to contact me should you have any questions.

Sincerely,

Roli

Roland G Roberts, PhD,

Senior Editor,

rroberts@plos.org,

PLOS Biology

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 2ABCD, 3ABCDEFG, 4ABCDEFG, 5ABCDEFGH, S3ABC, S4ABCDE, S5AB, S6B, S7ABCDEFGH, S8AB, S9, S10, S11, S12, S13ABCD, S15AB, S16, S17ABC. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

Decision Letter 4

Roland G Roberts

16 Mar 2021

Dear Leonard,

On behalf of my colleagues and the Academic Editor, Mark Siegal, I'm pleased to say that we can in principle offer to publish your Research Article "An in vitro model of tumor heterogeneity resolves genetic, epigenetic, and stochastic sources of cell state variability" in PLOS Biology, provided you address any remaining formatting and reporting issues. These will be detailed in an email that will follow this letter and that you will usually receive within 2-3 business days, during which time no action is required from you. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have made the required changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS: We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for supporting Open Access publishing. We look forward to publishing your paper in PLOS Biology. 

Sincerely, 

Roli

Roland G Roberts, PhD 

Senior Editor 

PLOS Biology

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. Extended discussions regarding genetic, epigenetic, and stochastic sources of tumor heterogeneity.

    Four subsections are included, discussing (i) descriptions of the genetic, epigenetic, and stochastic levels of heterogeneity believed to coexist within tumors, (ii) the fundamental relationship between transcriptomics and epigenetics, (iii) the rationale behind the genetic-to-epigenetic correlation metric utilized in this work, and (iv) 2 hypotheses regarding the origins of the DS8 genetic mutant subline. Three tables are also included containing (A) a glossary of terms, (B) a list of genes associated with mutation heatmaps in Figs 3D and 4D, and (C) rate parameters used for stochastic simulations.

    (PDF)

    S1 Fig. The PC9 cell line family.

    (A) Identification of canonical EGFR-ex19del in PC9 cell line family members. A screenshot from the IGV is shown. Red corresponds to potential deletions and blue to potential insertions. The data underlying this image can be found in the Sequence Read Archive (ncbi.nlm.nih.gov/sra) at accession #PRJNA632351. (B) PC9 cell line family tree. Two versions of the PC9 cell line were maintained separately in culture at 2 different institutions (VU and MGH). A resistant cell line (PC9-BR1) was derived from PC9-VU by dose escalation in the EGFRi afatinib. Several DS were also single-cell isolated from PC9-VU. Colors are consistent with data visualizations in main and supplementary figures. DS, discrete subline; EGFR, epidermal growth factor receptor; EGFRi, EGFR inhibitor; IGV, Integrative Genomics Viewer; MGH, Massachusetts General Hospital; VU, Vanderbilt University.

    (TIF)

    S2 Fig. cFP assays for PC9 cell line versions.

    (A) PC9-MGH treated with erlotinib. (B) PC9-BR1 treated with erlotinib. All trajectories in A and B are normalized to approximately 72 h postdrug treatment. (C) PC9-VU treated with erlotinib. Trajectories for the parental line (gray) and the discrete sublines (colors) are plotted together for comparison. All subline trajectories, except DS8, are normalized to approximately 125 h post-erlotinib treatment; DS8 was normalized to the time of treatment because it was resistant and reached confluency during the course of the experiment. For the sublines, means of time point replicates are plotted. In all cases, the number of colonies (n) is noted within the plots. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. cFP, clonal fractional proliferation.

    (TIF)

    S3 Fig. Metrics of WES data.

    (A) Total number of mutations identified through variant calling compared to hg38 reference genome. Mutations are separated into substitutions, specifically SNPs and InDels. (B) Sequencing quality metrics for the PC9 cell line family (considered together as one group). DP is a measure of sequence coverage; MQ details how well the sequencing reads are mapped to the reference genome; QUAL is a score developed for Phred base calling that measures the confidence in called variants based on sequencing error probabilities; variant count is a reflection of the variants per site identified over small sections (windows) of the reference genome. (C–E) Quantified Venn diagram (i.e., UpSet plot) of unique, and intersections of, mutations in (C) cell line versions, (D) PC9-VU sublines, and (E) PC9-VU sublines and parental. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. InDel, insertion/deletion; SNP, single nucleotide polymorphism; WES, whole exome sequencing.

    (TIF)

    S4 Fig. Additional mutation analyses across cell line versions and sublines.

    (A) Mutational differences between PC9 cell line family members for a literature-curated set of cancer-associated genes implicated in lung cancer (see Materials and methods). Heatmap elements are colored based on type of mutation. (B) Mutation class pie charts. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. Indel, insertion/deletion; SNV, single nucleotide variant.

    (TIF)

    S5 Fig. Metrics for scRNA-seq comparisons.

    (A) Cell Ranger (support.10xgenomics.com/single-cell-gene-expression/software) output file detailing metrics of sequencing run (quality, mapping, barcode identification, etc.). The data underlying this image can be found in the Gene Expression Omnibus (ncbi.nlm.nih.gov/geo) data repository at accession #GSE150084. (B) Feature identification for genes that transcriptomically differentiate PC9 cell line family members. Variable genes are projected on a plot of dispersion vs. average gene expression and genes that pass a feature selection threshold are shown in red (0.1<average gene expression<8, log variance-to-mean ratio>1; 574 genes). The data underlying this plot can be found in github.com/QuLab-VU/GES_2021. scRNA-seq, single-cell RNA sequencing; UMI, unique molecular identifier.

    (TIF)
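The feature-selection rule quoted in panel B of S5 Fig (0.1 < average gene expression < 8, log variance-to-mean ratio > 1) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name, the toy expression values, the use of the natural logarithm, and the use of the unscaled sample variance are all assumptions made here for illustration, and any binning or scaling of dispersion in the actual analysis is not reproduced.

```python
import math
from statistics import mean, variance


def select_variable_genes(expr, lo=0.1, hi=8.0, min_log_vmr=1.0):
    """Return genes passing a mean-expression window and a dispersion cutoff.

    expr maps a gene name to a list of per-cell expression values.
    Dispersion is taken as log(variance / mean), an assumption of this sketch.
    """
    selected = []
    for gene, values in expr.items():
        m = mean(values)
        v = variance(values)  # sample variance across cells
        if m <= 0 or v <= 0:
            continue  # log(variance / mean) is undefined
        log_vmr = math.log(v / m)
        if lo < m < hi and log_vmr > min_log_vmr:
            selected.append(gene)
    return selected


# Hypothetical toy data: four genes measured in four cells.
toy_expr = {
    "flat": [1, 1, 1, 1],        # no variance
    "bursty": [0, 0, 0, 20],     # moderate mean, high dispersion
    "rare": [0, 0, 0, 0.2],      # mean below the lower cutoff
    "abundant": [0, 40, 0, 40],  # mean above the upper cutoff
}
variable_genes = select_variable_genes(toy_expr)
```

Only genes whose mean falls inside the window and whose dispersion exceeds the cutoff are returned, mirroring the red points described in the caption.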

    S6 Fig. Clustering and alternative visualizations of scRNA-seq data.

    (A) Clustering of cell line versions. Number of clusters (3) was defined based on majority rule from a consensus of 30 indices. Ward’s minimum variance method was used. (B) Quantification of cluster fraction by cell line version. (C) Same as A but for sublines. Two clusters were found to be the consensus. (D) Same as B but for sublines. (E) PCA visualization of single-cell transcriptomes. (F) t-SNE visualization of single-cell transcriptomes. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. PCA, principal component analysis; scRNA-seq, single-cell RNA sequencing; t-SNE, t-distributed Stochastic Neighbor Embedding.

    (TIF)

    S7 Fig. Bulk RNA-seq data.

    (A) PCA of single-replicate normalized RNA-seq count data. (B) Hierarchical clustering of RNA-seq normalized count data. Clustering was performed on the pairwise Euclidean distance matrix created from the relative log transformed gene counts using Ward’s minimum variance method. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. PCA, principal component analysis; RNA-seq, RNA sequencing.

    (TIF)

    S8 Fig. VISION transcriptome functional interpretation analysis.

    Single-cell gene expression matrix and MSigDB hallmark gene signatures were input to create a signature score for each cell. Scores were totaled for each population across each hallmark and plotted as a density distribution. All 50 hallmark signatures were sampled. Note that “KRAS signaling” and “UV response” had hallmark signatures for both up- and down-regulated. We condensed these 4 signatures into 2, leaving 48 hallmark signatures total. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. MSigDB, molecular signatures database.

    (TIF)

    S9 Fig. GO semantic similarity scores for each GO ontology type.

    Significantly enriched GO terms for each data modality pair (IMPACT mutations and DEGs) were compared for each cell line family member for each GO type (BP, MF, and CC). The top 1,000 similarity scores within each pair were compiled into a distribution to calculate a median (white circle) and 95% confidence interval (error bars). Scores are plotted relative to a baseline, defined as the median + one standard deviation of simulated distributions (dashed lines). Simulated score distributions were calculated based on random gene lists of identical lengths to the experimental gene lists (see Materials and methods). The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. BP, Biological Process; CC, Cellular Component; DEG, differentially expressed gene; GO, Gene Ontology; MF, Molecular Function.

    (TIF)
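The summary statistic described in the S9 Fig caption (median of the top 1,000 similarity scores, plotted relative to a baseline of median plus one standard deviation of a simulated distribution) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, "relative to a baseline" is interpreted here as a simple difference, the simulated scores are assumed to be supplied as an already compiled list, and the 95% confidence interval is omitted.

```python
from statistics import median, stdev


def median_of_top_scores(scores, k=1000):
    """Median of the k largest similarity scores."""
    top = sorted(scores, reverse=True)[:k]
    return median(top)


def simulated_baseline(simulated_scores):
    """Baseline defined as median + one standard deviation of simulated scores."""
    return median(simulated_scores) + stdev(simulated_scores)


def score_relative_to_baseline(observed_scores, simulated_scores, k=1000):
    """Observed median minus the simulated baseline (difference is an assumption)."""
    return median_of_top_scores(observed_scores, k) - simulated_baseline(simulated_scores)


# Hypothetical toy scores; real inputs would be GO semantic similarity scores.
observed = [0.9, 0.8, 0.7, 0.1]
simulated = [0.2, 0.4, 0.6]
relative = score_relative_to_baseline(observed, simulated, k=3)
```

A positive value indicates that the observed median exceeds what random gene lists of the same length would typically produce under these assumptions.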

    S10 Fig. Baseline expression values across chromosomes for PC9-VU parental.

    Values are plotted as a heatmap. All PC9-VU sublines were compared against this baseline in the CNV analysis. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. CNV, copy number variant.

    (TIF)

    S11 Fig. Potential explanations for cell state heterogeneity in DS8.

    (A) Multiple genetic states hypothesis. In this scenario, a genetic resistance mutation was acquired after the DS8 subline was established. Assuming the mutant state does not outgrow the original genetic state (i.e., a “selective sweep”), both genetic states should coexist within the subline. (B) Single genetic state hypothesis. In this scenario, a genetic resistance mutation emerged within the PC9-VU parental population and a cell containing that mutation was isolated to establish the DS8 subline. To explain our single-cell transcriptomics data, we hypothesize that cell–cell interactions between mutant and PC9-VU cells increase the death rate for mutant cells, making them a small proportion (<2%) of the total PC9-VU population.

    (TIF)

    S12 Fig. Comparison of experimental and simulated cFP assays for PC9-VU sublines.

    (A) Experimental cFP time courses for 4 PC9-VU sublines (DS1, DS6, DS7, and DS9) in response to 3 μM erlotinib (same data used to generate DIP rate distributions in Fig 2D of the main text). Each trace corresponds to a single colony, normalized to 72 h postdrug treatment. Only colonies with cell counts greater than 50 at the time of treatment were kept; n represents the number of colony traces for each subline. (B) Simulated cFP time courses generated using division and death rate constants that closely reproduce the experimental time courses in A. Trajectories are normalized to the time at which the simulated drug treatment was initiated and simulated cell counts are plotted only at experimental time points. Although the same number of simulations were initiated as the number of colonies (n) in the corresponding experiment (see panel A), only simulated colonies with cell counts >50 at the time of simulated drug treatment are shown. (C) Comparison of experimental and simulated DIP rate distributions calculated from time courses in A and B. Distributions are compared statistically using the AD test (see Materials and methods). Bootstrapped p-values are shown (mean and standard deviation). Dashed black line signifies zero DIP rate, for visual orientation. (D) Parameter scan of division and death rate constants for the 4 sublines in A–C. For each pair of rate constants, we ran model simulations (same number as corresponding subline), calculated DIP rates, compiled them into a distribution, and then statistically compared against the corresponding experimental DIP rate distribution using the AD test (bootstrapped). All p < 0.05 are colored white, indicating lack of statistical correspondence to experiment. × denotes a division and death rate constant used in B. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. AD, Anderson–Darling; cFP, clonal fractional proliferation; DIP, drug-induced proliferation.

    (TIF)
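The simulated cFP time courses in S12 Fig are generated from division and death rate constants. A minimal sketch of such a simulation is given below, using the standard Gillespie algorithm for a linear birth-death process and estimating each colony's DIP rate as the least-squares slope of log2 cell count versus time. Several things here are assumptions rather than details from the paper: the function names, the rate constants and seed in the example, the choice to start each simulation at the time of drug treatment, and the slope-based DIP rate estimator. The bootstrapped AD comparison to experimental data is not included.

```python
import math
import random


def simulate_colony(n0, k_div, k_death, sample_times, rng):
    """Gillespie simulation of a linear birth-death process.

    Returns the cell count at each time in sample_times (sorted, in hours).
    """
    t, n, idx = 0.0, n0, 0
    counts = []
    while idx < len(sample_times):
        if n == 0:
            # Extinct colony: every remaining sample is zero.
            counts.extend([0] * (len(sample_times) - idx))
            break
        # Waiting time to the next division or death event.
        t_next = t + rng.expovariate((k_div + k_death) * n)
        # Record the current count for all sample times up to that event.
        while idx < len(sample_times) and sample_times[idx] <= t_next:
            counts.append(n)
            idx += 1
        t = t_next
        # Choose the event type in proportion to its rate constant.
        if rng.random() < k_div / (k_div + k_death):
            n += 1
        else:
            n -= 1
    return counts


def dip_rate(times, counts):
    """Least-squares slope of log2(count) versus time (doublings per hour)."""
    pts = [(t, math.log2(c)) for t, c in zip(times, counts) if c > 0]
    mt = sum(t for t, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((t - mt) ** 2 for t, _ in pts)
    sxy = sum((t - mt) * (y - my) for t, y in pts)
    return sxy / sxx


def simulated_dip_rates(n_colonies, n0, k_div, k_death, sample_times,
                        seed=0, min_cells=50):
    """DIP rates for colonies with more than min_cells at the first time point."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_colonies):
        counts = simulate_colony(n0, k_div, k_death, sample_times, rng)
        if counts[0] > min_cells and sum(c > 0 for c in counts) >= 2:
            rates.append(dip_rate(sample_times, counts))
    return rates


# Hypothetical parameters (per hour); not taken from the paper's rate tables.
times = [0.0, 24.0, 48.0, 72.0, 96.0]
rates = simulated_dip_rates(20, 100, 0.04, 0.01, times, seed=1)
```

For a single clonal population with fixed rate constants, the expected DIP rate is (k_div − k_death)/ln 2 doublings per hour, and the spread of the simulated rates around that value arises purely from the stochastic timing of division and death events.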

    S13 Fig. Genomic relatedness between PC9 cell line family members.

    (A) PCA of PC9 genotypes. Using a subset of SNPs in approximate linkage equilibrium, a genetic covariance matrix was calculated. The covariance matrix was converted to a correlation matrix to achieve appropriate scaling and PCA was run to identify SNP eigenvectors (loadings of the principal components). PC9 cell line family members are plotted along the principal component axes. (B) Hierarchical clustering of PC9 genotypes. Using an identity-by-state analysis, a matrix of genome-wide pairwise identities was calculated. Hierarchical clustering was performed on these identities to determine sample relatedness. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. PCA, principal component analysis; SNP, single nucleotide polymorphism.

    (TIF)
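The identity-by-state analysis mentioned in panel B of S13 Fig yields a matrix of pairwise genotype identities. One common way to compute such a matrix is sketched below, assuming biallelic SNP genotypes coded as alternate-allele counts (0, 1, or 2) with `None` for missing calls. The function names, the sample labels, and the toy genotypes are hypothetical, and this sketch does not claim to match the software or options the authors used.

```python
def ibs(g1, g2):
    """Fraction of alleles shared identical-by-state between two samples."""
    shared = 0
    total = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip sites with a missing genotype call
        shared += 2 - abs(a - b)  # 2, 1, or 0 alleles shared at this site
        total += 2
    return shared / total


def ibs_matrix(genotypes):
    """Pairwise IBS for a dict mapping sample name to a genotype list."""
    names = list(genotypes)
    return {
        x: {y: ibs(genotypes[x], genotypes[y]) for y in names}
        for x in names
    }


# Hypothetical toy genotypes at four SNPs for three samples.
toy_genotypes = {
    "A": [0, 1, 2, 2],
    "B": [0, 1, 2, 0],
    "C": [2, 1, 0, 0],
}
identities = ibs_matrix(toy_genotypes)
```

A distance for hierarchical clustering can then be taken as one minus the identity, so that samples sharing more alleles cluster together.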

    S14 Fig. Sample identification in “hashed” PC9 cell line family members.

    Proportional representation of cell populations with each of 8 specific “hashtag” antibodies, based on the HTO expression level. Each sample has a single corresponding HTO, while a minority of the HTO reads were unmapped. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. HTO, hashtag oligonucleotide.

    (TIF)

    S15 Fig. scRNA-seq quality control analyses.

    (A) Cell hashing allowed for detection of cell multiplets (1 droplet with more than 1 cell) because multiple HTOs would be detected for a single cell barcode (i.e., droplet). Cells were segregated into singlets and doublets (i.e., multiplets). All detected cell transcriptomes were visualized using UMAP, noting singlets and doublets. (B) Automated DD was performed on the detected cell transcriptomes. Doublets were predicted and noted on the same UMAP visualization as in A. (C) Cells were scored based on number of features (e.g., genes) and count of detected RNA molecules. Cells with scores below a specified threshold (see Materials and methods) were classified as “poor” quality. Quality of each cell was noted on the same UMAP visualizations as in A and B. A total of 7,892 cells passed singlet and quality control thresholding. The data underlying this figure can be found in github.com/QuLab-VU/GES_2021. DD, doublet detection; HTO, hashtag oligonucleotide; scRNA-seq, single-cell RNA sequencing; UMAP, Uniform Manifold Approximation and Projection.

    (TIF)

    Attachment

    Submitted filename: ReviewerResponses.pdf


    Data Availability Statement

    The sequencing datasets generated in this study can be found in the Gene Expression Omnibus (GEO; GSE150084) and Sequence Read Archive (SRA; PRJNA631050 and PRJNA632351). Additional experimental data are available on GitHub (github.com/QuLab-VU/GES_2021). The code used to generate model simulations and analyze experimental data is publicly available via GitHub (github.com/QuLab-VU/GES_2021).


    Articles from PLoS Biology are provided here courtesy of PLOS
