Abstract
Human cancer cell lines are the workhorse of cancer research. While cell lines are known to evolve in culture, the extent of the resultant genetic and transcriptional heterogeneity and its functional consequences remain understudied. Here, genomic analyses of 106 cell lines grown in two laboratories revealed extensive clonal diversity. Follow-up comprehensive genomic characterization of 27 strains of the common breast cancer cell line MCF7 uncovered rapid genetic diversification. Similar results were obtained with multiple strains of 13 additional cell lines. Importantly, genetic changes were associated with differential activation of gene expression programs and marked differences in cell morphology and proliferation. Barcoding experiments showed that cell line evolution occurs as a result of positive clonal selection that is highly sensitive to culture conditions. Analyses of single cell-derived clones demonstrated that ongoing instability quickly translates into cell line heterogeneity. Testing of the 27 MCF7 strains against 321 anti-cancer compounds uncovered strikingly disparate drug response: at least 75% of compounds that strongly inhibited some strains were completely inactive in others. This study documents the extent, origin and consequence of genetic variation within cell lines, and provides a framework for researchers to measure such variation in efforts to support maximally reproducible cancer research.
Human cancer cell lines have facilitated fundamental discoveries in cancer biology and translational medicine1. An implicit assumption has been that cell lines are clonal and genetically stable, and hence results obtained in one study can be readily extended to another. Yet findings involving cancer cell lines are often difficult to reproduce2,3, leading investigators to conclude that the findings were either weak or the studies not carefully conducted. For example, while pharmacogenomic profiling of large collections of cancer cell lines have proven largely reproducible, some discrepancies in drug sensitivity remain unexplained4–11. We hypothesized that cancer cell lines are neither clonal nor genetically stable, and that this instability can generate variability in drug sensitivity.
Cross-laboratory comparisons
To test the hypothesis that clonal variation exists within established cell lines, we re-analyzed whole-exome sequencing data from 106 cell lines generated by both the Broad Institute (the Cancer Cell Line Encyclopedia (CCLE)) and the Sanger Institute (the Genomics of Drug Sensitivity in Cancer (GDSC)), using the same analytical pipeline for both datasets (Methods).
As expected, estimates of allelic fraction (AF) for germline variants were nearly identical across the two datasets (median r=0.95), indicating that sequencing artifacts do not substantially contribute to the erroneous appearance of low AF calls. However, the degree of agreement in AF for somatic variants was substantially lower (median r=0.86; p<2*10−16; Fig. 1a, Extended Data Fig. 1a and Supplementary Table 1). Moreover, a median of 19% of the detected non-silent mutations (range, 10% to 90%) were identified in only one of the two datasets (Extended Data Fig. 1b). Likewise, 26% of genes altered by copy number alterations (CNAs) (range, 7% to 99%) were discordant (Extended Data Fig. 1c–e). These results indicate that genetic variability across versions of the same cell line is common. Indeed, a median of 22% of the genome was estimated to be affected by subclonal events across 916 CCLE cell lines (Extended Data Fig. 1f), suggesting that subclonality may underlie the observed differences.
Genetic variation across 27 MCF7 strains
We performed extensive genomic characterization of 27 versions (hereafter called “strains”) of the commonly used estrogen receptor (ER)-positive breast cancer cell line MCF7 (ref 12–14; Methods, Extended Data Fig. 1g–n, 2a–b and Supplementary Table 2), including 19 strains that had not undergone drug treatment or genetic manipulation, 7 strains that carried a genetic modification generally considered to be neutral (e.g., introduction of a reporter gene, Cas9 or a DNA-barcode), and one strain (MCF7-M) that had been expanded in vivo in mice following anti-estrogen therapy. Strain M was found to be an outlier, consistent with having been through strong bottlenecks, and was therefore excluded from downstream quantitative analyses.
Ten chromosome arms (25% of the genome) were differentially gained or lost in a pairwise comparison of strains (Supplementary Table 3). We detected 283 genes with copy-gains and 405 genes with copy-losses (compared to basal ploidy) in at least one strain. Only a small minority of these (13% of gains and 21% of losses) were detected in all strains. 7% of gains and 13% of losses were detected in only a single strain, and the remaining events were observed variably across strains (Fig. 1b and Supplementary Table 4). The differential events included genes commonly gained or lost in breast cancer (e.g. TP53, PTEN, EGFR, PIK3CA and MAP2K4; Extended Data Fig. 3a). For example, PTEN was deleted in 17 strains and retained in the other 10 (Fig. 1c). Similarly, the estrogen receptor gene ESR1 was gained in 12 strains, lost in 6, and unaltered in 9 (Fig. 1c), and this correlated with differential expression of ERα (p=0.009; Extended Data Fig. 3b–c and Supplementary Discussion).
Genetic variation was similarly observed at the level of point mutations, small insertions/deletions (indels) and chromosomal translocations. Only 35% of 95 non-synonymous single nucleotide variants (SNVs) and indels affecting coding sequence or splicing were shared by all strains. 29% were unique to a single strain, and the remaining were present in a subset of strains (Fig. 1d–e, Extended Data Fig. 3d–j, Supplementary Tables 5–6 and Supplementary Discussion). Similar, albeit lower, variability was observed among mutations listed as recurrent in the COSMIC database15, consistent with COSMIC mutations tending to be clonal founder mutations (Extended Data Fig. 3f).
Unsupervised hierarchical clustering, where genetic distance is reflected by branch lengths of the dendrogram, generated branch structure that accurately reflected the strains’ history. For example, strain M, which had been subjected to in vivo passaging and drug treatment, was the most genetically distinct; the 11 strains used by the Connectivity Map project16 over a 10 year period clustered tightly together; and sibling strains D and E, merely a few passages apart, were the closest to each other (Fig. 1f–g and Extended Data Fig. 3g). The genetic distance between strains appeared to be affected more by passage number and genetic manipulation than by freeze-thaw cycles (Fig. 1h and Extended Data Fig. 4).
Sources of variation
Analysis of variant AFs revealed extensive subclonality across strains (Fig. 2a–b and Extended Data Fig. 5a). For example, all 27 strains harbored the PIK3CA-activating mutation c.1633G>A, but the AF varied from 0.21 to 0.70 (Extended Data Fig. 5b). Based on AFs and copy number status, 45% of all observed mutations were determined to be subclonal (p<0.01 in a binomial test). PyClone17,18, which reconstructs subclonal structure by clustering mutations with similar cellular prevalence (CP), indicated multiple subclones within each MCF7 strain, with varying abundance across strains (Fig. 2c). Indeed, for 43% of the non-silent SNVs, CP differed by >50% across strains (Extended Data Fig. 5c–d and Supplementary Table 7).
We next asked whether clonal dynamics were stochastic or the product of selection. We barcoded MCF7 cells (strain D) and evaluated the change in barcode representation over time under five culture conditions, each in five replicates. We reasoned that if clonal dynamics were stochastic, distinct barcoded populations would emerge in independent replicates. In contrast, if pre-existing subclones were selected under different conditions, enrichment of the same barcodes would be observed in replicate cultures19. Unsupervised hierarchical clustering by barcode representation revealed that biological replicates clustered together (Fig. 2d and Supplementary Table 8), indicating that pre-existing clones are indeed selected by changes in culture conditions.
Next, we characterized the genetic stability of three wild-type (WT) single cell-derived MCF7 clones and five single cell-derived clones with a “neutral” genetic manipulation (stable expression of a luciferase reporter; Methods, Extended Data Fig. 5e and Supplementary Tables 9–10). Clones derived from the same parental population differed in their mutational landscapes: a median of 15% of the non-silent SNVs detected in the WT parental population (range, 13% to 16%), were not observed in their single cell-derived progeny, or vice versa (Extended Data Fig. 5f–g).
Moreover, the single-cell clones continued to evolve into heterogeneous populations. We propagated two clones for 8–14 months, and sequenced their DNA at multiple time points (Supplementary Tables 9–10). A median of 13% of the non-silent SNVs (range, 8% to 16%) were not shared between time points (Extended data Fig. 5g). Similar results were observed based on cytogenetic analysis (Extended Data Fig. 5h–k and Supplementary Table 11), indicating that even single cell-derived clones are genomically unstable.
Gene expression variation
We next measured transcriptomic variation across the MCF7 strains using the L1000 assay16,20,21 (Supplementary Table 12). Despite an overall similarity in their global gene expression profiles (Fig. 3a and Extended Data Fig. 6a), the 27 strains also showed extensive expression variation: a median of 654 genes (range, 10–1,574) were differentially expressed by at least two-fold between pairs of strains (p<0.05, q<0.05), and the differentially expressed genes converged on important biological pathways (Extended Data Fig. 6b–d and Supplementary Table 13). Importantly, the 27 strains clustered similarly in the space of mutations and expression profiles, and the expected downstream consequences of genetic mutations were observed in the gene expression variation (Fig. 1f–g, Fig. 3b–g, Extended Data Fig. 6e–i and Supplementary Table 14). For example, strains with inactivating PTEN mutation or activating PIK3CA mutation exhibited decreased PTEN and increased mTOR gene expression signatures, respectively (Fig. 3e–f and Extended Data Fig. 6g–i). Similarly, copy loss of ESR1 was associated with reduced estrogen signaling (Fig. 3g and Extended Data Fig. 6g).
We further explored gene expression heterogeneity by single-cell RNA sequencing (scRNA-seq) of 26,465 individual cells from two parental and four single cell-derived clones (Methods, Extended Data Fig. 6j–r and Supplementary Discussion). Unsupervised clustering showed that cells from the single cell-derived clones did not cluster independently, but were mixed with the parental population, indicating high similarity in overall gene expression (Fig. 3h and Extended Data Fig. 6o). Interestingly, the extent of expression heterogeneity among the single cell-derived clones was not substantially lower than that seen in the parental population (Extended Data Fig. 6p), and increased with time in culture (Extended Data Fig. 6q–r, Supplementary Table 15 and Supplementary Discussion). These results suggest that variation in gene expression arises de novo, in addition to reflecting selection of pre-existing clones22.
Verification in additional cell lines
To exclude the possibility that the variation observed across MCF7 strains was unique to that cell line, we repeated genomic analyses on 23 strains of the commonly used lung cancer cell line A549 (ref. 23) (Extended Data Fig. 2c–d and Supplementary Tables 16–20). We observed a similar level of molecular variation across these strains (Extended Data Fig. 7). For example, loss of CDKN2A, the most significantly deleted gene in lung adenocarcinomas24, was detected in 5 strains, but normal copy number was retained in the other 18 (Extended Data Fig. 7f). Whereas transcriptome analyses showed that estrogen signaling was the most variable pathway in MCF7 cells (Extended Data Fig. 6c and Supplementary Table 13), KRAS signaling was the most variable pathway in A549 (Extended Data Fig. 7n and Supplementary Table 20), a commonly used model of KRAS-dependent cancer.
The generalizability of our findings was further confirmed by deep targeted sequencing of multiple strains from 11 additional cell lines (Supplementary Tables 21–24 and Extended Data Fig. 8). Notably, genomic instability was not limited to transformed cancer cell lines (Supplementary Discussion). For example, the variation across 15 strains of MCF10A25, a non-transformed human mammary cell line, was as high as that seen in MCF7 cancer cells (median discordance, 26%; range, 17% to 40%; Extended Data Fig. 8a,h).
Functional consequences of genomic variation
The extensive genomic variation across strains was associated with variation in biologically meaningful cellular properties. We examined several measures of basic cellular function, including doubling time and cell morphology, using quantitative live cell imaging26 (Methods). MCF7 strains varied in doubling times by as much as 3.5-fold (median, 31h; range, 22–78h) (Extended Data Fig. 9a–b). Similarly, cell size and shape were highly variable across strains (Extended Data Fig. 9c–f and Supplementary Table 25). Clustering based on morphological traits echoed that based on genomics or transcriptomics (Extended Data Fig. 9g), and genomic features correlated with proliferation (Extended Data Fig. 9h–i and Supplementary Discussion).
Genomic instability also had major impact on drug response. We measured cell viability following treatment with 321 drugs at a single concentration (5μM) across the 27 MCF7 strains (Supplementary Table 26). 55 compounds had strong activity (>50% growth inhibition) against at least one strain. However, at least one strain was entirely resistant (<20% growth inhibition) to 48 of 55 (87%) active compounds (Fig. 4a–b and Extended Data Fig. 10a). The same phenomenon was observed at a more stringent threshold: of 42 compounds with strong activity in at least two strains, 33 (79%) were inactive in at least two strains (Extended Data Fig. 10b,c,d,j and Supplementary Discussion). All 33 differentially active compounds validated in an 8-point dose-response testing of each of the 27 strains (median Spearman’s rho=0.42 between screens, p=3*10−9; Extended Data Fig. 10k, Supplementary Table 27 and Supplementary Discussion).
The high degree of variability in drug response cannot be explained by irreproducibility of the assay. First, replicate treatments yielded highly concordant results (median Pearson’s r=0.97, p<2*10−16) (Extended Data Fig. 10l). Second, compounds with the same mechanism-of-action had similar patterns of activity across strains (Fig. 4a,c; p=3*10−7). For example, the same activity pattern was observed for three proteasome inhibitors (bortezomib, MG-132 and carfilzomib) (Fig. 4d), and was associated with biochemically-measured differential proteasome activity (Extended Data Fig. 10m–o). Third, for 82% of differentially active compounds, we found differential gene expression signatures of compound mechanism-of-action 27 between sensitive and insensitive strains (p=2*10−5; Fig. 4e–h, Extended Data Fig. 10p–u and Supplementary Tables 28–29).
Indeed, drug response was associated with transcriptional differences in relevant pathways. For example, strains sensitive to CDK inhibitors had an upregulated cell cycle signature, and strains sensitive to PI3K inhibitors had an upregulated mTOR signature (Fig. 4f–g and Extended Data Fig. 10p–q). Interestingly, the strains most resistant to treatment in general (strains M and Q) downregulated a signature of drug metabolism (Extended Data Fig. 10v). Differences in proliferation rate did not explain the majority of the observed differential drug activity (median Spearman’s rho=0.017; p=0.60; Supplementary Table 30).
Genetic variation could be linked directly to differential drug response. For example, genetic inactivation of PTEN was associated with decreased PTEN and increased AKT expression signatures (Fig. 1c,e and Fig. 3e–f), and increased sensitivity to the AKT inhibitor IV (Fig. 4h–i). Similarly, ESR1 loss was associated with reduced estrogen signaling (Fig. 1c and 3g), which was in turn associated with reduced sensitivity to tamoxifen or estrogen depletion (Fig. 4j and Extended Data Fig. 10w–x). More broadly, clustering of the MCF7 strains based on their drug response was highly similar to clustering based on genetics or gene expression (Fig. 1g, 2a, 3b, 4a, Extended Data Fig. 11a and Supplementary Discussion). Genome-wide CRISPR screens revealed that genetic dependencies were affected by genomic variation similarly to pharmacological dependencies (Extended Data Fig. 11b–f, Supplementary Table 31 and Supplementary Discussion), and functional analyses revealed that single cell-derived clones remained phenotypically unstable (Extended Data Fig. 11g–i and Supplementary Discussion).
We thus hypothesized that variation across otherwise isogenic strains might be harnessed to discover mechanisms of drug sensitivity and resistance. Indeed, we found that basal gene expression profiles across the 27 MCF7 strains could be more readily connected to mechanism-of-action of active drugs than did larger panels of breast cancer cell lines derived from different patients5,8 (Fig. 4k, Supplementary Table 32 and Supplementary Discussion).
Discussion
Our results show that established cancer cell lines, generally thought to be clonal, are in fact highly genetically heterogeneous across strains. This heterogeneity results both from clonal dynamics (i.e., changes in the abundance of pre-existing clones) and from ongoing instability (i.e., the appearance of new genetic variants). Moreover, genetic heterogeneity yields varying patterns of gene expression, which in turn result in differential drug sensitivity. These findings have a number of important implications summarized in Extended Data Table 1.
We found that changes in clonal composition underlie much of the observed variability in cell line behavior. Such clonal composition changes follow selection by particular conditions (e.g., growth media), or by genetic manipulations associated with a population bottleneck. The genetic distance between cell line strains was strongly correlated with their gene expression distance and with their drug response distance. Cell line diversification can therefore be estimated using inexpensive profiling methods (Extended Data Fig. 11j). To facilitate routine assessment of cell line diversification, we have created the Cell STRAINER (STRAin INstability profilER) portal (https://cellstrainer.broadinstitute.org), where users can upload cell line genomic data and measure the strain’s genetic distance from a reference.
Variation within cancer cell lines can also be useful in at least two ways. First, deeper characterization (e.g., by single-cell sequencing) of the heterogeneity within cultures of common cell lines could enable the study of cooperative and competitive interactions between cancer cell populations28,29, and mechanisms of pre-existing drug resistance19. Second, due to their matched genetic background, naturally-occurring “isogenic-like” strains could help uncover the association between molecular features and phenotypes such as drug response.
We conclude that cancer cell lines remain a powerful tool for cancer research, but the high degree of variation across cell line strains must be considered in experimental design and data interpretation.
Online Methods
Cell culture
MCF7, HT29, MDAM453 and A375 cell line strains were cultured in RPMI-1640 (Life Technologies), with 10% Fetal Bovine Serum (Sigma-Aldrich) and 1% Penicillin-Streptomycin-Glutamine (Life Technologies). A549, VCaP, PC3, HCC515, HepG2, HeLa and Ben-Men-1 cell line strains were cultured in DMEM (Life Technologies), with 10% Fetal Bovine Serum (Sigma), 2mM Glutamine (Sigma-Aldrich), and 1% Penicillin-Streptomycin-Glutamine (Life Technologies). HA1E cell line strains were cultured in MEMα (Life Technologies), with 10% Fetal Bovine Serum (Sigma), 2mM Glutamine (Sigma-Aldrich), and 1% Penicillin-Streptomycin-Glutamine (Life Technologies). MCF10A cell line strains were cultured in MEGM Mammary Epithelial Cell Growth Medium (Lonza) supplemented with the MEGM Bulletkit (Lonza). The single cell-derived clones scWT3, scWT4 and scWT5, as well as their parental MCF7 population, were cultured in DMEM (Life Technologies), with 10% Fetal Bovine Serum (Sigma), 2mM Glutamine (Sigma-Aldrich), 1% Penicillin-Streptomycin-Glutamine (Life Technologies), and 10μg/mL Insulin (Sigma-Aldrich). Cells were incubated at 37°c, 5% CO2, and passaged twice a week using Trypsin-EDTA (0.25%) (Life Technologies). All strains of the same cell line were cultured under the same conditions, cell identity was confirmed and they were confirmed to be mycoplasma-free. Cells were tested for mycoplasma contamination using the MycoAlert™ Mycoplasma Detection Kit (Lonza), according to the manufacturer’s instructions. Cell line identity was confirmed using SNP-based DNA fingerprinting (see below).
Derivation of single-cell clones
The WT single cell-derived MCF7 clones were generated by cell sorting. Single cells were sorted into individual wells of 96-well plates, using BD FACSAriaII SORP Cell Sorter. Three resultant clones were expanded for a period of ~3 months before prior to the experiments. The genetically-manipulated single cell-derived MCF7-GREB1 and MCF7-ESR1 clones were generated using CRISPR/Cas9 mediated genome engineering to insert a NanoLuciferase reporter gene into the 3’-UTR of the respective genes. Briefly, a selectable reporter gene cassette was engineered using the EMCV IRES element to drive expression of the destabilized NLucP reporter gene (Promega) fused to the N-terminus of the Bsr blasticidin-resistance gene (Invivogen) containing a P2A self-cleaving peptide element. For targeting GREB1, the reporter gene cassette was subcloned into a construct containing ~2 kb of GREB1 gene surrounding the termination codon in exon 33, such that reporter gene cassette is located 9 bp downstream of the GREB1 termination codon in the resulting mRNA hybrid transcript. A GREB1-specific sgRNA was generated recognizing the sequence GCTGACGGGACGACACATCTG on the sense strand, and utilizing a PAM site that is adjacent to the GREB1 termination codon. For targeting ESR1, the reporter gene cassette was subcloned into a construct containing ~2 kb of ESR1 gene surrounding the termination codon in exon 8, such that reporter gene cassette is located 21 bp downstream of the ESR1 termination codon in the resulting hybrid mRNA transcript. An ESR1-specific sgRNA was generated recognizing the sequence GTCTCCAGCAGCAGGTCATAG on the anti-sense strand, and utilizing a PAM site that is 160 bp upstream of the ESR1 termination codon. Corresponding Cas9-sgRNA and targeting construct pairs were transiently co-transfected into MCF7 cells using the LipofectAMINE 2000 reagent (Thermo-Fisher Scientific). After outgrowth for 7 days, the cells were cultured in media containing 5 μg/ml blasticidin to select for the desired recombinants. Single-cell clones were then isolated by a limiting dilution single-cell cloning in 96-well plates.
Growth rate analysis
Cells were seeded in triplicates in white, clear bottom, 96-well plates (Corning #3903), at a density of 5,000 cells/well. Plates were incubated in an IncuCyte ZOOM instrument (Essen Bioscience) at 37°C, 5% CO2. Four non-overlapping phase-contrast images (10X) were taken every 2 hours for a total of 160 hours. The IncuCyte ZOOM Software (version 2015A) was used to calculate the mean confluence per well at every time point (filtered to exclude objects smaller than 100 μm2), and averaged across wells to calculate the mean confluence per strain. Doubling times were calculated for each strain, using the formula Tdoubling = (log2*ΔT)/(log(c2)-log(c1)), where c1 and c2 were the minimum and maximum percent confluence during the linear growth phase, respectively, and ΔT was the time elapsed between c1 and c2. To account for potential differences in cell recovery following seeding, t=0 h was defined as the first time point in which the mean strain confluence surpassed a threshold of 15%. To examine the effect of estrogen-depletion on the growth of MCF7 strains, cells were cultured either in standard conditions (described above) or in estrogen-depleted conditions: RPMI-1640 without Phenol Red (Life Technologies), with 10% Charcoal Stripped Fetal Bovine Serum (Life Sciences) and 1% Penicillin-Streptomycin-Glutamine (Life Technologies). Comparison between standard and estrogen-depleted conditions was performed by calculating the fold-change in doubling time between the two conditions.
Cell Painting
Cells were plated in triplicate at a density of 1,000 cells per well, and then stained and fixed as previously described26,30. Images were taken on a Perkin-Elmer Opera Phenix microscope with a 20X/1.0NA water immersion lens. Image quality control was carried out as previously described31, using CellProfiler30 and CellProfiler-Analyst31. For all 27 MCF7 strains, the majority of images in all three wells passed quality control, and therefore all strains were further considered. Image illumination correction and analysis were performed in CellProfiler. For each of the 27 MCF7 strains, the median value of the 1,784 measured features was computed and used for hierarchical clustering.
DNA & RNA extraction
Genomic DNA was extracted using the DNeasy Blood & Tissue Kit (Qiagen), according to the manufacturer’s protocol. Total RNA was extracted using the RNeasy Plus Mini Kit (Qiagen), according to the manufacturer’s protocol.
DNA fingerprinting
Fingerprinting analysis was performed using 44 polymorphic loci. Picard Tools “GenotypeConcordance” was used to calculate the concordance between every pair of samples (for the MCF7 and A549 cohorts, separately). Samples with >95% concordance were called a match.
Ultra low pass whole-genome DNA sequencing (ULP-WGS)
Copy number characterization was performed using low-pass (~0.2x coverage) whole-genome sequencing. Libraries were prepared from 50ng of DNA using ThruPLEX-DNAseq sample preparation kits (Rubicon Genomics), according to the manufacturer’s protocol. The resultant libraries were quantified by Qubit fluorometer, Agilent TapeStation 2200, and RT-qPCR using the Kapa Library Quantification kit (Kapa Biosystems), according to the manufacturer’s protocol. Uniquely indexed libraries were pooled in equimolar ratios and sequenced on a single Illumina NextSeq500 run with paired-end 35bp reads, at the Dana-Farber Cancer Institute Molecular Biology Core Facilities. The reads were aligned to the UCSC hg19 reference genome, using “BWA-MEM” (v0.07.15), with default parameters.
ULP-WGS data analysis
The ichorCNA algorithm32 was applied to identify copy number alterations (CNAs) of large genomic segments, chromosome arms and whole chromosomes. First, the genome was divided into 1Mb bins and read counts were generated for each bin using the HMMcopy Suite (http://compbio.bccrc.ca/software/hmmcopy/). The raw read counts were then normalized to correct for GC-content and mappability biases using the HMMcopy R package33, generating corrected read counts for each 1Mb bin. Segmentation and copy number prediction for each sample were performed using ichorCNA v0.1.0 (https://github.com/broadinstitute/ichorCNA), which is optimized for low coverage whole-genome sequencing. Parameters were initialized based on prior knowledge: --normal=0.01, --ploidy=c(3, 3.5, 4), --txnE=0.99999 --txnStrength=10,000, --maxCN=8. Remaining parameters were set to the default. For bin-level comparison between strains, we used the log2-transformed corrected read counts and determined gain and loss status using thresholds of 0.1 and −0.1, respectively. For arm-level calls, the copy number status was determined based on the largest overlapping segment.
Deep targeted sequencing (Profile OncoPanel v3)
Deep (~250x coverage) targeted exon sequencing of 447 genes commonly mutated in cancer was performed. Prior to library preparation, DNA was fragmented (Covaris sonication) to 250bp and further purified using Agentcourt AMPure XP beads. Size-selected DNA was ligated to sequencing adaptors with sample-specific barcodes during automated library preparation (SPRIworks, Beckman-Coulter). Libraries were pooled and sequenced on an Illumina Miseq to estimate library concentration based on the number of index reads per sample. Library construction was considered to be successful if the yield was ≥250ng, and all samples yielded sufficient library. Normalized libraries were pooled in batches, and hybrid capture was performed using the Agilent Sureselect Hybrid Capture kit with the POPv3_824272 bait set34. The list of 447 genes included in POPv3_824272 is provided as Supplementary Table 2. Captures were then pooled and sequenced on one HiSeq3000 lane. Pooled sample reads were de-convoluted and sorted using the Picard tools (http://broadinstitute.github.io/picard). The reads were aligned to the reference sequence b37 edition from the Human Genome Reference Consortium using “bwa aln” (http://bio-bwa.sourceforge.net/bwa.shtml ), with the following parameters: “-q 5 -l 32 -k 2 -o 1”, and duplicate reads were identified and removed using the Picard tools35. The alignments were further refined using the GATK tool for localized realignment around indel sites (https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php). Recalibration of the quality scores was also performed using GATK tools (http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr )36,37. Metrics for the representation of each sample in the pool were generated on the unaligned reads after sorting on the barcode (http://broadinstitute.github.io/picard/picard-metric-definitions.html). All samples achieved the CCGD recommended threshold of >30x coverage for >80% of the targeted bases. Average mean exon target coverage was 251.5x (range: 171.5x-336.7x) for the MCF7 samples, 288.9x (range: 208.2x-398.9x) for the A549 samples, and 257.32 (range 211.7x-442.68x) for the additional cell line samples.
Targeted sequencing data analysis
Mutation analysis for single nucleotide variants (point mutations, or SNVs) was performed using MuTect v1.1.438. Indel calling was performed using the SomaticIndelDetector tool in GATK ((http://www.broadinstitute.org/cancer/cga/indelocator). Consecutive variants in the same codon were re-annotated to maximize the effect on the codon and marked as “Phased” variants. MuTect was run in paired mode, pairing the MCF7/A549 samples to a normal sample, CEPH1408. Mutations were called if detected in >2% of the reads (AF>0.02). All SNVs, indels, and phased variants were annotated with Variant Effect Predictor (VEP)39. Variants were filtered against the 6,500 exome release of the Exome Sequencing Project (ESP) database. Variants represented more than once in either the African- or European-American populations and less than twice in COSMIC were considered to be germline (given that no matched normal samples were available). A germline filter was not applied to the downstream analyses, however, as changes in such mutations between strains of the same cell line would have to arise in culture and may be functionally relevant. Non-silent mutations were considered to be those with the following BestEffect Variant Classification: missense, initiator codon, nonsense, splice acceptor, splice donor, splice region, frameshift, inframe insertion or inframe deletion. Mutations that appeared more than once in COSMIC were regarded as COSMIC mutations. The complete list of variants (SNVs, indels, and phased) for MCF7, A549 and additional cell lines are provided as Supplementary Tables 5, 17 and 23, respectively.
Copy number variants (CNVs, or CNAs) were identified using RobustCNV, an algorithm that relies on localized changes in the mapping depth of sequenced reads in order to identify changes in copy number at the sampled loci (Ducar et al. Manuscript in preparation). Systematic bias in mapping depth was reduced using robust regression, fitting the observed tumor mapping depth against a panel of normal samples (PON) captured using the same bait set. Observed values were then normalized against predicted values and expressed as log2ratios. A second normalization step was performed to remove GC bias, using a loess fit. Log2ratios were centered on segments determined to be diploid based on the allele fraction of heterozygous SNPs in the targeted panel. Normalized coverage data were next segmented using Circular Binary Segmentation40 with the “DNAcopy” Bioconductor package. Finally, segments were assigned gain, loss, or normal-copy calls using a cutoff derived from the within-segment standard deviation of post-normalized mapping depths. Due to the high data quality and low within-segment standard deviation, a cutoff of ~0.1 was applied to all samples. Segment calls were summarized to gene calls by assigning them to capture intervals, and then counting the interval calls for each gene. Gene level calls were determined according to the following rules: “gain” = “+” calls >50%; “loss” = “-” calls >2 or in 100%; “gain+loss” = “-” calls >2 times and “+” calls <50%; “mixed” = “+” and “-” calls in the same gene, but below threshold; “Normal+” = “+” calls, but below threshold; “Normal-” = “-” calls, but below threshold; “Normal” = no “+” or “-” calls. The complete list of CNAs for MCF7, A549 and additional cell lines are provided as Supplementary Tables 4, 16 and 22, respectively.
For a subset of 60 genes (listed in Supplementary Table 2), rearrangements (structural variants, or SVs) were detected using BreaKmer41, which is designed to detect larger genomic structural variations from single sample aligned short read target-captured high-throughput sequence data. Briefly, the method extracts ’misaligned’ sequences from a targeted region, such as split-reads and unmapped mates, assembles a contig from these reads, and re-aligns the contig to make a variant call. It classifies detected variants as ”insertions/deletions”, ”tandem duplications”, ”inversions”, and ”translocations”. The complete list of structural variants for MCF7 and A549 are provided as Supplementary Tables 6 and 18, respectively. Rearrangements were visualized using the “Circos” visualization tool69.
Clonality analysis
To resolve clonal dynamics/composition we applied the PyClone algorithm v0.13.0 (https://bitbucket.org/aroth85/pyclone/wiki/Home) to the measured allelic fractions, accounting for copy number, LOH and cellularity17. PyClone enabled us to follow clonal dynamics throughout the evolution of cell populations17,18. For copy number input, we used results from ichorCNA segmentation and copy number predictions. Mutations with <50 read depth were excluded. The following parameters were used for PyClone: 10,000 iterations, 1,000 burn-in, “total_copy_number” for the prior. We also performed multi-sample analysis using PyClone, to determine the changes in clonal composition across strains. For the multi-sample analysis, mutations were selected as the union set across all 27 strains. The same parameters were used for PyClone multi-sample analysis as for the individual-sample runs.
DNA Barcoding experiment
Degenerate oligos for sgRNA-barcode library construction were synthesized by IDT and cloned into lentiGuide-Puro42 by Gibson assembly, as describe in Joung et al.43. Approximately 300μg of Gibson product was transformed into 25μL of Endura electrocompetent cells (Lucigen). After a 1 hour recovery period, 0.1% of transformed bacteria were plated in a 10-fold dilution series on ampicillin plates to determine the number of successful transformants. The remainder of the transformed bacteria were cultured in 50mL of LB with 50ug/mL ampicillin for 16 hours at 30°c. Plasmid libraries were extracted using Plasmid MidiPlus kit (Qiagen) and sequenced to a depth of 6.2 million reads on Illumina Miniseq, corresponding to 6X coverage of >1 million barcodes. Lentivirus was prepared by transfecting a total of 10 million HEK 293FT cells, as described in Joung et al.43. The MCF7-D strain was cultured in standard conditions (described above), and four million cells were infected with a low multiplicity of infection (20–30%) to reduce the probability of each cell being infected with more than one barcode. Cells underwent puromycin selection, and the final cell pool contained ~160,000 unique barcodes. Cells were expanded for the experiment, and five million cells were then plated into each of 25 tissue culture flasks. Five culture conditions were then applied (with five replicates per condition): 1) RPMI-1640 (Life Technologies) with 10% Fetal Bovine Serum (Sigma-Aldrich) and 1% Penicillin-Streptomycin-Glutamine (Life Technologies); 2) DMEM (Life Technologies) with 10% Fetal Bovine Serum (Sigma-Aldrich) and 1% Penicillin-Streptomycin-Glutamine (Life Technologies); 3) RPMI-1640 without Phenol Red (Life Technologies), with 10% Charcoal Stripped Fetal Bovine Serum (Life Sciences) and 1% Penicillin-Streptomycin-Glutamine (Life Technologies); 4) RPMI-1640 (Life Technologies) with 10% Fetal Bovine Serum (Sigma-Aldrich), 1% Penicillin-Streptomycin-Glutamine (Life Technologies) and 0.05% DMSO (Sigma-Aldrich); 5) RPMI-1640 (Life Technologies) with 10% Fetal Bovine Serum (Sigma-Aldrich) and 1% Penicillin-Streptomycin-Glutamine (Life Technologies), supplemented for the first 48 hours with 500nM bortezomib (Selleckchem S1013). After five weeks of culture, DNA was extracted and barcode abundance was assessed by DNA sequencing, as described in Joung et al.43. Libraries were sequenced to a median depth of 4.2 million reads, corresponding to a barcode coverage of >26X.
Transcriptional profiling with L1000
The L1000 expression-profiling assay was performed as previously described16. First, mRNA was captured from cell lysate using an oligo dT coated 384 well Turbocapture plate. The lysate was then removed, and a reverse transcription mix containing MMLV was added. The plate was washed and a mixture containing both upstream and downstream probes for each gene was added. Each probe contained a gene specific sequence, along with a universal primer site. The upstream probe also contained a microbead-specific barcode sequence. The probes were annealed to the cDNA over a 6-hour period, and then ligated together to form a PCR template. After ligation, Hot Start Taq and universal primers were added to the plate. The upstream primer was biotinylated to allow later staining with strepdavodin-phycorethrin. The PCR amplicon was then hybridized to Luminex microbeads via the complimentary, probe-specific barcode on each bead. After overnight hybridization the beads were washed and stained with strepdavodin-phycorethrin to prepare them for detection in Luminex FlexMap 3D scanners. The scanners measured each bead independently and reported the bead color/identity and the fluorescence intensity of the stain. A deconvolution algorithm converted these raw fluorescence intensity measurements into median fluorescence intensities for each of the 978 measured genes, producing the GEX level data. This GEX data was then normalized based on an invariant gene set, and then quantile normalized to produce QNORM level data. An inference model was applied to the QNORM data to infer gene expression changes for a total of 10,174 features. Per-strain gene expression signatures were calculated using a weighted average of the replicates, where the weights are proportional to the Spearman correlation between the replicates.
Transcriptional profiling data analysis
To examine how newly profiled MCF7 and A549 cells compared in gene expression to a previously acquired collection of cell line profiles (untreated samples that served as controls for Connectivity Map perturbational experiments), we used t-distributed stochastic neighbor embedding (t-SNE). Profiles were restricted to untreated profiles from the nine core Connectivity Map cell lines, and to batches with multiple untreated profiles. As samples first clustered based on their project codes, batch effect was next removed using the COMBAT algorithm44. t-SNE was applied on the batch-corrected data and visualized using a scatter plot. Analysis was completed using the “Rtsne” R package version 0.13. For the comparison of transcriptional variation across the nine core Connectivity Map cell lines, the collection of untreated profiles generated with the L1000 assay was used. Five profiles from each cell line were randomly chosen, and the expression variance of the 978 L1000 “landmark” genes was calculated for each cell line. For the comparison of L1000 gene expression data to the Cancer Cell Line Encyclopedia (CCLE) gene expression profiles, RNAseq and Affymetrix gene expression profiles were downloaded from the CCLE website (https://portals.broadinstitute.org/ccle/data). Data within each platform were processed using invariant set scaling, which adjusts profiles according the expression of 80 “invariant” genes, followed by quantile normalization16. The ranked gene expression order of the 978 “landmark” genes was compared using Spearman’s correlation.
Chemical screening
MCF7 strains were tested against a small molecule Informer Set library of 321 anti-cancer compounds, assembled by the Cancer Target Discovery and Development (CTD2) (https://ocg.cancer.gov/programs/ctd2/data-portal), using the same principles as those described in the Cancer Therapeutics Response Portal8,45. The list of screened compounds is detailed in Supplementary Table 26. Cells were seeded in in their culture media in white, 384 well plates (Corning #3570) at an initial density of 2,500 cells per well and incubated overnight at 37°c, 5% CO2. The next day, 25nL (for primary screen) or 100nL (for confirmation dose response screen) of compound stocks in DMSO were added by pin transfer. Plates were incubated for 72 hours, cooled at RT for 10 minutes, and viability was measured using the CellTiter-Glo luminescent cell viability assay (Promega), according the manufacturer’s protocol. After 10 minutes of incubation, luminescence was read on a Perkin Elmer Envision reader, at a speed of 0.1s/well.
Chemical screening data analysis
Data were analyzed in Genedata Screener version 13.0, using the normalization method “neutral controls”, where the median of 32 DMSO wells on each plate was set to 0% activity and 0 raw signal was set to −100%. Positive controls (20μM MG-132 or 20μM bortezomib) were included on all plates (16 wells each) but were not used for normalization due to variability of response across cell lines. Dose response curves were fit using the “Smart Fit” strategy in Genedata. The % effect was defined as the high-concentration asymptote (Sinf) and qEC50 was the concentration at which the fitted curve crossed the inhibitory value representing half the maximal % effect. In addition, parameters were calculated at which the curve crossed absolute inhibitory values of 30% or 50% regardless of maximal effect (AbsEC30 and AbsEC50, respectively). AUC calculations were performed as previously described8: curves were fit with nonlinear sigmoid functions, forcing the low concentration asymptote to 1 using a 3-parameter sigmoidal curve fit. The AUC for each compound-strain pair was calculated by numerically integrating under the 8-point concentration-response curve. For visualization purposes, drug response curves were fit with a 4-parameter log-logistic function, based on normalized viability data from which the lowest dose viability had been subtracted. Plots were generated using the “LL.4” function in the “drc” R package (https://cran.r-project.org/web/packages/drc/). To examine a potential link between proliferation rate and differential drug response, doubling times were compared against the AUC values of the 33 differentially-active compounds.
Gene Set Enrichment Analysis (GSEA)
GSEA was performed using the 10,147 genes best inferred from the Connectivity Map linear model33, also known as the BING gene set. Samples were divided into two classes depending on the comparison being made: samples with a genetic alteration vs. samples without it; samples sensitive to a drug (>50% inhibition) vs. samples insensitive to the same drug (<20% inhibition). Differential expression was calculated using the signal-to-noise (S2N) metric46. A ranked gene list and S2N values served as the input for the GSEA pre-ranked module of Gene Set Enrichment Analysis, using the Java app version 3.0. The analysis was run using the ‘hallmark’, ‘KEGG’, ‘positional’ and ‘oncogenic’ signature collections from MsigDB. To compare between our MCF7 panel, CTD2 and GDSC, drug response were downloaded from the CTRP website (https://ocg.cancer.gov/programs/ctd2/data-portal; “v20.data.curves_post_qc” file, updated October 14th 2015) and from the GDSC website (http://www.cancerrxgene.org/downloads; “log(IC50) and AUC values” file, updated July 4th 2016). Expression profiles were downloaded from the CCLE website to match the CTD2 drug response data (https://portals.broadinstitute.org/ccle/data; “CCLE_Expression_Entrez_2012–09-29.gct”; updated October 17th 2012), and from the GDSC website to match with the GDSC drug response data (http://www.cancerrxgene.org/downloads; “RMA normalized expression data for cell lines”, updated March 2nd 2017). Expression profiles were filtered to include only the genes that belong to the L1000 BING set. GSEA compared the expression patterns of the 5 strains/cell lines with the highest AUC values for each matched drug with the 5 strains/cell lines with the lowest AUC values for that drug. As the robustness of gene expression signatures varies, this quantitative analysis was restricted to the 50 well-defined hallmark GSEA gene sets27.
Single-cell RNA sequencing
MCF7 cells were cultured as described above. For following transcriptional changes post drug treatment, MCF7-AA cells were exposed to 500nM of bortezomib (Selleckchem S1013 ) and harvested before treatment, after 12 hours of exposure (t12), after 24 hours of exposure (t48), or after 72 hours of exposure followed by drug wash and 24 hours of recovery (t72+24). Cells were washed, trypsinized, passed through a 40μM cell strainer, centrifuged at 400g, and resuspended at a concentration of 1,000 cells/μL in PBS + 0.5% BSA. Single cells were processed through the Chromium Single Cell 3′ Solution platform using the Chromium Single Cell 3’ Gel Bead, Chip and Library Kits (10X Genomics), as per the manufacturer’s protocol. Briefly, 7,000 cells were added to each channel, and were then partitioned into Gel Beads in emulsion in the Chromium instrument, where cell lysis and barcoded reverse transcription of RNA occurred, followed by amplification, shearing and 5′ adaptor and sample index attachment. Libraries were sequenced on an Illumina NextSeq 500.
Single-cell RNA sequencing data analysis
Reads were mapped to the GRCh38 human transcriptome using cell ranger 2.1.0, and transcript-per-million (TPM) was calculated for each gene in each filtered cell barcodes sample. TPM values were then divided by 10, since the complexity of single-cell libraries is estimated to be on the order of 100,000 transcripts. For each cell, we quantified the number of genes expressed and the proportion of the transcript counts derived from mitochondrial encoded genes. Cells with either <1,000 detected genes or >0.15 mitochondrial fraction were excluded from further analysis. Finally, the resulting expression matrix was filtered to remove genes detected in <3 cells. We focused on highly variable genes for downstream principal component analysis (PCA). For each dataset, we used the Seurat R package to detect variable genes based on fitting a relationship between the mean and the dispersion of each gene. We next scaled the data and regressed out UMI number and mitochondrial gene fraction to remove technical noise. The resulting scaled data were used as an input for PCA. Top significant PCs, estimated by a manual inspection of the PCA standard deviations elbow plots, were used to generate tSNE plots. For each dataset, we used Seurat47 (http://satijalab.org/seurat/) to identify genes that vary between samples. To detect differentially-active pathways, gene ontology (GO) enrichment analysis was performed with MSigDB27 (http://software.broadinstitute.org/gsea/msigdb) using the differentially expressed genes that passed the following thresholds: |log2FC|>0.25, Bonferroni corrected p-value<0.01, the gene was detected in >10% of the cells in each of the compared groups. Expression signatures for selected pathways were downloaded from MSigDB27. We evaluated the degree to which individual cells express a certain expression signature by using a procedure that takes into account the variability in signal-to-noise ratio, as previously reported48. To calculate pairwise cell distances, variable genes were detected, and the cell embedding matrix for the top significant PCs was used to calculate the Euclidean distance between every two cells within each sample.
Analysis of genome-wide CRISPR screens
CERES dependency scores49 were obtained from the Broad Institute Achilles website (https://portals.broadinstitute.org/achilles/datasets/18/download). Due to an unusually large difference in screen quality between MCF7 and KPL1, the subtle differences in dependency status between these lines were dominated by effects related to screen quality. To remove these uninteresting sources of variation, we corrected CERES gene scores by removing their first six principal components. These components were well-explained by experimental batch effects related to screen performance and pDNA pool. Corrected dependency scores <−0.5 were defined as dependencies. Genes listed as “pan_dependent” in the original dependency dataset were excluded from further analysis. For a more stringent overlap comparison, genes with CERES scores between −0.4 and −0.6 in MCF7 or KPL1 were further excluded. To implement the force directed layout, described in Extended Data Fig. 11b, the full corrected dependency matrix was reduced to its top 100 principle components and a k-means clustering algorithm was run repeatedly on cell lines. Here, k is the number of clusters, and the mean cluster size (number of cell lines) / k is a parameter similar to perplexity in tSNE, set to 6 for our data. Edges between cells were weighted according to the frequency with which they co-clustered, with edges appearing less than 30% of the time ignored. Cells were then laid out using the SFPD spring-block algorithm50. Cell line RNAseq gene expression data and RPPA protein expression data were obtained from the CCLE website (https://portals.broadinstitute.org/ccle/data). Single sample GSEA was calculated using the ssGSEA algorithm51.
Chymotrypsin-like activity
MCF7 cells were plated in triplicates in 96 well plates at a density of 20,000 cells per well. 24 hours later, chymotrypsin-like activity of the proteasome was assayed, using the Proteasome-Glo™ assay (Promega), according to manufactures protocol. The activity levels were normalized to the relative cell number that was measured using the fluorescent detection of resazurin dye reduction (544-nm excitation and 590-nm emission).
Western blots
For PSMC2 and PSMD2 immunoblotting, cells were lysed in HENG buffer (50mM Hepes-KOH pH 7.9, 150mM NaCl, 2mM EDTA pH 8.0, 20mM sodium molybdate, 0.5% Triton X-100, 5% glycerol), with protease inhibitor cocktail (Roche Diagnostics #11836153001). Protein concentration was determined by the BCA assay (Thermo-Fisher Scientific #23227), and proteins were resolved on SDS-PAGE for immunoblot analysis. Antibodies against the following human proteins were used: alpha-Tubulin (ab80779; Abcam), PSMC2 (MSS1–104; Enzo Life Sciences) and PSMD1 (C-7; Santa-Cruz). Visualization was performed using the ChemiDoc MP System (Bio-Rad), and ImageLab Software (Bio-Rad) was used to quantify relative band intensities. For ERα immunonblotting, cells were lysed with a mix of 4X protein loading buffer (Li-Cor 928–40004) and 10X NuPAGE sample reducing agent (Life Technologies NP0009). Protein concentration was normalized by cell counting, and proteins were resolved on SDS-PAGE. Antibodies against the following human proteins were used: beta-Actin (N-21; Santa Cruz), ERα (F-10; Santa Cruz). Visualization was performed using the Odyssey CLx imaging machine (Li-Cor), and Image Studio Software (Li-Cor) was used to quantify the relative intensities.
Generation and comparison of dendrograms
Dendrograms were constructed using Euclidean distances for continuous measures and Manhattan distances for discrete measures. Complete linkage hierarchical clustering was performed in all cases. The mutation status dendrogram was based on mutations with AF>0.05. The gene expression dendrogram was based on the 978 “landmark” genes directly measured by the L1000 assay. The copy number dendrograms were based on discrete calls (loss, normal or gain) assigned to each event based on its log2 copy number ratio, using a cutoff value of +/−0.1. The drug response dendrogram was based on normalized viability values. The cell morphology dendrogram was based on the full list of 1,784 cellular features measured. The barcode representation dendrogram was based on the log2 transformed number of reads, including only barcodes with >1,000 reads in at least one sample. To understand how dendrograms from different sources compared, the Fowlkes-Mallows index was used, as it could capture similarities in global clustering while ignoring within-group variance52. The “Bk” function in the “dendextend” R package was used for computations and visualizations. We compared dendrograms from different sources with k values ranging from 5 to 26. A background distribution was calculated by randomly shuffling the labels of the trees a 1,000 times, and calculating Bk values. The 95% upper quantile of the randomized distribution for each k was plotted. The maximum Bk value was used to estimate the degree of similarity between the compared pair of dendrograms.
Calculation of the distances between strains based on their genomic features
CNA distance based on LP-WGS was determined by the fraction of the genome affected by discordant CNA calls. CNA and SNV distances based on targeted sequencing were determined by Jaccard indices, defined as the number of shared events between strains (intersection) divided by the total number of evens in these strains (union). For SNVs, both the mutated gene and the exact amino acid change had to be identical to be counted as a shared event. Gene expression distances were defined as the Euclidean distances between L1000 expression profiles. Drug response distances were defined as the Euclidean distances between drug response profiles, after limiting the drug set to active drugs only (i.e., drugs that reduced the viability of at least one strain by >50%) and thresholding viability values to +/−100.
Comparisons across CCLE cell lines
Gene-level mRNA expression, copy number and mutation status data were downloaded from the CCLE website (https://portals.broadinstitute.org/ccle/data; “CCLE_Expression_Entrez_2012–09-29.gct”, updated October 17th 2012; CCLE_copynumber_byGene_2013–12-03.txt, May 27th 2014; CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct, updated February 29th 2016). The total number of point mutations and copy number changes were counted for each cell line. Chromosome arm-level events in CCLE samples were generated as described in Ben-David et al.53, and the number of arm-level events was counted for each cell line. The fraction of the genome affected by subclonal events was estimated using ABSOLUTE54. Combined CNA-SNV genomic instability scores were calculated as described in Zhang et al.55. The DNA repair gene set was derived from the Molecular Signature Databse (http://software.broadinstitute.org/gsea/msigdb), using the “DNA_Repair” GO signature56. The CIN70 gene set was derived from the publication by Carter et al.57. For each gene set, genes not expressed at all in the CCLE dataset were removed, and the remaining gene expression values were log2-transformed and scaled by subtracting the gene expression means. The signature score was defined as the sum of these scaled gene expression values.
Comparison of Broad (CCLE) and Sanger (GDSC) genomic features
Whole-exome sequencing data for 107 matched cell lines were downloaded from the Sanger Institute (http://cancer.sanger.ac.uk/cell_lines, EGA accession number: EGAD00001001039) for the GDSC cell lines, and from the GDC portal (https://portal.gdc.cancer.gov/legacy-archive) for the CCLE cell lines. For copy number analysis, For copy number analysis, the GATK4 somatic copy number variant pipeline was applied (https://gatkforums.broadinstitute.org/gatk/discussion/9143/how-to-call-somatic-copy-number-variants-using-gatk4-cnv)36,37. Gene-level copy number calls were generated by mapping genes from segment calls using the Consensus Coding Sequence database58. The gene-level values were log2 transformed, and converted to discrete values using pre-defined thresholds (+/−0.1, +/−0.3 and +/−0.5). To determine the % of discordance for each cell line, the number of discordant CNA calls between each pair of strains was divided by the total number of genes (excluding genes with a neutral copy number call in both data sets). For analysis of somatic variants, the CCLE/Sanger merged mutation calls were downloaded from the CCLE portal (https://portals.broadinstitute.org/ccle/data), and target interval list files were generated for each of the 107 matched cell lines in CCLE. Mutation calling was performed using MuTect38, with default parameters and “--force_output” enabled, to count the number of reads supporting the reference and alternate allele for each variant in each cell line. For analysis of germline variants, a common target interval list file consisting of a panel of 105,995 SNPs was generated, based on common SNVs found in 1,019 CCLE RNAseq samples, and Mutect was applied with the same parameters as described above. Comparison of allelic fractions was performed using the subset of variants with minimum depth of coverage of 10 in both Sanger and CCLE datasets and with minimum of allelic fraction of 0.1 in at least one dataset. Out of the 107 cell lines, one cell line (Dov13) lacked any germline concordance and was thus excluded from all analyses.
Cytogenetic analysis
Karyotyping was performed by KaryoLogic,Inc. (www.karyologic.com/) on 50 G-banded metaphase spreads per sample. Every spread displayed multiple chromosomal rearrangements with many marker chromosomes. A marker was defined as “a structurally abnormal chromosome that cannot be unambiguously identified by conventional banding cytogenetics.” The analysis was performed according to the International System for Human Cytogenetic Nomenclature (ISCN) 2016 guidelines. Rare metaphases with >100 chromosomes were excluded from further analysis.
E-karyotyping analysis
RNAseq data from non-manipulated/non-treated samples of the near-diploid human cell line RPE1 were downloaded from the NCBI SRA website (https://www.ncbi.nlm.nih.gov/sra). STAR– paired aligner was used to align paired-end samples, and STAR –non-paired aligner was used to align the non-paired samples59. The STAR to RSEM tool60 was used to generate the gene-level expression values using the gtex pipeline (https://github.com/broadinstitute/gtex-pipeline). To infer arm-level copy-number changes from gene expression profiles, the RSEM values were subjected to the e-karyotyping method61. Briefly, RSEM values were log2-converted, genes that were not expressed (log2RSEM<1) in >20% of the samples were excluded, and expression levels of the remaining genes were floored to RSEM=1. The median expression value of each gene across all samples was subtracted from the expression value of that gene, in order to obtain comparative values. The 10% most variable genes were removed from the dataset to reduce transcriptional noise. The relative gene expression data were then subjected to a CGH-PCF analysis, with a stringent set of parameters: Least allowed deviation = 0.5; Least allowed aberration size = 30; Winsorize at quantile = 0.001; Penalty = 18; Threshold = 0.01. CNAs exceeding 80% of the length of a chromosome arm were called arm-level CNAs.
Comparison of arm-level CNAs between cell line propagation and tumor progression
Recurrence of chromosome arm-level CNAs during breast cancer progression was determined by their frequency in TCGA samples, as previously described53. Recurrence of chromosome arm-level CNAs during cell line propagation was determined by comparing the arm-level calls of the strains directly separated by extensive passaging (strain D vs. strain L vs. strain AA, strain B vs. strains I/P), as shown in Extended Data Fig. 2a. Only arms that are recurrently gained or lost (but not both) in TCGA (q-value<0.05), and that have variable copy number status across the MCF7 panel, were considered for the comparison.
Statistical analysis
The significance of the difference between genomic instability associated with different sources of genetic variation, and that of the difference in chromosome number between two time points of single cell-derived clones, were determined using the two-tailed Wilcoxon rank-sum test. The significance of the difference in the euclidean distance between compounds that work through the same MoA and compounds that work through different MoA’s, that of the difference in the discordance of non-silent SNVs at different stages of transformation, that of the difference in CIN70 and wGII scores between cell lines derived from primary tumors and those derived from metastases, that of the difference between the somatic and germline SNV Pearson correlations of the Broad-Sanger cell lines, and that of the differences in the Broad-Sanger somatic SNV concordance between MSI and MSS cell lines and between primary-derived and metastasis-derived cell lines, were determined using the one-tailed Wilcoxon rank-sum test. The significance of the difference in mutation cellular prevalence across strains was determined by a Kruskal-Wallis test. The significance of the difference in AKT inhibitor IV sensitivity between PTEN-wt and PTEN-mut strains, that of the difference in the relative growth effect of ER-depletion between ESR1-loss and no-ESR1-loss strains, that of the difference in proteasome activity between bortezomib-sensitive and bortezomib-insensitive strains, that of the difference in ERα protein expression levels between strains, and that of the difference in the number of arm-level CNAs between matched early-late MCF7 strains were determined using the one-tailed Student’s t-test. The significance of the difference in doubling times, and that of the difference in sensitivity to estrogen depletion, was determined using the two-tailed Student’s t-test. The significance of the correlation between the two replicates of the primary screen was determined using Pearson’s correlation. The significance of the correlation between doubling time and the number of protein coding mutations, that of the correlation between doubling time and the fraction of subclonal mutations, that of the correlation between doubling time and drug response, were determined using a Spearman’s correlation, excluding the broadly resistant strains Q and M. The significance of the correlation between ESR1 CERES dependency scores and estrogen signaling, and that of the correlation between GATA3 CERES dependency scores and GATA3 protein expression levels were determined using a Spearman’s correlation. The deviation of the doubling time-drug response correlations from a hypothetical mean value of 0 was determined using a two-tailed one-sample t-test. The significance of the difference between the emergence and disappearance of recurrent arm-level CNAs during cell line propagation was determined using McNemar’s test. The significance of the correlation between the primary and secondary drug screens was determined using a Spearman’s correlation (including only compounds that were active in both screens). The significance of the directionality of drug-pathway association, and the likelihood that a mutation would be clonal given the number of reads that detected it, were determined using a binomial test. The significance between the fraction of pathways correctly identified between the MCF7 panel, CTD2 and GDSC, was determined using a two-tailed Fisher’s exact test. GSEA p-values and FDR-corrected q-values are shown as provided by the default analysis output. For the comparison of pathway prediction shown in Supplementary Table 32, FDR q-values were re-calculated using only the pre-selected pathways. Thresholds for significant associations were determined as: p<0.05; q<0.25. The significance of the difference in the karyotypic variation between parental and single cell-clone derived cultures was determined using the Levene’s test. The significance of differentially-expressed genes in the single-cell RNAseq data was determined by an analysis of variance (ANOVA) followed by a Games-Howell post hoc test and a Bonferroni correction. Box plots show the median, 25th and 75th percentiles, lower whiskers show data within 25th percentile −1.5 times the IQR, upper whiskers show data within 75th percentile +1.5 times the IQR, and circles show the actual data points. Statistical tests were performed using the R statistical software (http://www.r-project.org/), and the box plots and violin plots were generated using the “boxplot” and “vioplot” R packages, respectively.
Code availability
The code used to generate and/or analyze the data during the current study are publicly available, or available from the authors upon request.
Data availability
The datasets generated during and/or analyzed during the current study are available within the article, its supplementary information files, or available from the authors upon request. DNA sequencing data were deposited to SRA with the BioProject ID PRJNA398960. Single-cell RNA sequencing data were deposited to the Gene Expression Omnibus (GEO, accession number GSE114462). Source Data of all immunostaining blots are available in the online version of this paper.
URLs
The cell divergence portal is accessible at: https://cellstrainer.broadinstitute.org.
Extended Data
Extended Data Table 1: Implications of this study for the use of cell lines in cancer research.
Findings | Implications | Recommendations |
---|---|---|
Given two strains, ~20% of mutations would be observed in only one of them | There is ~10% likelihood that a mutation observed in a strain would not appear in a database of cell line genomic features | • Be cautious when using published datasets of genomic features as “lookup tables” |
Prolonged passaging introduces more variation than multiple freeze-thaw cycles | For most cell lines, freezing and thawing is likely to be associated with fewer changes than maintaining in culture | • Keep track of passage number • Use passage-matched controls • For large-scale screens, prepare multiple frozen vials for downstream analyses |
Various genomic, transcriptomic and phenotypic assays yield highly similar clustering trees | Simple and inexpensive genome-wide assays can serve as a proxy for diversification | • Use inexpensive genome-wide assays (e.g., LP-WGS) and compare to published references using Cell STRAINER: https://cellstrainer.broadinstitute.org • Exclude strains that show extreme diversification |
Genetic manipulations that are considered “neutral” can introduce genetic variation | Cell lines with fluorescent reporters, DNA barcodes or Cas9 expression are not identical to their parental cell lines | • Use efficient infection methods to reduce the bottleneck associated with antibiotic selection • Characterize manipulated strains to ensure they retain hallmark genomic features • In CRISPR screens, correct for copy number effects using the copy number landscape of the screened strain |
Genetic and transcriptomic variation may affect drug response | Inconsistencies in drug response studies may be attributed to genetic and transcriptomic variability | • Genetic and transcriptomic distances should be considered when comparing drug response data • Compare drug response data to genomic data from the same strain |
Transcriptional differences between sensitive and resistant strains can elucidate compound mechanism of action | • Use characterized isogenic-like strains to uncover associations between molecular features and drug response | |
Pre-existing heterogeneity within culture underlies cell line instability | Single cell-derived clones differ from one another genetically, transcriptionally and phenotypically | • Confirm the genomic features of single cell-derived clones • Avoid comparisons between bottlenecked cell populations, whenever possible |
Subtle differences in culture conditions can lead to changes in cell line clonal composition | • Keep culture conditions constant | |
Heterogeneity keeps emerging in culture due to ongoing genomic instability | Prolonged passaging of single cell-derived clones can lead to their diversification Cell lines with deficient maintenance of genome integrity (e.g., MSI or TP53-mutant) are more prone to genomic evolution | • Re-confirm genomic features of single cell-derived clones following prolonged passaging • Apply these recommendations more stringently to genomically unstable cell lines |
Supplementary Material
Acknowledgements
The authors thank Guo Wei, Weiqun Zhang, Chris Mader, Jennifer Roth, David Thomas, Andrew Kung, Desiree Davison, Candace Chouinard, Katherine Walsh, Adrija Navarro, Alice Berger, Douglas Wheeler, Xin Jin, Andrew Hong, Marianna Trakala, Patricia Cho, Johan Kuiken, Romain Boidot, Xiaodong Lu, Franklin Huang, Cory Johannessen, Xaiyun Wu, Stefano Santaguida, Nilay Sethi, Angelika Amon, Kornelia Polyak, Joan Brugge, Dihua Yu, Juha Klefstrom, William Hahn, Ian Dunn and Yu Mei for contributing cell lines for this study; Matthew Ducar and Samantha Drinan for assistance with the OncoPanel assay; Zach Herbert for assistance with the whole-genome sequencing; John Davis, Sarah Johnson, David Lahr, Josh Gould, Max Macaluso, Xiaodong Lu and Ted Natoli for assistance with the L1000 assay; David Feldman for assistance with the barcoding experiment. James McFarland, Juliann Shih, Coyin Oh, Andrew Cherniack and Paul Clemons for computational assistance; Victor Jones and Jennifer Gale for assistance with drug screening; Kate Hartland for assistance with Cell Painting staining and imaging; Aviv Regev, Orit Rozenblatt-Rosen, Anna Neumann, Daniella Dionne and Lan Nguyen for assistance with 10X single-cell RNAseq; Elizabeth Gonzalez for assistance with cytogenetic analysis; Lee Lichtenstein, David Benjamin, Samuel K. Lee, Valentin Ruano-Rubio and Aaron Chevalier for the GATK CNV pipeline; Zuzana Tothova, Jesse Boehm, Ofir Cohen, Cory Johannessen, Aravind Subramanian, Anne Carpenter, Itay Tirosh, Yehuda Brody, Yotam Zeira and Romanos Pistofidis for discussions. U.B.-D. is supported by a HFSP postdoctoral fellowship. This work was supported by HHMI (T.R.G.), NIH (R01 CA188228; R.B.), Gray Matters Brain Cancer Foundation (R.B.), Bridge Project (P.B. and R.B.), Broad Institue SPARC award (D.F., P.B., and R.B.), and BroadNext10 trainee grant (U.B.-D.).
Footnotes
Competing financial interests
The authors declare no competing financial interests.
References
- 1.Sharma SV, Haber DA & Settleman J Cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. Nat Rev Cancer 10, 241–53 (2010). [DOI] [PubMed] [Google Scholar]
- 2.Freedman LP, Cockburn IM & Simcoe TS The Economics of Reproducibility in Preclinical Research. PLoS Biol 13, e1002165 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Prinz F, Schlange T & Asadullah K Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov 10, 712 (2011). [DOI] [PubMed] [Google Scholar]
- 4.Barretina J et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–7 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Garnett MJ et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 483, 570–5 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Basu A et al. An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules. Cell 154, 1151–61 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Yang W et al. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res 41, D955–61 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Seashore-Ludlow B et al. Harnessing Connectivity in a Large-Scale Small-Molecule Sensitivity Dataset. Cancer Discov 5, 1210–23 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Haibe-Kains B et al. Inconsistency in large pharmacogenomic studies. Nature 504, 389–93 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Cancer Cell Line Encyclopedia, C. & Genomics of Drug Sensitivity in Cancer, C. Pharmacogenomic agreement between two cancer cell line data sets. Nature 528, 84–7 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Haverty PM et al. Reproducible pharmacogenomic profiling of cancer cell line panels. Nature 533, 333–7 (2016). [DOI] [PubMed] [Google Scholar]
- 12.Soule HD, Vazguez J, Long A, Albert S & Brennan M A human cell line from a pleural effusion derived from a breast carcinoma. J Natl Cancer Inst 51, 1409–16 (1973). [DOI] [PubMed] [Google Scholar]
- 13.Brooks SC, Locke ER & Soule HD Estrogen receptor in a human cell line (MCF-7) from breast carcinoma. J Biol Chem 248, 6251–3 (1973). [PubMed] [Google Scholar]
- 14.Lee AV, Oesterreich S & Davidson NE MCF-7 cells--changing the course of breast cancer research and care for 45 years. J Natl Cancer Inst 107(2015). [DOI] [PubMed] [Google Scholar]
- 15.Bamford S et al. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website. Br J Cancer 91, 355–8 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Subramanian A et al. A Next Generation Connectivity Map: L1000 Platform And The First 1,000,000 Profiles. bioRxiv (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Roth A et al. PyClone: statistical inference of clonal population structure in cancer. Nat Methods 11, 396–8 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Eirew P et al. Dynamics of genomic clones in breast cancer patient xenografts at single-cell resolution. Nature 518, 422–6 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Bhang HE et al. Studying clonal dynamics in response to cancer therapy using high-complexity barcoding. Nat Med 21, 440–8 (2015). [DOI] [PubMed] [Google Scholar]
- 20.Berger AH et al. High-throughput Phenotyping of Lung Cancer Somatic Mutations. Cancer Cell 30, 214–228 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Peck D et al. A method for high-throughput gene expression signature analysis. Genome Biol 7, R61 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Gupta PB et al. Stochastic state transitions give rise to phenotypic equilibrium in populations of cancer cells. Cell 146, 633–44 (2011). [DOI] [PubMed] [Google Scholar]
- 23.Lieber M, Smith B, Szakal A, Nelson-Rees W & Todaro G A continuous tumor-cell line from a human lung carcinoma with properties of type II alveolar epithelial cells. Int J Cancer 17, 62–70 (1976). [DOI] [PubMed] [Google Scholar]
- 24.Cancer Genome Atlas Research, N. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543–50 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Soule HD et al. Isolation and characterization of a spontaneously immortalized human breast epithelial cell line, MCF-10. Cancer Res 50, 6075–86 (1990). [PubMed] [Google Scholar]
- 26.Bray MA et al. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nat Protoc 11, 1757–74 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Liberzon A et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst 1, 417–425 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Janiszewska M et al. In situ single-cell analysis identifies heterogeneity for PIK3CA mutation and HER2 amplification in HER2-positive breast cancer. Nat Genet 47, 1212–9 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Venteicher AS et al. Decoupling genetics, lineages, and microenvironment in IDH-mutant gliomas by single-cell RNA-seq. Science 355(2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
Additional Methods References
- 30.Bray MA, Fraser AN, Hasaka TP & Carpenter AE Workflow and metrics for image quality control in large-scale high-content screens. J Biomol Screen 17, 266–74 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Dao D et al. CellProfiler Analyst: interactive data exploration, analysis and classification of large biological image sets. Bioinformatics 32, 3210–3212 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Adalsteinsson VA et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nat Commun 8, 1324 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Ha G et al. Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer. Genome Res 22, 1995–2007 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Sholl LM et al. Institutional implementation of clinical tumor profiling on an unselected cancer population. JCI Insight 1, e87062 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Li H & Durbin R Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–60 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.McKenna A et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297–303 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.DePristo MA et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–8 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Cibulskis K et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 31, 213–9 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.McLaren W et al. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26, 2069–70 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Olshen AB, Venkatraman ES, Lucito R & Wigler M Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–72 (2004). [DOI] [PubMed] [Google Scholar]
- 41.Abo RP et al. BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers. Nucleic Acids Res 43, e19 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Sanjana NE, Shalem O & Zhang F Improved vectors and genome-wide libraries for CRISPR screening. Nat Methods 11, 783–784 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Joung J et al. Genome-scale CRISPR-Cas9 knockout and transcriptional activation screening. Nat Protoc 12, 828–863 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Johnson WE, Li C & Rabinovic A Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–27 (2007). [DOI] [PubMed] [Google Scholar]
- 45.Rees MG et al. Correlating chemical sensitivity and basal gene expression reveals mechanism of action. Nat Chem Biol 12, 109–16 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Golub TR et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–7 (1999). [DOI] [PubMed] [Google Scholar]
- 47.Macosko EZ et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 161, 1202–1214 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Tirosh I et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–96 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Meyers RM et al. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat Genet 49, 1779–1784 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Hu Y Efficient, High-Quality Force-Directed Graph Drawing. The Mathematica Journal 10, 37–71 (2006). [Google Scholar]
- 51.Barbie DA et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature 462, 108–12 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Fowlkes EB & Mallows CL A Method for Comparing Two Hierarchical Clusterings. Journal of the American Statistical Association 78, 553–569 (1983). [Google Scholar]
- 53.Ben-David U et al. Patient-derived xenografts undergo mouse-specific tumor evolution. Nat Genet 49, 1567–1575 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Carter SL et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol 30, 413–21 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Zhang S, Yuan Y & Hao D A genomic instability score in discriminating nonequivalent outcomes of BRCA½ mutations and in predicting outcomes of ovarian cancer treated with platinum-based chemotherapy. PLoS One 9, e113169 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Subramanian A et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102, 15545–50 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Carter SL, Eklund AC, Kohane IS, Harris LN & Szallasi Z A signature of chromosomal instability inferred from gene expression profiles predicts clinical outcome in multiple human cancers. Nat Genet 38, 1043–8 (2006). [DOI] [PubMed] [Google Scholar]
- 58.Pujar S et al. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation. Nucleic Acids Res 46, D221–D228 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Dobin A et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Li B, Ruotti V, Stewart RM, Thomson JA & Dewey CN RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493–500 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Ben-David U, Mayshar Y & Benvenisty N Virtual karyotyping of pluripotent stem cells on the basis of their global gene expression profiles. Nat Protoc 8, 989–97 (2013). [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The datasets generated during and/or analyzed during the current study are available within the article, its supplementary information files, or available from the authors upon request. DNA sequencing data were deposited to SRA with the BioProject ID PRJNA398960. Single-cell RNA sequencing data were deposited to the Gene Expression Omnibus (GEO, accession number GSE114462). Source Data of all immunostaining blots are available in the online version of this paper.