SCOPE: a normalization and copy number estimation method for single-cell DNA sequencing

Rujin Wang; Dan-Yu Lin; Yuchao Jiang

doi:10.1016/j.cels.2020.03.005

Author manuscript; available in PMC: 2021 May 20.

Published in final edited form as: Cell Syst. 2020 May 20;10(5):445–452.e6. doi: 10.1016/j.cels.2020.03.005


PMCID: PMC7250054 NIHMSID: NIHMS1581415 PMID: 32437686

SUMMARY

Whole genome single-cell DNA sequencing (scDNA-seq) enables characterization of copy number profiles at the cellular level. We propose SCOPE, a normalization and copy number estimation method for the noisy scDNA-seq data. SCOPE’s main features include: (i) a Poisson latent factor model for normalization, which borrows information across cells and regions to estimate bias, using in silico identified negative control cells; (ii) an expectation-maximization algorithm embedded in the normalization step, which accounts for the aberrant copy number changes and allows direct ploidy estimation without the need for post hoc adjustment; and (iii) a cross-sample segmentation procedure to identify breakpoints that are shared across cells with the same genetic background. We evaluate SCOPE on a diverse set of scDNA-seq data in cancer genomics and show that SCOPE offers accurate copy number estimates and successfully reconstructs subclonal structure. A record of this paper’s Transparent Peer Review process is included in the Supplemental Information.

Keywords: single-cell DNA sequencing, copy number variation, normalization, tumor heterogeneity, cancer genomics

eTOC Blurb

Single-cell DNA sequencing holds great promise for deciphering tumor heterogeneity. The data is, however, extremely sparse and noisy due to the shallow depth of coverage and the biases and artifacts that are introduced. Here, we propose SCOPE, a statistical framework for data normalization and copy number estimation by scDNA-seq. We show that SCOPE offers clean normalization and accurate copy number estimation, which lead to successful reconstruction of cancer subclonal structures.


INTRODUCTION

Copy number variations (CNVs) are an abundant source of variations (Sudmant et al., 2015) and have been associated with diseases (McCarroll and Altshuler, 2007). In cancer, somatic CNVs, also referred to as copy number aberrations (CNAs), are prevailing and have been associated with cancer progression and metastases (Jiang et al., 2016; Urrutia et al., 2018). Recent advances in next-generation sequencing enable genome-wide CNV detection in a high-throughput manner. Yet profiling of somatic CNVs in cancer remains challenging due to the heterogeneous nature of the tumor samples—the observed copy number signals from bulk DNA sequencing (DNA-seq) are averaged across multiple genotypically distinct cancer subpopulations (Chen et al., 2017). Tumor purity further dampens the CNV signals, and overall tumor ploidy leads to genome-wide gains or losses, all of which make statistical estimation and inference less tractable (Carter et al., 2012).

Whole genome single-cell DNA-seq (scDNA-seq) enables the characterization of cellular copy number profiles without the cell subpopulation confounding. This circumvents the averaging effects associated with bulk-tissue DNA-seq and decreases ambiguity in elucidating cancer evolutionary history (see Table S1 for a summary of existing studies and Table S2 for details of the existing whole-genome amplification methods (Baslan et al., 2012; Dean et al., 2002; Zong et al., 2012)). ScDNA-seq data across all existing platforms and technologies is, however, sparse and noisy. The extremely shallow and highly non-uniform depth of coverage, which is caused by nonlinear amplification and dropout events during the library preparation and sequencing step (Liu et al., 2017; Navin, 2014), makes detecting CNVs by scDNA-seq challenging.

In addition, cancer genomes undergo large chromosome or chromosome-arm level deletions or duplications, as well as changes in cellular ploidy, both of which lead to recurrent and frequent CNAs that disrupt a large proportion of the genome (Shlien and Malkin, 2009). Existing methods for scDNA-seq, including Ginkgo (Garvin et al., 2015) and SCNV (Wang et al., 2018), assume that all reads are from the diploid regions, and both control for the GC biases by modeling the unimodal relationship between GC content and (log-scaled) read counts. This ignores the multiplicative effects contributed by the different sequential and discretized copy-number states. Furthermore, because depth of coverage within each cell is normalized against its own coverage baseline, each cell has normalized values with mean zero, which completely masks ploidy. To solve this issue, both Ginkgo and SCNV adopt a post hoc ploidy estimation procedure to scale the normalized results by the estimated ploidy to reflect the absolute copy numbers. However, the copy numbers and the ploidy, the latter of which is the genome-wide average of the copy numbers, are interrelated: the absolute copy numbers depend on the true underlying ploidy, yet the two-step approach estimates ploidy based on the normalization results.

In the existing methods mentioned above, the recurrent CNV signals, as well as the complicating factors such as ploidy, can either be accidentally removed during the normalization step, requiring that they be recovered in a second step, or can bias the correction for known biases. HMMcopy, which was initially developed for bulk whole-genome sequencing data (Ha et al., 2012), was recently applied to scDNA-seq data with adaptations (Laks et al., 2018). Instead of applying the default LOESS regression to reduce GC content bias, it adopts a modal regression algorithm that normalizes bin counts to integer values, as expected of single-cell profiles. However, as we show later through benchmark analysis, the enhanced version of HMMcopy suffers from low stability—a finding that concords with another recent benchmark report (Fan et al., 2019).

After proper data normalization, segmentation is performed to return regions with homogeneous copy number profiles. Existing methods adopt segmentation procedures based on either circular binary segmentation (CBS) or hidden Markov models (HMMs), and they segment each cell separately with or without a composite control (Garvin et al., 2015; Wang et al., 2018). HMMcopy (Laks et al., 2018) adapts the HMM-based segmentation algorithm to the single-cell setting, with a penalty term for non-integer copy numbers. All of these methods, however, lack the ability to construct CNV profiles by integrating shared cellular breakpoints across samples. This is extremely important in single-cell cancer genomics, where multiple cells from the same subclone share the same breakpoints.

To meet the widespread demand for CNV detection with single-cell resolution, we propose a statistical and computational framework, SCOPE, for Single-cell COPy number Estimation. The distinguishing features of SCOPE include: (i) utilization of cell-specific Gini coefficients for quality control and for identification of normal/diploid cells, which are then used as negative control samples in a Poisson latent factor model for read depth normalization; (ii) modeling of GC content bias using an expectation-maximization (EM) algorithm embedded in the Poisson regression models to account for the discretized copy number states along the genome; and (iii) a cross-sample iterative segmentation procedure to identify breakpoints that are shared across cells with the same genetic background.

We evaluate the performance of SCOPE on real scDNA-seq datasets from several cancer genomic studies. Compared to existing methods, SCOPE is shown to more accurately estimate subclonal copy number aberrations and to have higher correlation with array-based copy number profiles of purified bulk samples. We show that the copy number profiles by scDNA-seq are also well recapitulated, although at low resolution, by whole-exome sequencing (WES) and single-cell RNA-sequencing (scRNA-seq). We finally demonstrate SCOPE on the recently released scDNA-seq data that was produced using the 10X Genomics single-cell CNV pipeline, showing that it can reliably recover 1% of the cancer cell spike-ins from a background of normal cells and successfully reconstruct cancer subclonal structure across 10,000 breast cancer cells. SCOPE is compiled as an open-source Bioconductor R package available at https://bioconductor.org/packages/SCOPE/.

RESULTS

An overview of the SCOPE workflow is shown in Figure S1. SCOPE takes as input the mapped reads from assembled sequencing files, which are pre-processed using the same bioinformatic pipeline. SCOPE then generates consecutive bins along the genome and computes the cell-by-bin read depth matrix, as well as the mappability and GC content for each bin. See STAR Methods for details. For data normalization, SCOPE adopts a Poisson latent factor model with an embedded EM algorithm to capture both cell- and bin-specific biases; for segmentation, SCOPE incorporates a cross-sample iterative procedure, enabling shared breakpoints across cells.

For illustration, Figure 1 shows fitting of GC content bias in two ways: (i) assuming all bins are from the null region and fitting a non-parametric function using read depths across all bins—the method that is used by other existing methods, and (ii) adopting an EM algorithm with the missing data being the carrier status for each bin, which is a simplified implementation of SCOPE. Three cells are chosen as examples, one diploid, one hypodiploid, and one hyperdiploid. For the diploid cell, the fitting by SCOPE is the same as that by the all-null fitting, as expected. For hypodiploid and hyperdiploid cells, however, sequential multiplicative increments in the GC content biases are observed due to prominent copy number changes along the genome. The “all-null” fit is biased by the global genomics structures due to deletions and duplications, but with the embedded EM algorithm, SCOPE is able to correctly estimate the GC bias term and automatically returns the estimated copy numbers and ploidy, completely off the shelf.

Analysis of scDNA-seq Data of Breast Cancer Patients with aCGH for Validation

We first demonstrated SCOPE on a scDNA-seq dataset of 200 flow-sorted single cells from two breast cancer patients, T10 and T16 (Navin et al., 2011). Fluorescence activated cell sorting (FACS) of the single cells showed different distributions of ploidy: hyperdiploid, hypodiploid, and diploid. In patient T10, FACS suggested three distinct clonal subpopulations, indicating a polygenomic tumor. In patient T16, a relatively homogeneous cancer cell population was identified in both primary tumor and metastasis, indicating a monogenomic tumor. The histograms in Figure 2A show the normalization results for scDNA-seq data of the polygenomic tumor T10 across four methods: Ginkgo (Garvin et al., 2015), Ginkgo with post hoc adjustment by the estimated ploidy, HMMcopy (Laks et al., 2018), and SCOPE. While Ginkgo’s post hoc ploidy adjustment procedure with cell-specific scaling achieves better separation of the CNV signals, SCOPE returns copy number states that are centered at the expected integer copy-number values and does so completely off the shelf. Compared to HMMcopy, SCOPE also achieves higher signal-to-noise ratio.

Figure 2B and Figure S2 give heatmaps of the estimated copy numbers across all cells. For T10, SCOPE identified two subpopulations of hyperdiploid cancer cells, one subpopulation of hypodiploid cancer cells, and a normal cell subpopulation, which is consistent with the previous report. For T16, SCOPE returned two cancer cell subclones, one from the primary tumor and the other from the metastasis. Upon careful inspection, we find the copy number profiles of the two hyperdiploid subpopulations highly similar, indicating that the same subclone from the primary tumor led to relapse.

To further assess the performance of SCOPE and to benchmark against existing methods, we adopted CNV calls from aCGH of purified bulk samples (Navin et al., 2010) from the same patient as gold standards. Notably, array-based intensity measurements were normalized against a sample-specific baseline, producing copy number signals that have a mean of one for all three subpopulations. Therefore, the relative copy numbers (i.e., absolute copy numbers divided by overall ploidy) were used by SCOPE, Ginkgo, and HMMcopy for comparison against aCGH calls (Figure S3). Using Spearman correlation and root mean squared error (RMSE) as performance metrics, SCOPE was shown to outperform the other two methods by the aCGH gold standards (Figure 2C). If we instead account for ploidy and directly evaluate the discretized copy number estimates, the improvement of SCOPE over existing methods is clearer, as shown in Figure 2A. Additionally, we performed benchmark analysis to assess breakpoint detection accuracy by using CBS-segmented aCGH signals as gold standard. SCOPE’s cross-sample segmentation procedure achieves the highest joint F-measures of the precision and recall rates (Figure S4).
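The two agreement metrics named above can be sketched with the Python standard library. This is an illustration of the standard definitions (Spearman correlation as the Pearson correlation of average ranks, and RMSE), not the benchmark code used in the study; all function names and the toy vectors are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Toy relative copy numbers (hypothetical): single-cell estimate vs. aCGH call.
estimated = [0.50, 1.00, 1.00, 1.50, 2.00]
acgh = [0.60, 0.95, 1.05, 1.40, 1.80]
rho, err = spearman(estimated, acgh), rmse(estimated, acgh)
```

Because the rank-based correlation ignores scale, RMSE is the metric that penalizes a systematic over- or under-estimate of the relative copy numbers.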

Analysis of scDNA-seq Data of Triple-Negative Breast Cancer Patients with Paired WES and scRNA-seq

We further applied SCOPE to scDNA-seq data of temporally separated tumor resections from triple negative breast cancer patients (Kim et al., 2018). We applied SCOPE to three “clonal extinction” patients, where tumor cells were previously reported to exist only in the pre-treatment samples (Table S1). In patient KTN302, where 92 cells were sequenced at two timepoints, pre- and mid-treatment, SCOPE successfully detected two subclones of aneuploid cells in the pre-treatment tumors and found that all the cells from the mid-treatment group were normal (Figure 3A). A large number of CNAs are shared between the two inferred clones; this is likely due to a punctuated copy number evolution (Gao et al., 2016), which produces the majority of CNAs in the early stages of tumor evolution. Results for patients KTN126 and KTN129 are included in Figure S5.

Figure 3. — (A) Inferred copy-number profiles by SCOPE with cells clustered by hierarchical clustering (N for normal cell, A, B, and C for cancer subclones). Lower two panels show copy number profiles inferred by WES and scRNA-seq. (B) Orthogonal validations by WES and scRNA-seq. The estimated relative copy numbers by bulk WES and scRNA-seq are higher in amplified regions and lower in deleted regions, compared to the null regions. (C) For scRNA-seq and bulk WES, the relative copy numbers are estimated comparing to a sample-specific baseline and have mean one. This masks and over/under normalizes ploidy in hyper/hypodiploid cells.

To validate the copy number profiles returned by SCOPE, we used bulk-tissue WES and scRNA-seq data, performed on the matched normal (blood), pre-treatment, mid-treatment, and post-treatment bulk samples (Table S1). For bulk WES, seqCBS (Shen and Zhang, 2012) was used for the paired tumor-normal setting, which returns relative copy numbers (i.e., the ratios of cancer cell copy number to normal cell copy number). For scRNA-seq, we used InferCNV (Patel et al., 2014) and employed a sliding-window approach of 50 genes (Tirosh et al., 2016) to infer CNVs, using as input the transcript per million matrix returned by SALMON (Patro et al., 2017). To quantitatively assay the quality of the call set produced by SCOPE, we plotted the relative copy number estimates by WES and scRNA-seq for the inferred amplified, deleted, and copy-number-neutral regions by SCOPE. Although the copy number signals from WES and scRNA-seq are of low resolution and are attenuated due to normal cell contaminations, the relative copy numbers (i.e., absolute copy numbers divided by ploidy) from both WES and scRNA-seq are less than one for deletions, but greater than one for amplifications (Figure 3B).

While we demonstrated that the deletion and duplication signals by SCOPE using scDNA-seq can be recapitulated from two orthogonal platforms, we also found that, in the WES and scRNA-seq data, the relative copy numbers in the null regions are less than one and that the signals for duplications are dampened towards the null. This observation indicates a potential pitfall associated with profiling CNVs by bulk DNA-seq and scRNA-seq: all copy number events are defined in reference to the population average, resulting in a genome-wide mean of one for the relative copy number estimates (Figure 3C). This masks the ploidy and further results in over-normalization of ploidy in hyperdiploid samples and under-normalization of ploidy in hypodiploid samples. SCOPE solves this problem by directly estimating discretized copy numbers of integer values and is free of whole-genome bias due to ploidy.
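A small worked example makes the masking effect concrete. The sketch below assumes a hypothetical hyperdiploid cell with equal-sized bins, so that ploidy is simply the mean absolute copy number; the profile itself is invented for illustration.

```python
from statistics import mean

def relative_copy_numbers(absolute_cn):
    """Return (ploidy, relative copy numbers), where relative = absolute / ploidy."""
    ploidy = mean(absolute_cn)  # genome-wide average, assuming equal-sized bins
    return ploidy, [cn / ploidy for cn in absolute_cn]

# Hypothetical hyperdiploid profile: four null bins, two at copy number 3,
# four at copy number 4.
absolute_cn = [2, 2, 2, 2, 3, 3, 4, 4, 4, 4]
ploidy, relative = relative_copy_numbers(absolute_cn)

# ploidy is 3, so the null (copy number 2) bins have relative value 2/3 < 1,
# the copy number 3 duplications sit exactly at 1, and only the copy number 4
# bins appear amplified (4/3).
```

In this toy cell the relative scale makes the true duplication to three copies indistinguishable from a null region, which is the over-normalization described above; integer-valued absolute copy numbers do not have this ambiguity.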

Additionally, we adopted MuTect2 (Cibulskis et al., 2013) to profile somatic point mutations by bulk WES, followed by functional annotations by dbSNP (Sherry et al., 2001), COSMIC (Forbes et al., 2017), and Annovar (Wang et al., 2010; Yang and Wang, 2015) and stringent quality control procedures. PyClone (Roth et al., 2014) was further adopted, taking as input the inferred mutant allele frequencies by MuTect2 (Cibulskis et al., 2013), the total copy number estimates by seqCBS (Shen and Zhang, 2012), and the tumor purity estimates by ThetA (Oesper et al., 2013). The mutant cell frequencies returned by PyClone using bulk WES suggested a single clone across all three patients (Figure S6), while two to three subclones were identified by scDNA-seq (Figure 3A, Figure S5).

Analysis of scDNA-seq Data of Gastric Cancer Spike-ins and Breast Cancer Dissections from the 10X Genomics

We finally demonstrated SCOPE on three recently released scDNA-seq datasets from the 10X Genomics Single-Cell CNV Solution pipeline. As a proof of concept, we started by applying SCOPE to two spike-in datasets, where 1% and 10% MKN-45 gastric cancer cell lines are mixed with ~500 and ~1000 normal BJ fibroblast cells, respectively (Table S1). Despite the higher sparsity and lower sequencing depth of this data, SCOPE successfully identified the cancer cell cluster from the cluster of normal diploid cells (Figure 4A). Specifically, 11 and 34 cancer cells (shown in red in Figure 4A) were identified from the 1% and 10% spike-in datasets respectively, with estimated proportions as 1% and 7%. Notably, while not all diploid cells are used as negative controls by a Gini threshold of 0.12, there is a perfect separation between normal cells and cancer cells (see “Performance assessment via spike-in studies and with varying parameters” for details on setting the threshold of Gini coefficients).
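The Gini-based selection of negative controls can be sketched as follows. This uses the standard Gini coefficient of a cell's per-bin read depths; whether SCOPE applies it to raw or pre-processed depths is not stated here, and the rule "at or below the threshold means normal" is an assumption of this sketch. The function names are hypothetical; the default of 0.12 is the threshold quoted above.

```python
def gini(counts):
    """Standard Gini coefficient of non-negative values (0 means perfectly even)."""
    x = sorted(counts)
    n, total = len(x), sum(x)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(x, start=1))
    return weighted / (n * total)

def negative_controls(read_depth_by_cell, threshold=0.12):
    """Indices of cells whose Gini coefficient is at or below the threshold.

    Assumption: evenly covered cells (low Gini) are treated as normal/diploid.
    """
    return [j for j, counts in enumerate(read_depth_by_cell)
            if gini(counts) <= threshold]

# Two hypothetical cells: one with even coverage, one with uneven coverage.
cells = [[10, 11, 9, 10], [0, 0, 30, 10]]
controls = negative_controls(cells)
```

A stricter threshold selects fewer but cleaner controls, which is consistent with the observation that not every diploid cell needs to be used.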

Figure 4. — (A) 1% and 10% of gastric cancer cell lines were mixed with normal fibroblast cell lines and sequenced by 10X Genomics. Visualization by t-SNE projections of normalized scDNA-seq data demonstrates that SCOPE successfully identified the cancer cell population from the background of normal cells. (B) *In silico* spike-in studies were conducted with different copy number states added. SCOPE outperforms the other methods with the highest joint F-measures of precision and recall rates.

We further applied SCOPE to a 10X Genomics scDNA-seq dataset of ~10,000 cells from five adjacent tumor dissections of a frozen triple negative ductal carcinoma. SCOPE was separately applied to each dissection for identification of normal cells, read depth normalization, and copy number estimation. The proportions of normal cells vary across the five dissections, with mean cellular ploidy of sections A to E being 2.08, 2.87, 3.00, 3.01, and 3.20, respectively. This indicates a gradient of normal cells contaminating the tumor (Figure S7). We further integrated the inferred copy number profiles by SCOPE across all tumor cells from the five dissections and demonstrated that SCOPE was able to identify subclonal structures (Figure S8). For example, the distinct duplication events on chromosome 3 and chromosome 4 are mutually exclusive and mark a split in the tumor evolutionary history. Hierarchical clustering based on the normalization result suggests that the different cancer subclones consist of cancer cells from all sections. An early branching evolutionary model is plausible, where copy number aberrations happen at a potentially early stage in the disease advancement (Sottoriva et al., 2015).

Performance Assessment via Spike-in Studies with Varying Parameters

To further assess the performance of SCOPE and to benchmark against existing methods, we conducted in silico spike-in studies. We started with the read depth data of diploid cells from breast cancer patient T10 (Navin et al., 2011) and applied stringent filtering steps to remove any regions harboring potential CNVs, resulting in 39 normal cells and 1,545 genomic regions that are CNV-free. We then added multiple CNV signals to the background count matrix under the null. These signals had different copy number states but shared changepoints across cells. To generate these CNV signals, we scaled the raw depth of coverage spanned by the CNV from y to y × c/2, where c is sampled from a normal distribution, with mean equal to the underlying copy number and standard deviation 0.1. We ran simulations under three designs: (i) spike-ins with copy number states 1, 2, and 3; (ii) spike-ins with copy number states 2, 3, and 4; and (iii) spike-ins with copy number states 1, 2, and 4. The different copy number states had varying genome-wide incidence rates and each simulation was repeated 20 times. Overall, when compared to Ginkgo and HMMcopy, SCOPE returned the highest precision rates, recall rates, joint F-measures (i.e., geometric means of precision and recall), Matthews correlation coefficients, and Kappa statistics under all simulation settings (Figure 4B, Table S3). In several cases, HMMcopy returned copy number estimates that were inflated across the entire genome (Table S3A), and in concordance with a recent benchmark report (Fan et al., 2019), HMMcopy could not correctly predict the absolute copy numbers in the absence of intermediate copy numbers (Figure 4B, Table S3C).
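The spike-in scaling rule, y to y × c/2 with c drawn from a normal distribution centered at the copy number, can be sketched as below. Two details are assumptions of this sketch rather than statements from the text: a single c is drawn per cell and per CNV, and the scaled depths are left unrounded. The function name and arguments are hypothetical.

```python
import random

def spike_in(depths, start, end, copy_number, sd=0.1, rng=None):
    """Scale one cell's read depths in bins [start, end) from y to y * c / 2.

    c ~ Normal(copy_number, sd); one draw per call (an assumption here).
    """
    rng = rng if rng is not None else random.Random()
    c = rng.gauss(copy_number, sd)
    return [y * c / 2 if start <= i < end else y for i, y in enumerate(depths)]

# Hypothetical null cell with flat depth 100, and a 3-copy gain in bins 2..4.
background = [100.0] * 10
with_gain = spike_in(background, 2, 5, copy_number=3, rng=random.Random(1))
```

Calling the function for every cell with the same `start` and `end` reproduces the shared changepoints of the design, while the random c gives each cell slightly different signal strength.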

To systematically investigate how performance is influenced by varying parameters (e.g., bin size, threshold of cell-specific Gini coefficients to identify normal controls, and the number of Poisson latent factors), we carried out additional evaluation and benchmark analysis on the scDNA-seq data from the breast cancer patient T10, with the aCGH calls as gold standard (Figure S9). We show that bin sizes between 200 kb and 1 Mb do not affect the performance of SCOPE, especially for chromosome-arm level CNVs (Figure S9A). Regarding the Gini coefficient threshold, SCOPE does not need to include all normal cells as negative controls: 10 to 20 normal cells suffice to achieve accurate estimation, and we provide empirical evidence for this claim in Figure S9B. In addition, we show that SCOPE’s performance is invariant to the choice of K, the number of Poisson latent factors (Figure S9C), since SCOPE uses only the normal cells to estimate the bin-specific noise terms. In summary, we show that SCOPE is robust to the choice of bin size, Gini coefficient threshold, and the number of Poisson latent factors.

DISCUSSION

Here we propose SCOPE, a statistical method to remove technical noise and improve CNV signal-to-noise ratio for scDNA-seq data. We demonstrate that the sample-specific bias correction procedure is inadequate to successfully and unbiasedly capture all noise terms. SCOPE instead adopts cross-sample normalization, which relies on multiple samples processed in the same experiment run and borrows information across both regions and samples to estimate the bias terms. This normalization strategy, based on matrix factorization, has been applied to different types of bulk omics data (Jiang et al., 2015; Risso et al., 2014) to adjust for GC content bias and other latent artifacts. Another refinement of this normalization strategy is found in RUV (Risso et al., 2014) and CODEX2 (Jiang et al., 2018), which extensively utilize negative control samples and/or negative control regions/genes to estimate latent factors and to increase signal-to-noise ratio. SCOPE utilizes cell-specific Gini coefficients to identify diploid cells as controls and further harmonizes an EM algorithm in the cross-sample Poisson latent factor model to directly estimate the integer-valued copy numbers.

While copy number profiling is important beyond cancer, scDNA-seq holds great promise for deciphering tumor heterogeneity. Due to the low depth of coverage, scDNA-seq has thus far been primarily used to profile somatic CNVs in cancer cells, and the resolutions of the detected breakpoints range from hundreds of kilobases to megabases. Therefore, SCOPE by default is designed for large CNV detection in cancer cells; for relatively short CNVs and for common germline CNVs, the sensitivity is low due to technological limitations. To further reduce the ambiguity in unmasking intra-tumor heterogeneity, future research may include force-calling somatic point mutations from scDNA-seq and integrating both copy number and point mutation profiles for reconstructing tumor evolutionary history. A few recently developed methods for CNV detection use scRNA-seq (Fan et al., 2018; Muller et al., 2018; Patel et al., 2014). While these gene-expression based approaches have been successfully applied to detect chromosome or chromosome-arm level CNAs, multimodal data alignment between scDNA-seq and scRNA-seq has the potential to increase the resolution in detecting CNAs in a larger cell population. This alignment could also shed light upon the interplay between genomic and transcriptomic variations in cancer, for it is still unclear how transcriptomic variation is modulated by genetic variation and phylogenetic evolution in cancer. The joint analysis framework would allow quantification of clonal differences in gene expression while accounting for DNA confounding.

SCOPE’s running times across all adopted datasets for both normalization and segmentation are shown in Table S4. It is recommended that normalization with different numbers of Poisson latent factors and segmentation across different chromosomes be run in parallel. Capacity is increasing in single-cell isolation and single-cell whole-genome sequencing, and there is an increasing need to profile CNVs at the single-cell level. We believe that SCOPE can be a useful tool for the genetics and genomics community, for it lays the statistical foundation that will make robust and accurate single-cell CNV profiling possible.

Key Changes Prompted by Reviewer Comments

To address one of the major concerns of the first reviewer, we adopted bulk-tissue WES data to infer the subclonal structures of the breast cancer patients that we analyzed using scDNA-seq. We extensively profiled somatic point mutations and copy number aberrations, inferred tumor purities, and estimated the number of subclones. We showed that a single clone was recovered by bulk-tissue WES, while two to three subclones were discovered by scDNA-seq. For the 10X Genomics data, bulk DNA-seq was not available and thus we aggregated all the single cells stratified by dissection to generate multi-sectional pseudo-bulk samples. We demonstrated that in such pseudo-bulk samples, the CNV signals were attenuated by the normal cell contaminations and biased by the averaging effects across cancer subclones. Based on the suggestion by the second reviewer, we adopted additional metrics including Matthews correlation coefficients and Kappa statistics in the benchmark analysis and also assessed the breakpoint detection accuracy through segmentation. We demonstrated that SCOPE outperforms the other methods. In the revision, we also reorganized and rewrote the paper for better clarity. For context, the complete Transparent Peer Review Record is included within the Supplemental Information.

STAR METHODS

LEAD CONTACT AND MATERIALS AVAILABILITY

This study did not generate new unique reagents. Further information and requests for resources should be directed to and will be fulfilled by Yuchao Jiang (yuchaoi@email.unc.edu).

METHODS DETAILS

SCOPE Model for Data Normalization

SCOPE is based on a Poisson latent factor model (Jiang et al., 2015) for count-based read depth normalization. However, it is completely standalone and is specifically adapted for the single-cell setting. To estimate CNVs by bulk DNA-seq across subjects and to estimate CNVs by scDNA-seq across cells from the same subject are two different problems – in cancer genomics, the former profiles inter-tumor heterogeneity across patients, while the latter profiles intra-tumor heterogeneity looking at single cells within a patient. SCOPE’s key innovation is its integration of both null and non-null regions in a genome wide fashion in order to produce an unbiased estimation of both GC content bias and latent factors for cell- and position-specific background correction. The Poisson mixture model for normalization allows identification of discretized and integer-valued copy numbers, as expected of single-cell profiles. Unlike the two-step approach with post hoc ploidy adjustment, SCOPE offers direct ploidy estimates based on the estimated copy numbers along the genome.

Specifically, let $Y = \{Y_{ij}; i = 1, \dots, m; j = 1, \dots, n\}$ be the raw read count matrix, where $Y_{ij}$ is the read depth for cell $j \in \{1, \dots, n\}$ and bin $i \in \{1, \dots, m\}$. SCOPE assumes that

$$Y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \quad \lambda_{ij} = N_j \beta_i f_j(GC_i) \, \alpha_{ij} \exp\left(\sum_{k=1}^{K} g_{ik} h_{jk}\right).$$

$N_j$ is a cell-specific library size factor, which can be globally estimated as the median ratio (Anders and Huber, 2010); $\beta_i$ reflects bias due to bin-specific length, capture, and amplification efficiency; $f_j(GC_i)$ is a sample-specific non-parametric function to capture the GC content bias; $g_{ik}$ and $h_{jk}$ ($1 \le k \le K$) are the $k$th bin- and cell-specific latent factors with orthogonality constraints, which ensure identifiability; and $\alpha_{ij}$ specifies the sequential multiplicative increment due to different copy number states. $\alpha_{ij}$ is discretized to fit the single-cell setting with integer-valued copy numbers and to ensure identifiability, with $T_j$ different copy number states within a specific cell $j$ ($1 \le j \le n$):

$$\alpha_{ij} = \begin{cases} 1/2 & \text{with probability } \pi_{1}^{(j)} \\ 2/2 & \text{with probability } \pi_{2}^{(j)} \\ \vdots & \vdots \\ T_j/2 & \text{with probability } \pi_{T_j}^{(j)} \end{cases},$$

where $\{\frac{1}{2}, \frac{2}{2}, \dots, \frac{T_j}{2}\}$ corresponds to the linear and discretized increments in read depths across all copy number states in cell $j$, with corresponding incidence rates $\{\pi_{1}^{(j)}, \pi_{2}^{(j)}, \dots, \pi_{T_j}^{(j)}\}$ and $\sum_{t=1}^{T_j} \pi_{t}^{(j)} = 1$. We denote $\pi_{1}^{(j)}$ as the incidence rate for heterozygous deletion, $\pi_{2}^{(j)}$ as the probability of a bin residing in a null region, and $\pi_{3}^{(j)}, \dots, \pi_{T_j}^{(j)}$ as the incidence rates for amplifications in cell $j$. The optimal number of copy number groups $T_j$ is determined by the Bayesian information criterion (BIC) within each cell separately (Figure S10). By introducing the $\alpha_{ij}$ term, we aim to unmask biases based on a mixture of Poisson distributions with observed-data likelihood:

$$\phi(Y_{ij}) = \sum_{t=1}^{T_j} \pi_{t}^{(j)} \phi_t(Y_{ij}) = \sum_{t=1}^{T_j} \pi_{t}^{(j)} \, \mathrm{Poisson}\left(Y_{ij};\; N_j \beta_i f_j(GC_i) \, \frac{t}{2} \exp\left(\sum_{k=1}^{K} g_{ik} h_{jk}\right)\right).$$
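To make the observed-data likelihood concrete, the sketch below evaluates it for a single bin of a single cell using only the Python standard library. It is a direct transcription of the formula above under simplifying assumptions: the GC bias f_j(GC_i) is passed in as an already-evaluated number, and all function and argument names are hypothetical.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability mass at count y with mean lam > 0 (via the log scale)."""
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def baseline_mean(N_j, beta_i, f_gc, g_i, h_j):
    """N_j * beta_i * f_j(GC_i) * exp(sum_k g_ik * h_jk), i.e. the mean with alpha_ij = 1."""
    return N_j * beta_i * f_gc * math.exp(sum(g * h for g, h in zip(g_i, h_j)))

def mixture_likelihood(y, base, pi):
    """sum_t pi_t * Poisson(y; base * t / 2) over copy number groups t = 1..T_j."""
    return sum(p * poisson_pmf(y, base * t / 2) for t, p in enumerate(pi, start=1))

# Hypothetical bin: baseline depth 100, no latent-factor adjustment, and three
# copy number groups (t = 1, 2, 3) with incidence rates 0.2, 0.5, 0.3.
base = baseline_mean(1.0, 100.0, 1.0, [0.0], [0.0])
lik = mixture_likelihood(140, base, [0.2, 0.5, 0.3])
```

With these inputs the three components have means 50, 100, and 150, so an observed depth of 140 draws nearly all of its likelihood from the t = 3 component.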

For each individual cell j, we adopt an EM algorithm within the iterative estimation procedure, where the missing data indicates carrier status:

$$Z_{it}^{(j)} = \begin{cases} 1 & \text{if bin } i \text{ of cell } j \text{ is from copy number group } t, \\ 0 & \text{otherwise.} \end{cases}$$

Given $Z_{i t}^{(j)}$ , we have $α_{i j} = \sum_{t = 1}^{T_{j}} (Z_{i t}^{(j)} \times t / 2)$ .

For parameter estimation, SCOPE adopts an iterative procedure, where EM is embedded as one step to estimate the GC content bias. For EM initializations, SCOPE utilizes ploidy estimates from a first-pass normalization run to ensure fast convergence and to avoid local optima. See algorithmic details under Iterative Parameter Estimation Procedure. For estimation of K (the number of latent factors), SCOPE by default uses a separate BIC, based on normalization results across all cells and regions, as the model selection metric. We demonstrate that with the EM algorithm to account for the local genomic contexts and the use of normal cells to estimate bin-specific noise terms, the procedure by SCOPE (outlined above) is robust to the choice of K (see under “Performance assessment via spike-in studies and with varying parameters” for more details).

Iterative Parameter Estimation Procedure

Initialization

Identify negative control cells using cell-specific Gini coefficients. Apply Poisson latent factor model with negative controls to obtain $\hat{λ}$ , $\hat{f} (G C)$ , and $\hat{β}$ . Let $r_{i j} = Y_{i j} \times 2 / {\hat{λ}}_{i j}$ be the estimated relative copy number, which has mean two across all cells. Denote P_j as the ploidy for cell j and $r_{i j}^{*} = Y_{i j} \times P_{j} / {\hat{λ}}_{i j}$ as the absolute copy number. We pre-estimate P_j to aid EM initialization:

{\hat{P}}_{j} = \underset{P_{j} \in [1.5, \dots, 6]}{argmin} {\sum_{i = 1}^{m} {(r_{i j}^{*} - ⌈ r_{i j}^{*} ⌉)}^{2}},

where $⌈ r_{i j}^{*} ⌉$ rounds $r_{i j}^{*}$ to the nearest integer. Given ${\hat{P}}_{j}$ , we initialize $Z_{i t}^{(j)} = 1$ if $t = ⌈ \frac{Y_{i j} \times {\hat{P}}_{j}}{{\hat{N}}_{j} {\hat{β}}_{i} {\hat{f}}_{j} (G C_{i})} ⌉$ and zero otherwise. We initialize g and h to be all zeros and $β = \hat{β}$ .
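
The ploidy pre-estimate can be read as a one-dimensional grid search. The Python sketch below is a minimal illustration under stated assumptions: the function name `estimate_ploidy`, the grid step of 0.05, and the tie-breaking rule (the smallest candidate wins) are all hypothetical choices, since the text specifies only the search range from 1.5 to 6.

```python
import math


def estimate_ploidy(y, lam, grid=None):
    """Grid search for the ploidy P_j of one cell.

    y    -- observed read counts Y_ij across bins
    lam  -- normalized expectations lambda_hat_ij across bins
    grid -- candidate ploidies; the default 1.5, 1.55, ..., 6.0 is an assumption
    """
    if grid is None:
        grid = [s / 20.0 for s in range(30, 121)]
    best_p, best_loss = None, float("inf")
    for p in grid:
        loss = 0.0
        for yi, li in zip(y, lam):
            r_star = yi * p / li                   # absolute copy number
            nearest = math.floor(r_star + 0.5)     # round to nearest integer
            loss += (r_star - nearest) ** 2
        if loss < best_loss:                       # ties keep the smaller ploidy
            best_p, best_loss = p, loss
    return best_p
```

For example, four bins with counts 30, 40, 30, 40 and a common expectation of 35 are reproduced exactly by copy numbers 3, 4, 3, 4, so the search returns a ploidy of 3.5.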

Iteration

1. Given β, g, h, and Z⁽¹⁾, …, Z⁽ⁿ⁾ from the previous iteration:
  a) M-step:
    $\hat{π}_{t}^{(j)} = \frac{1}{m} \sum_{i=1}^{m} \hat{Z}_{it}^{(j)}$ for all $t \in \{1, \dots, T_{j}\}$.

    For each cell j, fit the LOESS curve of $\frac{Y_{ij}}{N_{j} β_{i} \sum_{t=1}^{T_{j}} (\frac{t}{2} \times \hat{Z}_{it}^{(j)}) \exp(\sum_{k=1}^{K} g_{ik} h_{jk})} \sim GC_{i}$ and use the fitted value as f_j(GC_i).
  b) E-step:
    $p_{it}^{(j)} = \hat{π}_{t}^{(j)} \mathrm{pPoisson}(Y_{ij}; N_{j} β_{i} f_{j}(GC_{i}) \frac{t}{2} \exp(\sum_{k=1}^{K} g_{ik} h_{jk}))$, $\hat{Z}_{it}^{(j)} = E[Z_{it}^{(j)} | Y_{ij}, N_{j}, β_{i}, f_{j}(GC_{i}), g, h] = \frac{p_{it}^{(j)}}{\sum_{t^{*}=1}^{T_{j}} p_{it^{*}}^{(j)}}$.
  c) Repeat a) – b) till convergence.
2. Given f(GC), g, h, and Z⁽¹⁾, …, Z⁽ⁿ⁾, let J_c be the indices of negative control cells:
  $β_{i} = \underset{j \in J_{c}}{\mathrm{median}} \left(\frac{Y_{ij}}{N_{j} f_{j}(GC_{i}) \exp(\sum_{k=1}^{K} g_{ik} h_{jk})}\right)$.
3. Given β, f(GC), and Z⁽¹⁾, …, Z⁽ⁿ⁾, let h^old be the estimated h from the previous step.
  a) For each bin i, fit Poisson log-linear regression with $Y_{i J_{c}}$ as response, $\{h_{J_{c} 1}^{old}, h_{J_{c} 2}^{old}, \dots, h_{J_{c} K}^{old}\}$ as covariates, and $\log[N_{J_{c}} f_{J_{c}}(GC_{i}) β_{i}]$ as fixed offset to obtain updated estimates $\{g_{i1}, \dots, g_{iK}\}$.
  b) For each cell j, fit Poisson log-linear regression with $Y_{:j}$ as response, $\{g_{i1}, \dots, g_{iK}\}$ as covariates, and $\log[N_{j} \hat{f}_{j}(GC) \hat{β} \sum_{t=1}^{T_{j}} (\frac{t}{2} \times \hat{Z}_{:t}^{(j)})]$ as fixed offset to obtain updated estimates $\{h_{j1}^{new}, h_{j2}^{new}, \dots, h_{jK}^{new}\}$.
  c) Center each row of g × (h^new)^T and apply SVD to the row-centered matrix to obtain the K right singular vectors to update h^new.
  d) Repeat a) – c) with h^old = h^new till convergence.
Repeat steps 1 – 3 till convergence.
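
To make the embedded EM step concrete, the following Python sketch alternates the E-step and the incident-rate part of the M-step for one cell. It is a simplified illustration, not the SCOPE code: the copy-neutral mean for each bin is held fixed, so the LOESS refit of f_j(GC_i), the β update, and the latent factor regressions are all omitted, and the name `em_copy_number_states`, the uniform initialization, and the fixed iteration count are assumptions.

```python
import math


def poisson_logpmf(y, mu):
    """Log probability mass of Poisson(mu) at count y (requires mu > 0)."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def em_copy_number_states(y, lam0, T, n_iter=50):
    """Simplified EM for one cell with fixed copy-neutral means lam0[i].

    Returns (pi, z): incident rates pi_t and posterior memberships z[i][t-1].
    """
    m = len(y)
    pi = [1.0 / T] * T                      # uniform start (an assumption)
    z = [[0.0] * T for _ in range(m)]
    for _ in range(n_iter):
        # E-step: posterior probability that bin i belongs to state t.
        for i in range(m):
            logp = []
            for t in range(1, T + 1):
                if pi[t - 1] > 0:
                    logp.append(math.log(pi[t - 1])
                                + poisson_logpmf(y[i], lam0[i] * t / 2.0))
                else:
                    logp.append(float("-inf"))
            top = max(logp)
            w = [math.exp(v - top) for v in logp]
            total = sum(w)
            z[i] = [v / total for v in w]
        # M-step (incident rates only): average the memberships over bins.
        pi = [sum(z[i][t] for i in range(m)) / m for t in range(T)]
    return pi, z
```

In a fuller treatment, each pass would also refit the GC curve with the current memberships in the offset, as described in step 1 a).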

Identification of Negative Control Cells

In most, if not all, single-cell cancer genomics studies, diploid cells are inevitably picked up for sequencing from adjacent normal tissues, and they can thus serve as normal controls for read depth normalization. However, not all platforms/experiments allow or adopt flow-sorting based techniques before scDNA-seq, and cell ploidy and case-control labeling are therefore not always readily available. To solve this issue, SCOPE opts to use the scDNA-seq data to in silico identify normal cells as controls. Specifically, for each cell, we calculate its Gini coefficient as two times the area between the diagonal line and the Lorenz curve. Equivalently, the Gini coefficient of cell j can be calculated as

\mathrm{Gini}_{j} = \frac{\sum_{i=1}^{m} \sum_{k=1}^{m} | Y_{ij} - Y_{kj} |}{2 m \sum_{i=1}^{m} Y_{ij}},

which serves as a robust and scale-independent measurement of coverage uniformity. As such, cell-specific Gini coefficients can be used as good proxies for identifying cell outliers, which have extreme coverage distribution due to failed library preparation, and for indexing normal cells out of the entire cell population.
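
The Gini formula above can be computed without the quadratic double sum by sorting the bin counts first. The sketch below is illustrative; the function name `gini` is hypothetical, and the sorted-order identity noted in the comment is a standard rewriting of the pairwise absolute differences rather than something stated in the text.

```python
def gini(counts):
    """Gini coefficient of per-bin read counts for one cell."""
    m = len(counts)
    total = sum(counts)
    if m == 0 or total <= 0:
        raise ValueError("need at least one bin and a positive total count")
    xs = sorted(counts)
    # For sorted values x_(1) <= ... <= x_(m):
    #   sum_i sum_k |x_i - x_k| = 2 * sum_i (2i - m - 1) * x_(i)
    pair_sum = 2 * sum((2 * i - m - 1) * x for i, x in enumerate(xs, start=1))
    return pair_sum / (2.0 * m * total)
```

Perfectly uniform coverage gives 0, while a cell with all reads in one of four bins gives 0.75, so lower values indicate more even coverage.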

The utility of Gini coefficients is supported by empirical evidence showing that diploid, hyperdiploid, and hypodiploid cells, categorized by single-cell flow sorting, can be classified using the estimated Gini indices almost perfectly (Figure S11). This suggests that the Gini coefficient is an effective metric with which to identify negative control samples. In practice, SCOPE does not require identification of all diploid cells from the cell population; it requires only 10 to 20 cells to serve as normal controls (refer to the section “Performance assessment via spike-in studies and with varying parameters” on choosing the Gini coefficient threshold). The normal cells identified are then used in the cross-sample normalization step as normal controls to estimate bin-specific noise terms {β_i, g_i1, …, g_iK | 1 ≤ i ≤ m} that are not biased by the CNV signals. Refer to Iterative Parameter Estimation Procedure for details.

Detecting Simultaneous Changepoints Across Cells

For segmentation, SCOPE adopts a Poisson likelihood-based recursive segmentation procedure to identify changepoints/breakpoints that are shared across cells from the same subclone. Specifically, for cell j, let $Y_{sj}, \dots, Y_{tj}$ and $λ_{sj}^{0}, \dots, λ_{tj}^{0}$ be the observed and normalized read depth from a region spanning bin s to bin t, where $λ_{ij}^{0} = N_{j} β_{i} f_{j}(GC_{i}) \exp(\sum_{k=1}^{K} g_{ik} h_{jk})$ is the “control” depth of coverage under the null, i.e., the coverage we expect to see if there is no CNV. We further denote $Y_{s:t,j} = \sum_{i=s}^{t} Y_{ij}$ and $λ_{s:t,j}^{0} = \sum_{i=s}^{t} λ_{ij}^{0}$. The scan statistic for one cell is $\max_{s,t} U_{j}(s,t)$, where U_j(s,t) is calculated from three sub-splits:

U_{j} (s, t) = U_{j}^{*} (1, s - 1) + U_{j}^{*} (s, t) + U_{j}^{*} (t + 1, m)

U_{j}^{*} (s, t) = \sup_{μ_{s : t, j}} (\log (\frac{μ_{s : t, j}^{Y_{s : t, j}} \exp (- μ_{s : t, j})}{{λ_{s : t, j}^{0}}^{Y_{s : t, j}} \exp (- λ_{s : t, j}^{0})})) = Y_{s : t, j} \log (\frac{Y_{s : t, j}}{λ_{s : t, j}^{0}}) - (Y_{s : t, j} - λ_{s : t, j}^{0}) .

$U_{j}^{*}(s,t)$ is a generalized likelihood ratio test statistic, derived from the null model $Y_{s:t,j} \sim \mathrm{Poisson}(λ_{s:t,j}^{0})$ against the alternative $Y_{s:t,j} \sim \mathrm{Poisson}(μ_{s:t,j})$ (Zhang et al., 2010). For simultaneous changepoint detection across all cells, the scan statistic is $\max_{s,t} Z_{s,t}$, where

Z_{s, t} = \sum_{j = 1}^{n} U_{j} (s, t) .
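
The cross-sample scan statistic can be sketched as an exhaustive search over all candidate pairs (s, t), using prefix sums so that each segment total is obtained in constant time. This Python example is illustrative only: the names `glr`, `prefix_sums`, and `best_shared_segment` are hypothetical, indices are 0-based and inclusive, and the brute-force search is an assumption made for clarity rather than a description of how SCOPE scans the genome.

```python
import math


def glr(y_sum, lam_sum):
    """U*(s, t): Poisson generalized log-likelihood ratio for one segment."""
    if lam_sum <= 0:
        return 0.0        # an empty segment contributes nothing
    if y_sum == 0:
        return lam_sum    # limit of y*log(y/lam) - (y - lam) as y -> 0
    return y_sum * math.log(y_sum / lam_sum) - (y_sum - lam_sum)


def prefix_sums(values):
    """Cumulative sums with a leading zero, so sum(values[a:b]) = out[b] - out[a]."""
    out = [0.0]
    for v in values:
        out.append(out[-1] + v)
    return out


def best_shared_segment(Y, lam0):
    """Return (Z, s, t) maximizing Z_{s,t} = sum_j U_j(s, t).

    Y[j][i] and lam0[j][i] hold the observed and control depth for cell j, bin i.
    """
    n, m = len(Y), len(Y[0])
    cy = [prefix_sums(row) for row in Y]
    cl = [prefix_sums(row) for row in lam0]
    best = (float("-inf"), None, None)
    for s in range(m):
        for t in range(s, m):
            z = 0.0
            for j in range(n):
                # Three sub-splits: before s, from s to t, and after t.
                z += glr(cy[j][s], cl[j][s])
                z += glr(cy[j][t + 1] - cy[j][s], cl[j][t + 1] - cl[j][s])
                z += glr(cy[j][m] - cy[j][t + 1], cl[j][m] - cl[j][t + 1])
            if z > best[0]:
                best = (z, s, t)
    return best
```

Because the statistic is summed over cells, a breakpoint carried by several cells accumulates evidence even when each individual cell is too noisy to support it alone.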

SCOPE performs an iterative segmentation procedure given Z_s,t as the scan statistic, and uses a cross-sample modified BIC (mBIC)(Zhang and Siegmund, 2012) as the stopping rule:

\mathrm{mBIC}(p) = \log\left(\frac{L_{τ}}{L_{0}}\right) - \frac{P}{2} \log \frac{2 \log(L_{τ} / L_{0})}{P} - \frac{P}{2} - \log \binom{m}{p} - p(κ_{1} - κ_{2}) - \sum_{ρ=1}^{p} \log\left(\sum_{j\text{th carrier}} \hat{δ}_{ρ,j}^{2}\right) + P \log π + (np - P) \log(1 - π),

where $1 = τ_{0} < τ_{1} < τ_{2} < \dots < τ_{p} < τ_{p+1} = m$ denote the changepoints that are shared across cells; $\log(L_{τ}/L_{0})$ is the generalized log-likelihood ratio for the alternative model, which has p changepoints, against the null model, which has no changepoints; P denotes the total number of shift parameters (δ), where $δ_{ρ,j} = λ_{ρ,j}^{0} - λ_{ρ-1,j}^{0}$ for cell j with changepoint ρ (1 ≤ ρ ≤ p); and κ₁ and κ₂ are pre-defined numerical constants. When the carrier probability π is not known a priori, it is estimated empirically by

\hat{π} = P / (n p) .

The optimal number of changepoints is determined via max_p mBIC(p). The mBIC for regularizing the segmentation considers the number of carriers for each breakpoint, thereby accommodating subclonal events. For more details on the interpretation of the terms in mBIC, see Zhang and Siegmund (2012). After segmentation, SCOPE reports integer-valued copy numbers and allows direct ploidy estimation, which is calculated as the weighted average of the estimated copy number across the genome.
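
As a small worked example of the ploidy calculation, the sketch below averages segment-level integer copy numbers. The text says only that the average is weighted, so weighting each segment by its length is an assumption, and the function name `ploidy_from_segments` and the (start, end, copy number) tuple layout are hypothetical.

```python
def ploidy_from_segments(segments):
    """Length-weighted mean copy number.

    segments -- iterable of (start, end, copy_number) with end exclusive;
                weighting by segment length is an assumption of this sketch.
    """
    segments = list(segments)
    total_length = sum(end - start for start, end, _ in segments)
    if total_length <= 0:
        raise ValueError("segments must cover a positive length")
    weighted = sum((end - start) * cn for start, end, cn in segments)
    return weighted / total_length
```

A genome split evenly between copy number 2 and copy number 4 would thus be reported with a ploidy of 3.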

Inferring Subclones and Downstream Analysis

Upon completion of normalization and segmentation at a first pass, SCOPE includes the option to cluster cells based on the matrix of normalized z-scores, estimated copy numbers, or estimated changepoints, a process that returns clusters of cells from the same subclones with the same mutational profiles. Given the inferred subclones, SCOPE can identify a more complete set of normal cells as controls and further opts to perform a second round of group-wise ploidy initialization and normalization. For ploidy initialization, the joint estimation procedure improves its stability and accuracy; for normalization, the missing data $Z_{i t}^{(c)}$ and the incident rate parameters $π_{t}^{(c)}$ in the EM algorithm are shared across cells from the same subclone c to optimize the normalization step. Refer to Iterative Parameter Estimation Procedure with Shared Clonal Memberships for algorithmic details.

In addition, cells from the same subclone can be further combined together to generate pseudo-bulk samples that are presumably homogeneous. Another normalization on the in silico generated pseudo-bulk samples returns copy number profiles with higher resolution. The pseudo-bulk samples also have the potential for somatic point mutation profiling, with the sequencing sparsity averaged out – a recent report (Laks et al., 2018) demonstrated that merging cell subsets with shared copy numbers enabled inference of clone-specific single-nucleotide resolution events and clonal phylogenies.
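
Generating pseudo-bulk samples amounts to summing read counts per bin over the cells assigned to each subclone. The Python sketch below illustrates the idea; the function name `pseudo_bulk` and the input layout (one list of bin counts per cell plus a parallel list of subclone labels) are assumptions.

```python
def pseudo_bulk(counts, labels):
    """Sum per-bin read counts over cells that share a subclone label.

    counts -- counts[j][i] is the read count of cell j in bin i
    labels -- labels[j] is the inferred subclone of cell j
    Returns a dict mapping each label to its pooled per-bin counts.
    """
    pooled = {}
    for row, label in zip(counts, labels):
        if label not in pooled:
            pooled[label] = [0] * len(row)
        acc = pooled[label]
        for i, value in enumerate(row):
            acc[i] += value
    return pooled
```

The pooled profiles can then be passed through the same normalization as individual cells, with higher depth per bin.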

Iterative Parameter Estimation Procedure with Shared Clonal Memberships

Initialization

Let S_j indicate the clonal membership for cell j and S_j = c if it is from subclone c (1 ≤ c ≤ C). Apply Poisson latent factor model with the more complete set of negative controls to obtain $\hat{λ}$ , $\hat{f} (G C)$ , and $\hat{β}$ . Let $r_{i j} = Y_{i j} \times 2 / {\hat{λ}}_{i j}$ be the estimated relative copy number, which has mean two across all cells. Denote P_c as the subclone-specific ploidy and $r_{i j}^{*} = Y_{i j} \times P_{S_{j}} / {\hat{λ}}_{i j}$ as the absolute copy number. We pre-estimate P_c to aid EM initialization:

{\hat{P}}_{c} = \underset{P_{c} \in [1.5, \dots, 6]}{argmin} \sum_{\{j : S_{j} = c\}} \sum_{i = 1}^{m} {(r_{i j}^{*} - ⌈ r_{i j}^{*} ⌉)}^{2},

where $⌈ r_{i j}^{*} ⌉$ rounds $r_{i j}^{*}$ to the nearest integer. Given ${\hat{P}}_{c}$ , we initialize $Z_{i t}^{(c)} = 1$ if $t = \underset{{j : S_{j} = c}}{median} (⌈ \frac{Y_{i j} \times {\hat{P}}_{S_{j}}}{{\hat{N}}_{j} {\hat{β}}_{i} {\hat{f}}_{j} (G C_{i})} ⌉)$ and zero otherwise. We initialize g and h to be all zeros and $β = \hat{β}$ .

Iteration

1. Given β, g, h, and Z⁽¹⁾, …, Z^(C) from the previous iteration:
  a) M-step:
    $\hat{π}_{t}^{(c)} = \frac{1}{m} \sum_{i=1}^{m} \hat{Z}_{it}^{(c)}$ for all $t \in \{1, \dots, T_{c}\}$.

    For each cell j in subclone c, fit the LOESS curve of $\frac{Y_{ij}}{N_{j} β_{i} \sum_{t=1}^{T_{c}} (\frac{t}{2} \times \hat{Z}_{it}^{(c)}) \exp(\sum_{k=1}^{K} g_{ik} h_{jk})} \sim GC_{i}$ and use the fitted value as f_j(GC_i).
  b) E-step:
    $p_{it}^{(c)} = \hat{π}_{t}^{(c)} \prod_{\{j : S_{j} = c\}} \mathrm{pPoisson}(Y_{ij}; N_{j} β_{i} f_{j}(GC_{i}) \frac{t}{2} \exp(\sum_{k=1}^{K} g_{ik} h_{jk}))$, $\hat{Z}_{it}^{(c)} = E[Z_{it}^{(c)} | Y_{ij}, N_{j}, β_{i}, f_{j}(GC_{i}), g, h] = \frac{p_{it}^{(c)}}{\sum_{t^{*}=1}^{T_{c}} p_{it^{*}}^{(c)}}$.
  c) Repeat a) – b) till convergence.
2. Given f(GC), g, h, and Z⁽¹⁾, …, Z^(C), let $J_{c} = \{j : S_{j} = c_{0}\}$ be the indices of negative control cells:
  $β_{i} = \underset{j \in J_{c}}{\mathrm{median}} \left(\frac{Y_{ij}}{N_{j} f_{j}(GC_{i}) \exp(\sum_{k=1}^{K} g_{ik} h_{jk})}\right)$.
3. Given β, f(GC), and Z⁽¹⁾, …, Z^(C), let h^old be the estimated h from the previous step.
  a) For each bin i, fit Poisson log-linear regression with $Y_{i J_{c}}$ as response, $\{h_{J_{c} 1}^{old}, h_{J_{c} 2}^{old}, \dots, h_{J_{c} K}^{old}\}$ as covariates, and $\log[N_{J_{c}} f_{J_{c}}(GC_{i}) β_{i}]$ as fixed offset to obtain updated estimates $\{g_{i1}, \dots, g_{iK}\}$.
  b) For each cell j, fit Poisson log-linear regression with $Y_{:j}$ as response, $\{g_{i1}, \dots, g_{iK}\}$ as covariates, and $\log[N_{j} \hat{f}_{j}(GC) \hat{β} \sum_{t=1}^{T_{S_{j}}} (\frac{t}{2} \times \hat{Z}_{:t}^{(S_{j})})]$ as fixed offset to obtain updated estimates $\{h_{j1}^{new}, h_{j2}^{new}, \dots, h_{jK}^{new}\}$.
  c) Center each row of g × (h^new)^T and apply SVD to the row-centered matrix to obtain the K right singular vectors to update h^new.
  d) Repeat a) – c) with h^old = h^new till convergence.
Repeat steps 1 – 3 till convergence.
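
The only structural change from the per-cell procedure is that the E-step multiplies Poisson likelihoods over all cells in a subclone. The following Python sketch shows this for a single bin; it is illustrative only, the name `clone_state_posterior` is hypothetical, and the copy-neutral means are assumed to be supplied by the normalization step.

```python
import math


def poisson_logpmf(y, mu):
    """Log probability mass of Poisson(mu) at count y (requires mu > 0)."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def clone_state_posterior(y_bin, lam0_bin, pi):
    """Posterior over copy number states for one bin shared by a subclone.

    y_bin    -- read counts of this bin in each cell of subclone c
    lam0_bin -- matching copy-neutral means for those cells
    pi       -- incident rates (pi_1, ..., pi_T) of the subclone
    """
    logp = []
    for t, p in enumerate(pi, start=1):
        if p <= 0:
            logp.append(float("-inf"))
            continue
        # The product over cells becomes a sum on the log scale.
        ll = sum(poisson_logpmf(y, lam * t / 2.0)
                 for y, lam in zip(y_bin, lam0_bin))
        logp.append(math.log(p) + ll)
    top = max(logp)
    w = [math.exp(v - top) for v in logp]
    total = sum(w)
    return [v / total for v in w]
```

Pooling cells sharpens the posterior: with counts near 30 against a copy-neutral mean of 20, three cells support the three-copy state more strongly than a single cell does.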

Bioinformatic Pre-Processing and Genomic Binning

For bioinformatic pre-processing, we adopt BWA (Li and Durbin, 2010) to align reads and SAMtools (Li et al., 2009) to add read groups, remove duplicates, sort, and index BAM files. For data from 10X Genomics, reads that contain cellular barcodes from the barcode list of interest are demultiplexed using a Python script. For genomic binning, SCOPE supports user-defined genome-wide consecutive bins and by default uses a fixed bin size. To compute the depth of coverage, SCOPE removes reads that are mapped to multiple genomic locations and to “blacklist” regions, including segmental duplication regions and gaps in the reference assembly (i.e., telomere, centromere, and heterochromatin regions). This is followed by an additional quality control step to remove bins with extreme mappability.

To compute mappability for hg19, we employed the 100-mer mappability track from the ENCODE Project (Derrien et al., 2012) and computed the weighted average of the mappability scores if multiple ENCODE regions overlap with the same bin. To calculate mappability for hg38, we adopted the UCSC liftOver utility (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/). For other reference genomes, we first construct consecutive reads that are one base pair apart along the bin. The length of the reads is set to be the same as that from the sequencing platform, and the read sequences are taken from the reference genome of interest. We then find possible positions across the genome that the reads can map to, allowing for a default number of mismatches. Finally, we compute the mean of the probabilities that the overlapped reads map to the target places where they are generated and use this as the mappability of the bin.
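
The weighted average used for hg19 can be illustrated with the short Python sketch below. The text does not state the weights, so weighting each ENCODE region by the number of base pairs it shares with the bin is an assumption, as are the function name `bin_mappability` and the half-open (start, end, score) interval layout.

```python
def bin_mappability(bin_start, bin_end, regions):
    """Overlap-weighted mean mappability of one bin.

    regions -- iterable of (start, end, score) with half-open coordinates;
               weighting by overlap length is an assumption of this sketch.
    Returns None when no region overlaps the bin.
    """
    weighted, covered = 0.0, 0
    for start, end, score in regions:
        overlap = min(end, bin_end) - max(start, bin_start)
        if overlap > 0:
            weighted += overlap * score
            covered += overlap
    return weighted / covered if covered > 0 else None
```

For a bin spanning positions 0 to 100, a region with score 1.0 covering the first half and a region with score 0.5 covering the second half give a bin mappability of 0.75.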

QUANTIFICATION AND STATISTICAL ANALYSIS

Details of statistical analysis and software used in this paper are included in Methods Details.

DATA AND CODE AVAILABILITY

SCOPE is an open-source Bioconductor R package available at https://bioconductor.org/packages/SCOPE/.

Supplementary Material

NIHMS1581415-supplement-1.pdf (16 MB, pdf)

NIHMS1581415-supplement-2.docx (27.8 KB, docx)

NIHMS1581415-supplement-3.pdf (3.5 MB, pdf)

KEY RESOURCES TABLE.

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited Data
scDNA-seq of breast cancer	Navin et al. (2011)	SRA: SRA018951
aCGH of breast cancer	Navin et al. (2011)	GEO: GSE16607
scDNA-seq of TNBC	Kim et al. (2018)	SRA: SRP114962
scRNA-seq of TNBC	Kim et al. (2018)	SRA: SRP114962
Bulk WES of TNBC	Kim et al. (2018)	SRA: SRP114962
scDNA-seq of gastric cancer cell spike-ins	10X Genomics website	https://support.10xgenomics.com/single-cell-dna/datasets
scDNA-seq of 10,000 breast cancer nuclei	10X Genomics website	https://support.10xgenomics.com/single-cell-dna/datasets
Software and Algorithms
BWA	Li and Durbin (2010)	http://bio-bwa.sourceforge.net/
SAMtools	Li et al. (2009)	http://samtools.sourceforge.net/
MuTect2	Cibulskis et al. (2013)	https://software.broadinstitute.org/gatk/
ANNOVAR	Wang et al. (2010)	http://annovar.openbioinformatics.org/en/latest/
PyClone	Roth et al. (2014)	https://bitbucket.org/aroth85/pyclone/wiki/Home
seqCBS	Shen and Zhang (2012)	https://cran.r-project.org/web/packages/seqCBS/index.html
SALMON	Patro et al. (2017)	https://github.com/COMBINE-lab/salmon
InferCNV	Patel et al. (2014)	https://github.com/broadinstitute/inferCNV
CODEX2	Jiang et al. (2018)	https://github.com/yuchaojiang/CODEX2
Ginkgo	Garvin et al. (2015)	https://github.com/robertaboukhalil/ginkgo
HMMcopy	Laks et al. (2018)	https://github.com/shahcompbio/single_cell_pipeline/tree/master/single_cell/workflows/hmmcopy
SCOPE	This paper	https://bioconductor.org/packages/SCOPE/


Highlights.

SCOPE normalizes scDNA-seq data and profiles copy number variations
SCOPE accounts for the aberrant copy number changes for normalization
SCOPE estimates ploidy directly without need for post hoc adjustment
SCOPE performs cross-sample segmentation to identify shared breakpoints

ACKNOWLEDGEMENTS

This work was supported by the National Institutes of Health (NIH) grant P01 CA142538 (to DYL and YJ), R35 GM118102 (to YJ), a developmental award from the UNC Lineberger Comprehensive Cancer Center 2017T109 (to YJ), and a pilot award from the UNC Computational Medicine Program (to YJ).

Footnotes

DECLARATION OF INTERESTS

The authors declare no conflict of interest.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

REFERENCES

Anders S, and Huber W (2010). Differential expression analysis for sequence count data. Genome Biol 11, R106.
Baslan T, Kendall J, Rodgers L, Cox H, Riggs M, Stepansky A, Troge J, Ravi K, Esposito D, Lakshmi B, et al. (2012). Genome-wide copy number analysis of single cells. Nat Protoc 7, 1024–1041.
Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, et al. (2012). Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol 30, 413–421.
Chen H, Jiang Y, Maxwell KN, Nathanson KL, and Zhang N (2017). Allele-Specific Copy Number Estimation by Whole Exome Sequencing. Ann Appl Stat 11, 1169–1192.
Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, Gabriel S, Meyerson M, Lander ES, and Getz G (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 31, 213–219.
Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P, Sun Z, Zong Q, Du Y, Du J, et al. (2002). Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A 99, 5261–5266.
Derrien T, Estelle J, Marco Sola S, Knowles DG, Raineri E, Guigo R, and Ribeca P (2012). Fast computation and applications of genome mappability. PLoS One 7, e30377.
Fan J, Lee HO, Lee S, Ryu DE, Lee S, Xue C, Kim SJ, Kim K, Barkas N, Park PJ, et al. (2018). Linking transcriptional and genetic tumor heterogeneity through allele analysis of single-cell RNA-seq data. Genome Res 28, 1217–1227.
Fan X, Edrisi M, Navin N, and Nakhleh L (2019). Benchmarking Tools for Copy Number Aberration Detection from Single-cell DNA Sequencing Data. bioRxiv, 696179.
Forbes SA, Beare D, Boutselakis H, Bamford S, Bindal N, Tate J, Cole CG, Ward S, Dawson E, Ponting L, et al. (2017). COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res 45, D777–D783.
Gao R, Davis A, McDonald TO, Sei E, Shi X, Wang Y, Tsai PC, Casasent A, Waters J, Zhang H, et al. (2016). Punctuated copy number evolution and clonal stasis in triple-negative breast cancer. Nat Genet 48, 1119–1130.
Garvin T, Aboukhalil R, Kendall J, Baslan T, Atwal GS, Hicks J, Wigler M, and Schatz MC (2015). Interactive analysis and assessment of single-cell copy-number variations. Nat Methods 12, 1058–1060.
Ha G, Roth A, Lai D, Bashashati A, Ding J, Goya R, Giuliany R, Rosner J, Oloumi A, Shumansky K, et al. (2012). Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer. Genome Res 22, 1995–2007.
Jiang Y, Oldridge DA, Diskin SJ, and Zhang NR (2015). CODEX: a normalization and copy number variation detection method for whole exome sequencing. Nucleic Acids Res 43, e39.
Jiang Y, Qiu Y, Minn AJ, and Zhang NR (2016). Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing. Proc Natl Acad Sci U S A 113, E5528–5537.
Jiang Y, Wang R, Urrutia E, Anastopoulos IN, Nathanson KL, and Zhang NR (2018). CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing. Genome Biol 19, 202.
Kim C, Gao R, Sei E, Brandt R, Hartman J, Hatschek T, Crosetto N, Foukakis T, and Navin NE (2018). Chemoresistance Evolution in Triple-Negative Breast Cancer Delineated by Single-Cell Sequencing. Cell 173, 879–893 e813.
Laks E, Zahn H, Lai D, McPherson A, Steif A, Brimhall J, Biele J, Wang B, Masud T, and Grewal D (2018). Resource: Scalable whole genome sequencing of 40,000 single cells identifies stochastic aneuploidies, genome replication states and clonal repertoires. bioRxiv, 411058.
Li H, and Durbin R (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079.
Liu J, Adhav R, and Xu X (2017). Current Progresses of Single Cell DNA Sequencing in Breast Cancer Research. Int J Biol Sci 13, 949–960.
McCarroll SA, and Altshuler DM (2007). Copy-number variation and association studies of human disease. Nat Genet 39, S37–42.
Muller S, Cho A, Liu SJ, Lim DA, and Diaz A (2018). CONICS integrates scRNA-seq with DNA sequencing to map gene expression to tumor sub-clones. Bioinformatics 34, 3217–3219.
Navin N, Kendall J, Troge J, Andrews P, Rodgers L, McIndoo J, Cook K, Stepansky A, Levy D, Esposito D, et al. (2011). Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94.
Navin N, Krasnitz A, Rodgers L, Cook K, Meth J, Kendall J, Riggs M, Eberling Y, Troge J, Grubor V, et al. (2010). Inferring tumor progression from genomic heterogeneity. Genome Res 20, 68–80.
Navin NE (2014). Cancer genomics: one cell at a time. Genome Biol 15, 452.
Oesper L, Mahmoody A, and Raphael BJ (2013). THetA: inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome Biol 14, R80.
Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401.
Patro R, Duggal G, Love MI, Irizarry RA, and Kingsford C (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14, 417–419.
Risso D, Ngai J, Speed TP, and Dudoit S (2014). Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol 32, 896–902.
Roth A, Khattra J, Yap D, Wan A, Laks E, Biele J, Ha G, Aparicio S, Bouchard-Cote A, and Shah SP (2014). PyClone: statistical inference of clonal population structure in cancer. Nat Methods 11, 396–398.
Shen JJ, and Zhang NR (2012). Change-Point Model on Nonhomogeneous Poisson Processes with Application in Copy Number Profiling by Next-Generation DNA Sequencing. Annals of Applied Statistics 6, 476–496.
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, and Sirotkin K (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29, 308–311.
Shlien A, and Malkin D (2009). Copy number variations and cancer. Genome Med 1, 62.
Sottoriva A, Kang H, Ma Z, Graham TA, Salomon MP, Zhao J, Marjoram P, Siegmund K, Press MF, Shibata D, et al. (2015). A Big Bang model of human colorectal tumor growth. Nat Genet 47, 209–216.
Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, Jun G, Fritz MH, et al. (2015). An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81.
Tirosh I, Venteicher AS, Hebert C, Escalante LE, Patel AP, Yizhak K, Fisher JM, Rodman C, Mount C, Filbin MG, et al. (2016). Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature 539, 309–313.
Urrutia E, Chen H, Zhou Z, Zhang NR, and Jiang Y (2018). Integrative pipeline for profiling DNA copy number and inferring tumor phylogeny. Bioinformatics 34, 2126–2128.
Wang K, Li M, and Hakonarson H (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38, e164.
Wang X, Chen H, and Zhang NR (2018). DNA copy number profiling using single-cell sequencing. Brief Bioinform 19, 731–736.
Yang H, and Wang K (2015). Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nat Protoc 10, 1556–1566.
Zhang NR, and Siegmund DO (2012). Model Selection for High-Dimensional, Multi-Sequence Change-Point Problems. Stat Sinica 22, 1507–1538.
Zhang NR, Siegmund DO, Ji H, and Li JZ (2010). Detecting simultaneous changepoints in multiple sequences. Biometrika 97, 631–645.
Zong C, Lu S, Chapman AR, and Xie XS (2012). Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 338, 1622–1626.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHMS1581415-supplement-1.pdf^{(16MB, pdf)}

NIHMS1581415-supplement-2.docx^{(27.8KB, docx)}

NIHMS1581415-supplement-3.pdf^{(3.5MB, pdf)}

Data Availability Statement

SCOPE is an open-source Bioconductor R package available at https://bioconductor.org/packages/SCOPE/.

[R1] Anders S, and Huber W (2010). Differential expression analysis for sequence count data. Genome Biol 11, R106. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] Baslan T, Kendall J, Rodgers L, Cox H, Riggs M, Stepansky A, Troge J, Ravi K, Esposito D, Lakshmi Bv et al. (2012). Genome-wide copy number analysis of single cells. Nat Protoc 7, 1024–1041. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, et al. (2012). Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol 30, 413–421. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] Chen H, Jiang Y, Maxwell KN, Nathanson KL, and Zhang N (2017). Allele-Specific Copy Number Estimation by Whole Exome Sequencing. Ann Appl Stat 11, 1169–1192. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, Gabriel S, Meyerson M, Lander ES, and Getz G (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 31, 213–219. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P, Sun Z, Zong Q, Du Y, Du Jv et al. (2002). Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A 99, 5261–5266. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] Derrien T, Estelle J, Marco Sola S, Knowles DG, Raineri E, Guigo R, and Ribeca P (2012). Fast computation and applications of genome mappability. PLoS One 7, e30377. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] Fan J, Lee HO, Lee S, Ryu DE, Lee S, Xue C, Kim SJ, Kim K, Barkas N, Park PJ, et al. (2018). Linking transcriptional and genetic tumor heterogeneity through allele analysis of single-cell RNA-seq data. Genome Res 28, 1217–1227. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Fan X, Edrisi M, Navin N, and Nakhleh L (2019). Benchmarking Tools for Copy Number Aberration Detection from Single-cell DNA Sequencing Data. bioRxiv, 696179. [Google Scholar]

[R10] Forbes SA, Beare D, Boutselakis H, Bamford S, Bindal N, Tate J, Cole CG, Ward S, Dawson E, Ponting Lv et al. (2017). COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res 45, D777–D783. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] Gao R, Davis A, McDonald TO, Sei E, Shi X, Wang Y, Tsai PC, Casasent A, Waters J, Zhang Hv et al. (2016). Punctuated copy number evolution and clonal stasis in triple-negative breast cancer. Nat Genet 48, 1119–1130. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] Garvin T, Aboukhalil R, Kendall J, Baslan T, Atwal GS, Hicks J, Wigler M, and Schatz MC (2015). Interactive analysis and assessment of single-cell copy-number variations. Nat Methods 12, 1058–1060. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] Ha G, Roth A, Lai D, Bashashati A, Ding J, Goya R, Giuliany R, Rosner J, Oloumi A, Shumansky Kv et al. (2012). Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer. Genome Res 22, 1995–2007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] Jiang Y, Oldridge DA, Diskin SJ, and Zhang NR (2015). CODEX: a normalization and copy number variation detection method for whole exome sequencing. Nucleic Acids Res 43, e39.

[R15] Jiang Y, Qiu Y, Minn AJ, and Zhang NR (2016). Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing. Proc Natl Acad Sci U S A 113, E5528–5537.

[R16] Jiang Y, Wang R, Urrutia E, Anastopoulos IN, Nathanson KL, and Zhang NR (2018). CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing. Genome Biol 19, 202.

[R17] Kim C, Gao R, Sei E, Brandt R, Hartman J, Hatschek T, Crosetto N, Foukakis T, and Navin NE (2018). Chemoresistance Evolution in Triple-Negative Breast Cancer Delineated by Single-Cell Sequencing. Cell 173, 879–893 e813.

[R18] Laks E, Zahn H, Lai D, McPherson A, Steif A, Brimhall J, Biele J, Wang B, Masud T, and Grewal D (2018). Resource: Scalable whole genome sequencing of 40,000 single cells identifies stochastic aneuploidies, genome replication states and clonal repertoires. bioRxiv, 411058.

[R19] Li H, and Durbin R (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595.

[R20] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079.

[R21] Liu J, Adhav R, and Xu X (2017). Current Progresses of Single Cell DNA Sequencing in Breast Cancer Research. Int J Biol Sci 13, 949–960.

[R22] McCarroll SA, and Altshuler DM (2007). Copy-number variation and association studies of human disease. Nat Genet 39, S37–42.

[R23] Muller S, Cho A, Liu SJ, Lim DA, and Diaz A (2018). CONICS integrates scRNA-seq with DNA sequencing to map gene expression to tumor sub-clones. Bioinformatics 34, 3217–3219.

[R24] Navin N, Kendall J, Troge J, Andrews P, Rodgers L, McIndoo J, Cook K, Stepansky A, Levy D, Esposito D, et al. (2011). Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94.

[R25] Navin N, Krasnitz A, Rodgers L, Cook K, Meth J, Kendall J, Riggs M, Eberling Y, Troge J, Grubor V, et al. (2010). Inferring tumor progression from genomic heterogeneity. Genome Res 20, 68–80.

[R26] Navin NE (2014). Cancer genomics: one cell at a time. Genome Biol 15, 452.

[R27] Oesper L, Mahmoody A, and Raphael BJ (2013). THetA: inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome Biol 14, R80.

[R28] Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401.

[R29] Patro R, Duggal G, Love MI, Irizarry RA, and Kingsford C (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14, 417–419.

[R30] Risso D, Ngai J, Speed TP, and Dudoit S (2014). Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol 32, 896–902.

[R31] Roth A, Khattra J, Yap D, Wan A, Laks E, Biele J, Ha G, Aparicio S, Bouchard-Cote A, and Shah SP (2014). PyClone: statistical inference of clonal population structure in cancer. Nat Methods 11, 396–398.

[R32] Shen JJ, and Zhang NR (2012). Change-Point Model on Nonhomogeneous Poisson Processes with Application in Copy Number Profiling by Next-Generation DNA Sequencing. Annals of Applied Statistics 6, 476–496.

[R33] Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, and Sirotkin K (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29, 308–311.

[R34] Shlien A, and Malkin D (2009). Copy number variations and cancer. Genome Med 1, 62.

[R35] Sottoriva A, Kang H, Ma Z, Graham TA, Salomon MP, Zhao J, Marjoram P, Siegmund K, Press MF, Shibata D, et al. (2015). A Big Bang model of human colorectal tumor growth. Nat Genet 47, 209–216.

[R36] Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, Jun G, Fritz MH, et al. (2015). An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81.

[R37] Tirosh I, Venteicher AS, Hebert C, Escalante LE, Patel AP, Yizhak K, Fisher JM, Rodman C, Mount C, Filbin MG, et al. (2016). Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature 539, 309–313.

[R38] Urrutia E, Chen H, Zhou Z, Zhang NR, and Jiang Y (2018). Integrative pipeline for profiling DNA copy number and inferring tumor phylogeny. Bioinformatics 34, 2126–2128.

[R39] Wang K, Li M, and Hakonarson H (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38, e164.

[R40] Wang X, Chen H, and Zhang NR (2018). DNA copy number profiling using single-cell sequencing. Brief Bioinform 19, 731–736.

[R41] Yang H, and Wang K (2015). Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nat Protoc 10, 1556–1566.

[R42] Zhang NR, and Siegmund DO (2012). Model Selection for High-Dimensional, Multi-Sequence Change-Point Problems. Stat Sinica 22, 1507–1538.

[R43] Zhang NR, Siegmund DO, Ji H, and Li JZ (2010). Detecting simultaneous changepoints in multiple sequences. Biometrika 97, 631–645.

[R44] Zong C, Lu S, Chapman AR, and Xie XS (2012). Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 338, 1622–1626.
