Abstract
Identification of de novo copy number variations (CNVs) across the genome in single cells requires single-cell whole-genome amplification (WGA) and sequencing. Although many experimental protocols of amplification methods have been developed, all suffer from uneven distribution of read depth across the genome after sequencing of DNA amplicons, which constrains the usage of conventional CNV calling methodologies. Here, we present SCCNV, a software tool for detecting CNVs from whole genome-amplified single cells. SCCNV is a read-depth based approach with adjustment for the WGA bias. We demonstrate its performance by analyzing data obtained with most of the single-cell amplification methods that have been employed for CNV analysis, including DOP-PCR, MDA, MALBAC, and LIANTI. SCCNV is freely available at https://github.com/biosinodx/SCCNV.
Keywords: single-cell whole-genome sequencing, single-cell whole-genome amplification, amplification bias, copy number variation, software development
Introduction
Each single cell in a tissue or cell population has its own unique genome due to accumulating de novo mutations, such as single-nucleotide variations (SNVs), structural variations (SVs), copy number variations (CNVs) and aneuploidies. The frequency and spectrum of the mutations reflect the loss of genome integrity of a cell population, critically important to cancer and aging (Vijg and Dong, 2020). To detect the mutations unique to a single cell, single-cell whole-genome sequencing (SCWGS) is necessary. SCWGS requires whole-genome amplification (WGA), which is often biased, leading to uneven distribution of DNA content across the genome or differences between alleles. This essentially constrain the usage of variant callers designed for non-amplified bulk DNA. We recently developed a new software tool, SCcaller, that uses heterozygous SNPs to correct for the allelic bias hampering SNV calling (Dong et al., 2017).
CNV calling is typically based on variation of sequencing depth across the genome. However, for a single cell amplicon, variation of sequencing depth increases dramatically due to the locus-specific amplification bias (Navin et al., 2011; Zong et al., 2012; Chen et al., 2017). To solve this issue computationally, we developed SCCNV, a software tool to identify CNVs from SCWGS. SCCNV is also based on a read-depth approach: it controls not only bias during sequencing and alignment, e.g., bias associated with mappability and GC content, but also the locus-specific amplification bias. We demonstrate the performance of SCCNV using SCWGS data of multiple experimental protocols, i.e., DOP-PCR (degenerative-oligonucleotide PCR), MDA (multiple displacement amplification), MALBAC (multiple annealing and looping–based amplification cycles), and LIANTI (linear amplification via transposon insertion) (Navin et al., 2011; Gundry et al., 2012; Zong et al., 2012; Chen et al., 2017; Dong et al., 2017).
Materials and Methods
SCCNV
Our software tool for analyzing single-cell copy number variation (SCCNV) was written in Python. Its source code is freely available with a usage description and an example at Github1 under the GNU Affero General Public License v3.0. It uses SCWGS data after alignment as input (i.e., a bam file per single cell). Of note, SCCNV cannot take sequencing data of a pool of single cells (a bam file composed of thousands of single cells data), e.g., the 10× Genomics single-cell copy number data, as input.
First, SCCNV divides the genome into bins of equal size (500 kb as default), and counts the numbers of reads per bin of a cell. This step is relatively time-consuming, and we suggest users to use samtools on a high-performance computer cluster in parallel for all samples to be time-efficient (see instructions on Github). The remaining major steps of SCCNV do not require much computational resources – most modern desktop computers should work well.
SCCNV then normalizes mappability, which indicates the efficiency of the alignment to a genomic region. For a bin b of a cell, SCCNV adjusts the raw number of reads, denoted by NRraw, by dividing over the mappability M,
(1) |
where mappability M is a value ranging from 0 to 1. SCCNV uses Encode Align100mer mappability score, downloaded from the UCSC genome browser, and calculates the mappability of each bin by using their weighted average.
Then, SCCNV normalizes for GC content. For a cell, SCCNV calculates the percentile of GC content of each bin. For a bin b of the cell, its number of aligned reads after normalizing GC content, NRGC,b, is,
(2) |
where NRmap,genome is the average NRmap per bin of all bins from the cell; NRmap,b,percentile is the average NRmap per bin of bins in the same GC percentile as bin b.
After normalization for mappability and GC content, a pattern of sequencing read depth emerges that is consistent across different cells amplified using the same experimental protocol, i.e., the locus-specific amplification bias. Therefore, the bias is normalized across all cells in a particular batch and experiment. First, to make the NRGC,b comparable across cells, SCCNV converts it to a raw copy number estimate, denoted by CNraw,b for bin b of cell c, as follows,
(3) |
where NRGC,genome,c is the median NRGC,c per bin in the genome of cell c; ploidy is 2 by default. Second, the adjusted copy number is estimated as,
(4) |
where CNraw,b,–c denotes the average CNraw for bin b across all cells except cell c. Of note, with this step SCCNV aims to discover the difference between the cell c and the other cells. When analyzing CNVs of multiple tumor cells, it is not appropriate to use all tumor cells as input of SCCNV; instead, one should use one tumor cell with two or more normal diploid cells as the input.
Then SCCNV uses a sliding window approach to further minimize amplification noise. By default, a window includes 11,500-kb bins, i.e., 5.5 Mb of DNA sequence in total, with a 500-kb step size between two neighboring windows,
(5) |
SCCNV then models the distribution of CNsmoothed,b,c of all bins in autosomes of a cell c as a normal distribution N(μ, σc2). The μ = 2, and σ is estimated as,
(6) |
where CNsmoothed,30.9%,c and CNsmoothed,69.1%,c are the 30.9 and 69.1% percentiles of the CNsmoothed,b,c of all bins in the autosomes, corresponding to the μ – 0.5σ and μ + 0.5σ percentiles, respectively. Here, we did not use the observed s.d. of CNsmoothed,b,c of all the bins because the normal distribution was to estimate amplification noise, not real variation in copy number across the genome. When a cell has several large CNVs, the s.d. will be high, even if its amplification noise remains low.
Assuming equally likely priors, for a bin b and a given possible copy number k ∈ {0, 1, 2, 3, 4}, its posterior probability is,
(7) |
where x is the CNsmoothed,b,c, and fi(x) is the probability density function of a normal distribution,
(8) |
where the variance σc2 is calculated according to Eq. (5). We only used k ∈ {0, 1, 2, 3, 4} because the final copy number call was after multiple testing correction, i.e., Eq. (9) below, and we wished to minimize the number of hypotheses tested, i.e., five for a copy number of 0–4. However, this will result in an underestimation if the real copy number exceeds four. To resolve this issue, for bins with copy numbers ≥4 and ≤100, SCCNV reports the closest integer to the CNsmoothed,b,c.
SCCNV allows <1 false positive per cell. Therefore, it determines bin b as a copy number variant when,
(9) |
Sensitivity and False Positive Rate
To determine copy number, SCCNV is based on a statistical test described in equations (8) and (9) for a normal distribution and multiple testing correction separately. With a given value of coefficient of variation (CV) of CNsmoothed,b,c, sensitivity and FPR can be estimated as follows. Sensitivity equals the difference between two cumulative distribution functions (CDFs) of Eq. (8) at the upper and lower boundaries, which SCCNV provides the correct CNV call after the correction in Eq. (9). The percentage of FP out of all bins is equal to the sum of (a) CDF at the lower boundary of SCCNV providing an incorrect CN gain call; and (b) 1 – CDF at the upper boundary of SCCNV providing an incorrect CN loss call. Then FPR was estimated as the ratio of % of FP to the sum of % of FP and % of TN.
For example, under the assumption that the true copy number is 2, if SCCNV calls CN = 2 when CNsmoothed,b,c is between 1.8 and 2.2, sensitivity = CDF(x = 2.2, μ = 2) – CDF(x = 1.8, μ = 2), in which CDF is the cumulative distribution function of Eq. (8). If SCCNV calls (a) CN = 1 when CNsmoothed,b,c is between 0.8 and 1.2, and (b) CN = 3 when CNsmoothed,b,c is between 2.8 and 3.2, then%FP = CDF(x = 1,2, μ = 2) + 1 – CDF(2.8, μ = 2).
Testing Datasets and Preprocessing of Data
Four SCWGS datasets were obtained for demonstrating and validating the performance of SCCNV (Zong et al., 2012; Lodato et al., 2015; Chen et al., 2017; Dong et al., 2017). The datasets included 8.2 TB SCWGS of 63 single human fibroblasts, neurons and cells of a tumor cell line amplified using eight different protocols, i.e., DOP-PCR (Sigma), Rubicon, MALBAC, LIANTI, and MDA (including Qiagen, GE, Lodato et al’s MDA and SCMDA). Supplementary Table 1 lists all the single-cell data used in this study.
Sequence alignment was performed using BWA and GATK as follows (Li and Durbin, 2009; McKenna et al., 2010). Raw sequencing data of each sample (single cell and bulk DNA) were obtained from the SRA database and subjected to quality control using FastQC (version 0.11.4;2) and trimming using Trim Galore (version 0.4.1;3) with default parameters. Then they were aligned to the human reference genome (version hg19) using BWA MEM (version 0.7.12; option: -t number of CPUs -M reference genome fasta file) (Li and Durbin, 2009). PCR duplications were removed using picard tools (version 1.119;4). The alignments were subjected to indel realignment and basepair recalibration using GATK (version 3.5; using options, RealignerTargetCreator, IndelRealigner, BaseRecalibrator, and PrintReads) (McKenna et al., 2010). The step above was used for generating an analysis-ready bam file for other types of variants, e.g., single nucleotide variants, small insertions and deletions, and this step is optional for large CNVs or aneuploidies using SCCNV. Reads with mapQ < 30 were discarded. The number of reads per bin of each sample was calculated using samtools (version 1.3; option: bedcov) (Li et al., 2009). SCCNV (version 1.0) was used to estimate CNV of each cell.
Results
Major Steps in SCCNV
As illustrated in Figure 1, SCCNV is composed of four major steps. It first calculates sequencing depth of the genome in bins of equal size (500-kb by default). Second, it normalizes the depth based on two features of the reference genome, including mappability and GC content. These two features are usually also considered by conventional CNV callers for bulk DNA sequencing. Next, it does further normalization across single cells of a same experimental batch. This step minimizes locus-specific bias due to WGA. Finally, it smooths the data (5 Mb by default) and infers copy number of each bin. Intermediate results between any two connecting steps can be generated by SCCNV for users to monitor its performance. We provide an example about the intermediate results of a normal neuronal nucleus amplified with MDA (SRA id: SRR2141574) in Supplementary Figures 1–4.
Performance on Real Datasets
To evaluate the performance of SCCNV, we obtained four SCWGS datasets from the SRA database, which includes 8.2 TB high-depth WGS data of 63 single human fibroblasts, neurons and cells of a tumor cell line amplified using eight different protocols, i.e., DOP-PCR (Sigma), Rubicon, MALBAC, LIANTI, and MDA (including four MDA protocols, Qiagen, GE, Lodato et al’s MDA and SCMDA) (Supplementary Table 1; Zong et al., 2012; Lodato et al., 2015; Chen et al., 2017; Dong et al., 2017). The data were processed as described in the Materials and Methods.
We used the CV of sequencing depth across all genomic bins on autosomes as an indicator of performance, because it directly determines sensitivity and FPR of copy number calling step in SCCNV. We show sensitivity and FPR of the copy number calling in Figures 2A,B, respectively. As the CV decreased from 0.135 to 0.041, the sensitivity increased from 0 to 100% and the FPR decreased from 3.8 × 10–3 to 3.9 × 10–11.
For the real datasets, we calculated the CV of raw data and normalized data after each step to demonstrate the performance of normalization in SCCNV (Figure 3). Almost all raw data (CV: 0.475 ± 0.135, avg. ± s.d.) are beyond the detection threshold, i.e., CV = 0.135. Each step of normalization decreased the CV by a significant fraction: on average, mappability normalization by 5%, GC content normalization by 33% percent, across-cell normalization by 22%, and smoothing by 55%. This shows the contributions of each normalization step to performance increase in the final variant calling. After all the normalization steps, the CVs are 0.107 ± 0.076 (avg. ± s.d.), corresponding to a sensitivity of 68.6% and an FPR of 3.6 × 10–4 on average. Of note, different amplification protocols have significantly different performance when using SCCNV, likely due to differences in DNA amplification linearity among the protocols. As expected, LIANTI outperformed all the others (Chen et al., 2017). Protocols that included PCR steps, i.e., DOP-PCR, MALBAC and Rubicon, ranked in the middle. Although known as suffering from the least artifactual SNVs (Dong et al., 2017; Zhang et al., 2019), MDA-based protocols were ended last (Figure 3).
Discussion
Identification of copy number variation and aneuploidy has been one of the major areas of genomics methods development. Several statistical models have been developed for analyzing initially microarray data and later sequencing data of bulk DNA, for example, Circular Binary Segmentation (CBS), Mean Shift-Based (MSB) model, Shifting Level Model (SLM), Expectation Maximization (EM) model, and Hidden Markov Model (HMM) as discussed in Zhao et al. (2013). Based on these models, multiple computational software tools have been developed, e.g., CBS, Copynumber, CNVnator, and HMMcopy (Olshen et al., 2004; Abyzov et al., 2011; Ha et al., 2012; Nilsen et al., 2012). To call CNVs, most of the methods rely on assessing either sequencing read depth or alternate allele fraction at heterozygous SNPs across the genome of one sample, i.e., across-genome normalization. Some of the methods have been applied directly for analyzing single-cell sequencing data with specific filtering for cells with too much bias after WGA.
A few new tools for single cell data were also developed recently under the same rationale (assessing one sample at a time, or across-genome normalization), such as AneuFinder, baseqCNV, Ginkgo and SCOPE (Garvin et al., 2015; Bakker et al., 2016; Fu et al., 2019; Wang et al., 2019). SCCNV was developed based on our observation that the locus-specific amplification bias is often the same in different cells within one experimental batch and amplified using the same protocol (e.g., Supplementary Figure 3); and we showed that normalization across multiple samples (cells) significantly contributed to the increase in variant calling performance for single cells amplified using most WGA protocols (Figure 3). Following the same across-sample normalization rationale, another software tool, SCNV, was developed (Wang et al., 2018). It differs from SCCNV that SCCNV performs normalization based on empirical data directly (Eq. 4) without any assumption on its distribution. Of note, with across-sample normalization, SCCNV essentially aims to identify differences among different cells in one input batch and, therefore, it is important to input cells of interest (e.g., tumor cells) together with cells with a standard diploid genome.
Conclusion
We developed SCCNV to identify copy number variations from whole-genome amplified single cells. We demonstrated its step-wise performance using most of the recent SCWGS datasets generated with 8 different amplification protocols.
Data Availability Statement
Raw sequencing data of each sample were obtained from SRA database (SRA SRP067062, SRA SRA060929, SRA SRP102259, SRA SRP041470, and SRA SRP061939).
Author Contributions
XD, LZ, and JV conceived the study. XD and TW developed the method. XD and XH analyzed the data. XD, LZ, and JV wrote the manuscript. All authors contributed to the article and approved the submitted version.
Conflict of Interest
XD, LZ, and JV are co-founders of SingulOmics Corp. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Acknowledgments
This manuscript has been released as a Pre-Print at https://www.biorxiv.org/content/10.1101/535807v1 (Dong et al., 2019).
Funding. This work has been supported by the NIH grants P01 AG017242, P01 AG047200, P30 AG038072, K99 AG056656, U01 HL145560, and U01 ES029519, and the Paul F. Glenn Center for the Biology of Human Aging.
Code Availability Statement
The source code of SCCNV can be found in the https://github.com/biosinodx/SCCNV.
Supplementary Material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.505441/full#supplementary-material
References
- Abyzov A., Urban A. E., Snyder M., Gerstein M. (2011). CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21 974–984. 10.1101/gr.114876.110 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bakker B., Taudt A., Belderbos M. E., Porubsky D., Spierings D. C., De Jong T. V., et al. (2016). Single-cell sequencing reveals karyotype heterogeneity in murine and human malignancies. Genome Biol. 17:115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen C., Xing D., Tan L., Li H., Zhou G., Huang L., et al. (2017). Single-cell whole-genome analyses by linear amplification via transposon insertion (LIANTI). Science 356 189–194. 10.1126/science.aak9787 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dong X., Zhang L., Hao X., Wang T., Vijg J. (2019). SCCNV: a software tool for identifying copy number variation from single-cell whole-genome sequencing. bioRxiv [Preprint], 10.1101/535807 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dong X., Zhang L., Milholland B., Lee M., Maslov A. Y., Wang T., et al. (2017). Accurate identification of single-nucleotide variants in whole-genome-amplified single cells. Nat. Methods 14 491–493. 10.1038/nmeth.4227 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fu Y., Zhang F., Zhang X., Yin J., Du M., Jiang M., et al. (2019). High-throughput single-cell whole-genome amplification through centrifugal emulsification and eMDA. Commun. Biol. 2:147. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Garvin T., Aboukhalil R., Kendall J., Baslan T., Atwal G. S., Hicks J., et al. (2015). Interactive analysis and assessment of single-cell copy-number variations. Nat. Methods 12 1058–1060. 10.1038/nmeth.3578 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gundry M., Li W., Maqbool S. B., Vijg J. (2012). Direct, genome-wide assessment of DNA mutations in single cells. Nucleic Acids Res. 40 2032–2040. 10.1093/nar/gkr949 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ha G., Roth A., Lai D., Bashashati A., Ding J., Goya R., et al. (2012). Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer. Genome Res. 22 1995–2007. 10.1101/gr.137570.112 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H., Durbin R. (2009). Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics 25 1754–1760. 10.1093/bioinformatics/btp324 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., et al. (2009). The sequence alignment/Map format and SAMtools. Bioinformatics 25 2078–2079. 10.1093/bioinformatics/btp352 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lodato M. A., Woodworth M. B., Lee S., Evrony G. D., Mehta B. K., Lee E., et al. (2015). Somatic mutation in single human neurons tracks developmental and transcriptional history. Science 350 94–98. 10.1126/science.aab1785 [DOI] [PMC free article] [PubMed] [Google Scholar]
- McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., et al. (2010). The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20 1297–1303. 10.1101/gr.107524.110 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Navin N., Kendall J., Troge J., Andrews P., Rodgers L., Mcindoo J., et al. (2011). Tumour evolution inferred by single-cell sequencing. Nature 472 90–94. 10.1038/nature09807 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nilsen G., Liestol K., Van Loo P., Moen Vollan H. K., Eide M. B., Rueda O. M., et al. (2012). Copynumber: efficient algorithms for single- and multi-track copy number segmentation. BMC Genom. 13:591. 10.1186/1471-2164-13-591 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Olshen A. B., Venkatraman E. S., Lucito R., Wigler M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5 557–572. 10.1093/biostatistics/kxh008 [DOI] [PubMed] [Google Scholar]
- Vijg J., Dong X. (2020). Pathogenic mechanisms of somatic mutation and genome mosaicism in aging. Cell 182 12–23. 10.1016/j.cell.2020.06.024 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang R., Lin D.-Y., Jiang Y. (2019). SCOPE: a normalization and copy number estimation method for single-cell DNA sequencing. bioRxiv [Preprint], 10.1101/594267 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang X., Chen H., Zhang N. R. (2018). DNA copy number profiling using single-cell sequencing. Brief Bioinform. 19 731–736. 10.1093/bib/bbx004 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang L., Dong X., Lee M., Maslov A. Y., Wang T., Vijg J. (2019). Single-cell whole-genome sequencing reveals the functional landscape of somatic mutations in B lymphocytes across the human lifespan. Proc. Natl. Acad. Sci. U.S.A. 116 9014–9019. 10.1073/pnas.1902510116 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhao M., Wang Q., Wang Q., Jia P., Zhao Z. (2013). Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinform. 14(Suppl. 11):S1. 10.1186/1471-2105-14-S11-S1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zong C., Lu S., Chapman A. R., Xie X. S. (2012). Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 338 1622–1626. 10.1126/science.1229164 [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Raw sequencing data of each sample were obtained from SRA database (SRA SRP067062, SRA SRA060929, SRA SRP102259, SRA SRP041470, and SRA SRP061939).