Abstract
Copy number variation (CNV) is a form of structural variation in the human genome that has been associated with many complex diseases. In this paper we present a method to detect common copy number variations from next generation sequencing data. First, copy number variations are detected in each individual sample, which is formulated as a total variation penalized least squares problem. Second, the common copy number variations across multiple samples are discovered using a source separation technique, non-negative matrix factorization (NMF). Finally, the method is applied to population clustering. Results on real data show that two family trios with different ancestries can be clustered into two ethnic groups based on their common CNVs, demonstrating the potential of the proposed method for applications in population genetics.
I. Introduction
Next generation sequencing (NGS) technology provides a direct way to study the human genome at base-pair resolution, and has thus received widespread attention in biomedical applications in recent years. Unlike traditional technologies such as fluorescence in situ hybridization (FISH) and array comparative genomic hybridization (aCGH), NGS is a high-throughput technology that can output millions or billions of short reads from shotgun sequencing, and thus provides high-resolution mapping of genomic regions. This huge amount of data can be utilized for de novo assembly [1], single nucleotide polymorphism (SNP) calling [2], structural variation (SV) detection [3], etc.
We focus on the detection of copy number variation (CNV) [4], which covers approximately 10% of the human genome. CNV, a major form of SV, has been associated with complex diseases such as autism [5], schizophrenia [6], Alzheimer disease [7], and cancer [8]. CNVs are duplications or deletions of DNA segments larger than 1 kbp [9]. Several CNV detection methods exist [10], [11], [12], [13]; however, all of them focus on CNV detection from an individual sample, or from two samples comprising a case and a control. In this paper, we consider the detection of common CNVs, i.e., CNVs that recur within a population. These common CNVs can be used for population clustering.
First, the method presented in [14] is used to detect CNVs from the depth of coverage (DOC) of each sample. Then the non-negative matrix factorization (NMF) method [15] is employed to detect common CNVs. NMF is a source separation technique [16]. Lee and Seung [17] showed that NMF can learn the common information from multiple data sources, which motivated us to apply it to the detection of common CNVs. NMF models the data (the detected CNVs in our problem) as the product of a source matrix and a contribution (or weight) matrix, both of which are non-negative. The source matrix contains the common CNVs, while the contribution matrix contains the weights of the common CNVs in each sample. Therefore, the contribution matrix can be used to cluster population samples into different ethnic groups.
This paper is organized as follows. In Sec. II, the method to detect common CNVs is presented, based on total variation penalized least squares optimization and NMF. In Sec. III, the presented method is applied to population clustering. We processed a data set downloaded from the 1000 Genomes Project, a catalog of human genetic variation (www.1000genomes.org). The data set includes a CEU trio of European ancestry and a YRI trio of Yoruba Nigerian ancestry, which are successfully separated based on their CNVs using our proposed approach. The paper is concluded in Sec. IV.
II. Methods
A. Copy number variation detection from a single sample
Raw NGS data contain a very large number of short reads. To detect CNVs, these reads first need to be mapped to a reference genome, e.g., human build 37 (hg19). After mapping, the depth of coverage (DOC) is obtained by counting the number of mapped reads in fixed-size, non-overlapping, consecutive windows [11]. Because DOC is correlated with G-C content [18], G-C content correction of the DOC is often needed.
The data we start with are the DOC values $y_i$ ($i = 1, 2, \ldots, N$), where $N$ is the number of windows. Because shotgun sequencing samples reads randomly across genomic loci, the DOC is locally proportional to the copy number, so flat regions correspond to the same copy number. The detection of CNVs from DOC is modeled as a change-point detection problem, under the basic assumption that $y_i$ is piecewise constant, with basins and plateaus in $y_i$ corresponding to deletions and duplications, respectively. Consequently, CNV detection is formulated as the following total variation penalized least squares optimization problem:
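As an illustration of the windowed read counting and G-C correction described above, the following minimal Python sketch computes a DOC profile from read start positions and rescales it by G-C content. It is not the pipeline used in this study: the function names `depth_of_coverage` and `gc_correct`, the 0-based start positions, and the median-ratio correction (each window scaled by the overall median DOC divided by the median DOC of windows with the same G-C percentage) are assumptions made for the example.

```python
import numpy as np

def depth_of_coverage(read_starts, chrom_length, window=1000):
    """Count mapped reads in fixed-size, non-overlapping, consecutive windows.

    read_starts: 0-based start coordinates of mapped reads (assumed input).
    """
    n_windows = int(np.ceil(chrom_length / window))
    idx = np.asarray(read_starts, dtype=np.int64) // window
    return np.bincount(idx, minlength=n_windows).astype(float)

def gc_correct(doc, gc_percent):
    """Median-ratio G-C correction (one common scheme, assumed here).

    Each window is scaled by overall_median / median of windows that share
    the same rounded G-C percentage.
    """
    doc = np.asarray(doc, dtype=float)
    gc = np.round(np.asarray(gc_percent, dtype=float)).astype(int)
    overall = np.median(doc)
    corrected = doc.copy()
    for g in np.unique(gc):
        sel = gc == g
        m_g = np.median(doc[sel])
        if m_g > 0:  # leave windows untouched if their G-C bin has zero median
            corrected[sel] = doc[sel] * overall / m_g
    return corrected
```

In practice the read positions would come from the aligned BAM file and the G-C percentages from the reference sequence; both are taken as given arrays here.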
$$\min_{x}\ \frac{1}{2}\sum_{i=1}^{N}(y_i - x_i)^2 + \lambda\sum_{i=1}^{N-1}|x_{i+1} - x_i| \tag{1}$$
where $x_i$ is the denoised (smoothed) version of $y_i$ that is used to call CNVs. The first term in (1) is the fitting error, and the second term is the total variation penalty. When a change-point is present between $x_i$ and $x_{i+1}$, a penalty $|x_{i+1} - x_i|$, i.e., the absolute value of $x_{i+1} - x_i$, is imposed. $\lambda$ is the regularization parameter, which controls the tradeoff between the fitting error and the penalty incurred by change-points. A large $\lambda$ yields low variation in $x_i$, and thus a low false positive rate at the cost of a low true positive rate, and vice versa. Because of the page limit, the reader is referred to our earlier work [14] for a detailed explanation of this criterion, an efficient algorithm to solve the problem, and a strategy to select the regularization parameter $\lambda$.
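To make criterion (1) concrete, the sketch below minimizes a total variation penalized least squares objective with a projected gradient iteration on the dual problem. This is a simple solver chosen for brevity, not the efficient algorithm of [14]. The function name `tv_denoise_1d`, the factor of 1/2 on the least squares term, the step size of 0.25, and the fixed iteration count are assumptions of the example.

```python
import numpy as np

def tv_denoise_1d(y, lam, n_iter=5000):
    """Minimize 0.5 * sum((y - x)**2) + lam * sum(|x[i+1] - x[i]|) over x.

    Dual formulation: x = y - D^T z with |z_i| <= lam, where D is the
    first-difference operator, (D x)_i = x[i+1] - x[i].
    """
    y = np.asarray(y, dtype=float)
    z = np.zeros(y.size - 1)  # one dual variable per pair of adjacent windows

    def primal(z):
        # (D^T z)_i = z[i-1] - z[i] with zero padding, so y - D^T z = y + diff(padded z)
        return y + np.diff(np.concatenate(([0.0], z, [0.0])))

    for _ in range(n_iter):
        # Gradient step on the dual (step 0.25 is safe because ||D D^T|| < 4),
        # followed by projection onto the box [-lam, lam].
        z = np.clip(z + 0.25 * np.diff(primal(z)), -lam, lam)
    return primal(z)
```

For a noiseless step of 10 windows at level 0 followed by 10 windows at level 5, `tv_denoise_1d(y, lam=2.0)` shrinks the two levels toward each other by `lam / 10` each, which shows how a larger `lam` flattens the profile. The iteration converges slowly on long profiles, so it is suitable for illustration only.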
B. Common copy number variation call
Common CNV detection is considered in the context of source separation, which aims to extract individual sources from their linear or nonlinear mixtures. The most suitable model for our problem is the instantaneous mixture [16], which models the data as a weighted sum of sources:
$$x_m = \sum_{j=1}^{J} w_{jm}\, s_j, \quad m = 1, 2, \ldots, M \tag{2}$$
where $w_{jm}$ denotes the contribution of the $j$-th source $s_j$ to the $m$-th mixture $x_m$. This can be written in matrix form as:
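$$X = SW \tag{3}$$
where $X = [x_1, \ldots, x_M] \in \mathbb{R}^{N \times M}$ contains the $M$ mixture data; $S = [s_1, \ldots, s_J] \in \mathbb{R}^{N \times J}$ contains the $J$ sources; and $W = [w_{jm}] \in \mathbb{R}^{J \times M}$ is the contribution matrix.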
Suppose a population contains $M$ samples $X$ that derive from $J$ ethnic groups $S$; then model (3) characterizes the admixture of ancestries. By factorizing $X$ into $S$ and $W$, the unadmixed ancestral profiles are recovered as the $J$ sources, and the weights of each ancestral profile in the admixed samples form the contribution matrix $W$, which can be further used for systematic population clustering.
When the $s_j$ are assumed to be statistically independent, well-known algorithms such as independent component analysis (ICA) [19] can be employed to estimate the source matrix $S$ and the contribution matrix $W$. However, ICA may yield negative entries in $S$ and $W$, which is mathematically sound but not biologically meaningful. Alternatively, given a non-negative matrix $X$ and non-negativity constraints on both $S$ and $W$, the factorization (3) is known as non-negative matrix factorization (NMF) [15]. Through applications of NMF to image processing and document mining, Lee and Seung [17] showed that NMF can learn the common information from mixture data. Based on this property of NMF, the common CNVs can be found from the CNVs detected in each individual sample.
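The shapes involved in the mixture model can be checked with a small numerical example. The sketch below builds a toy mixture from two sources and six samples; the helper name `mix_sources` and all numerical values are hypothetical and serve only to show that column $m$ of $X$ is the weighted sum of the columns of $S$ with weights taken from column $m$ of $W$.

```python
import numpy as np

def mix_sources(S, W):
    """Instantaneous mixture X = S W: column m of X is sum_j W[j, m] * S[:, j]."""
    S = np.asarray(S, dtype=float)
    W = np.asarray(W, dtype=float)
    if S.shape[1] != W.shape[0]:
        raise ValueError("S must be N x J and W must be J x M")
    return S @ W

# Toy data (hypothetical): N = 8 windows, J = 2 sources, M = 6 samples.
S = np.zeros((8, 2))
S[2, 0] = 30.0   # source 1 carries a CNV in window 2
S[5, 1] = 25.0   # source 2 carries a CNV in window 5
W = np.array([[0.9, 0.8, 1.0, 0.1, 0.0, 0.2],    # weights of source 1 per sample
              [0.1, 0.2, 0.0, 0.9, 1.0, 0.8]])   # weights of source 2 per sample
X = mix_sources(S, W)  # shape (8, 6): one column per sample
```

Here the first three samples are dominated by source 1 and the last three by source 2, which is the structure the contribution matrix is expected to reveal.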
Lee and Seung [17] proposed a multiplicative update algorithm to solve (3), which is simple to implement. However, its convergence is not fast enough for our high-dimensional sequence data. Therefore, an alternative algorithm based on the projected gradient method [20] was used in our study.
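For reference, one widely used form of the multiplicative update, for the Euclidean (Frobenius norm) objective, is sketched below in the notation of (3). Whether this is exactly the variant intended above is an assumption, and this is not the projected gradient code of [20] that was used in the study. The function name `nmf_multiplicative`, the small constant `eps`, the random initialization, and the iteration count are illustrative choices.

```python
import numpy as np

def nmf_multiplicative(X, J, n_iter=500, eps=1e-9, seed=0):
    """Approximate a non-negative X (N x M) by S (N x J) times W (J x M).

    Multiplicative updates for the objective ||X - S W||_F^2; entries that
    start positive stay non-negative because every factor is non-negative.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N, M = X.shape
    S = rng.random((N, J)) + eps  # random positive initialization
    W = rng.random((J, M)) + eps
    for _ in range(n_iter):
        W *= (S.T @ X) / (S.T @ S @ W + eps)  # update contributions
        S *= (X @ W.T) / (S @ W @ W.T + eps)  # update sources
    return S, W
```

Each iteration costs only a few matrix products, which explains the simplicity noted above, but many iterations may be needed when $N$ is large.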
III. Results
We downloaded the aligned sequencing data (BAM files) for chromosome 21 of six samples from the 1000 Genomes Project. These six samples comprise a CEU trio of European ancestry (NA12878, daughter; NA12891, father; NA12892, mother) and a YRI trio of Yoruba Nigerian ancestry (NA19238, mother; NA19239, father; NA19240, daughter).
For each individual sample, SAMtools [21] was first used to generate the DOC profile from the downloaded BAM file. The window size was set to 1 kbp to reduce the computational burden. Then the method described in Sec. II-A was used to detect CNVs. The lower and upper thresholds for calling a CNV were determined from the histogram of the DOC such that 10% of the short reads (as CNVs cover approximately 10% of the human genome) fall outside the normal region. The normal region is defined as the interval between the lower and upper thresholds, centered at the peak of the histogram. Fig. 1 shows the detected CNV regions of the six samples within genomic coordinates 40~46 Mbp. We note that each sample of the YRI trio has a CNV near genomic coordinate 44.75 Mbp.
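One possible reading of the threshold rule is sketched below: locate the histogram peak, then take the smallest symmetric interval around it that leaves a fraction of 10% outside. The exact procedure used in the study is not specified beyond the description above, so the function name `cnv_thresholds`, the number of histogram bins, and the choice to count windows rather than reads are assumptions of this example.

```python
import numpy as np

def cnv_thresholds(doc, outside_frac=0.10, bins=100):
    """Return (lower, upper) thresholds centered at the DOC histogram peak.

    The half-width is chosen so that about `outside_frac` of the windows
    fall outside the interval [peak - h, peak + h].
    """
    doc = np.asarray(doc, dtype=float)
    counts, edges = np.histogram(doc, bins=bins)
    k = int(np.argmax(counts))
    peak = 0.5 * (edges[k] + edges[k + 1])  # center of the most populated bin
    h = np.quantile(np.abs(doc - peak), 1.0 - outside_frac)
    return peak - h, peak + h
```

Windows whose DOC lies below the lower threshold would then be candidate deletions, and windows above the upper threshold candidate duplications.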
Fig. 1.
Detected CNV regions within 40~46 Mbp. The amplitude of each spike represents the DOC value.
Once the CNVs of each individual sample were detected, the DOC values of the CNV regions were entered as the columns of the mixture matrix $X$, with each column corresponding to a sample. Regions without a CNV were set to 0. We then used the NMF code written by Lin [20] to factorize $X$ into $S$ and $W$. The algorithm was initialized with random positive matrices $S_0$ and $W_0$. Since there are two ethnic groups, the parameter $J$ was set to 2. Fig. 2 displays the hierarchical clustering of $W$, and Fig. 3 displays the two columns of $S$. The clustering result is consistent with that of Magi et al. [22], which was obtained from chromosome 1, except that the YRI daughter is genomically closer to her mother than to her father. Interestingly, Fig. 2 shows that source $s_1$ (the first column of $S$) contributes more to the YRI trio than to the CEU trio (the right half of $w_1$ is 'hotter' than the left half). Comparing $s_1$ with $s_2$ in Fig. 3, we found that $s_1$ has a prominent CNV located near 44.75 Mbp, indicating that this is a common CNV that can differentiate the CEU trio from the YRI trio. To verify this result, the DOCs of the six individual samples are shown in Fig. 4. All the DOCs of the YRI trio have peaks near 44.75 Mbp, while those of the CEU trio do not.
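The two data-handling steps in this paragraph, assembling the mixture matrix and clustering the samples by their contributions, can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the function names `build_mixture_matrix` and `cluster_columns`, the per-sample threshold arrays, the normalization of each column of $W$, and the average-linkage rule with Euclidean distance are all choices made for the example.

```python
import numpy as np

def build_mixture_matrix(docs, lows, highs):
    """Stack per-sample DOC profiles as columns and zero out non-CNV windows.

    docs: list of equal-length DOC arrays, one per sample.
    lows, highs: per-sample lower and upper CNV-calling thresholds.
    """
    X = np.column_stack([np.asarray(d, dtype=float) for d in docs])
    lows = np.asarray(lows, dtype=float)
    highs = np.asarray(highs, dtype=float)
    is_cnv = (X < lows) | (X > highs)  # thresholds broadcast across columns
    return np.where(is_cnv, X, 0.0)

def cluster_columns(W, n_clusters):
    """Average-linkage agglomerative clustering of the columns of W."""
    W = np.asarray(W, dtype=float)
    cols = W.T / (np.linalg.norm(W, axis=0)[:, None] + 1e-12)  # unit-length samples
    dist = np.linalg.norm(cols[:, None, :] - cols[None, :, :], axis=2)
    clusters = [[m] for m in range(cols.shape[0])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)      # b > a, so index a stays valid
        clusters[a].extend(merged)
    labels = np.empty(cols.shape[0], dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

With six samples and `n_clusters=2`, the returned labels give one group index per sample, which can be compared against the known trio membership.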
Fig. 2.
Clustering of the contribution matrix $W$. The two rows labeled $w_1$ and $w_2$ represent the weights of sources $s_1$ and $s_2$.
Fig. 3.
First and second columns (upper and lower panels) of the source matrix $S$ within 40~46 Mbp.
Fig. 4.
The DOCs of six individual samples within 40~46 Mbp.
IV. Conclusion
We have proposed a method that discovers common CNVs based on a source separation technique (NMF). We showed that information from common CNVs is useful for clustering different ethnic groups. Our analysis of real sequencing data from two family trios supports the method and demonstrates its potential for uncovering the genetic causes of evolution.
The proposed method is not limited to classifying two ethnic groups, as demonstrated in the Results section; the parameter $J$ controls the number of ethnic groups. However, for blind clustering, the choice of $J$ remains an open question.
It is worth noting that two related works were recently published by Magi et al. [22] and Klambauer et al. [23]. Similar to our method, their methods, JointSLM and cn.MOPS respectively, use multiple samples to detect CNVs, but they are appropriate under different conditions. The former focuses on the detection of common CNVs that recur at the same location, while the latter aims to substantially reduce false positives using the information provided by multiple samples. Magi et al. also presented a clustering approach based on CNVs, in which the columns of the mixture matrix $X$ are clustered directly. For large sample sizes, this approach is not applicable because of the high dimensionality of $X$. Instead, our proposed method clusters the weight matrix $W$, which not only significantly reduces the data dimensionality but also reduces the variation in the data, resulting in better analysis.
Future studies include the following goals: (1) to apply the proposed method to whole-genome analysis (in the current preliminary study, only chromosome 21, the shortest human chromosome, was processed and displayed to reduce computation); (2) to compare it with other approaches such as JointSLM [22] and cn.MOPS [23]; and (3) to validate the method with more samples in the study of evolutionary genetics.
Footnotes
This work was partially supported by NSF and NIH grants.
References
- [1] Lin Y, Li J, Shen H, Zhang L, Papasian CJ, Deng H-W. Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics. 2011 Aug;27(15):2031–2037. doi: 10.1093/bioinformatics/btr319.
- [2] Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443–451. doi: 10.1038/nrg2986.
- [3] Medvedev P, Stanciu M, Brudno M. Computational methods for discovering structural variation with next-generation sequencing. Nat Methods. 2009 Nov;6:13–20. doi: 10.1038/nmeth.1374.
- [4] Redon R, et al. Global variation in copy number in the human genome. Nature. 2006 Nov;444(7118):444–454. doi: 10.1038/nature05329.
- [5] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M. Strong association of de novo copy number mutations with autism. Science. 2007 Apr;316:445–449. doi: 10.1126/science.1138659.
- [6] Stefansson H, et al. Large recurrent microdeletions associated with schizophrenia. Nature. 2008 Sep;455:232–236. doi: 10.1038/nature07229.
- [7] Rovelet-Lecrux A, Hannequin D, Raux G, Le Meur N, Laquerrière A, Vital A, Dumanchin C, Feuillette S, Brice A, Vercelletto M, Dubas F, Frebourg T, Campion D. APP locus duplication causes autosomal dominant early-onset Alzheimer disease with cerebral amyloid angiopathy. Nat Genet. 2006 Jan;38(1):24–26. doi: 10.1038/ng1718.
- [8] Campbell PJ, Stephens PJ, Pleasance ED, O’Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, Teague JW, Menzies A, Goodhead I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C, Durbin R, Hurles ME, Edwards PAW, Bignell GR, Stratton MR, Futreal PA. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet. 2008 Jun;40:722–729. doi: 10.1038/ng.128.
- [9] Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, Aburatani H, Jones KW, Tyler-Smith C, Hurles ME, Carter NP, Scherer SW, Lee C. Copy number variation: new insights in genome diversity. Genome Res. 2006 Aug;16:949–961. doi: 10.1101/gr.3677206.
- [10] Chiang DY, Getz G, Jaffe DB, O’Kelly MJT, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods. 2009 Jan;6:99–103. doi: 10.1038/nmeth.1276.
- [11] Yoon S, Xuan Z, Makarov V, Ye K, Sebat J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009 Sep;19:1586–1592. doi: 10.1101/gr.092981.109.
- [12] Xie C, Tammi MT. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics. 2009;10:80. doi: 10.1186/1471-2105-10-80.
- [13] Duan J, Zhang J-G, Deng H-W, Wang Y-P. Comparative studies of copy number variation detection methods for next generation sequencing technologies. Tech. Rep., Tulane University; 2012.
- [14] Duan J, Zhang J-G, Lefante J, Deng H-W, Wang Y-P. Detection of copy number variation from next generation sequencing data with total variation penalized least square optimization. IEEE International Conference on Bioinformatics and Biomedicine Workshops; Atlanta, GA, USA; Nov. 2011. pp. 3–12.
- [15] Berry MW, Browne M, Langville AN, Pauca VP, Plemmons RJ. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis. 2006:155–173.
- [16] O’Grady PD, Pearlmutter BA, Rickard ST. Survey of sparse and non-sparse methods in source separation. Int J Imag Syst Tech, special issue on Blind Source Separation and De-convolution in Imaging and Image Processing. 2005;15(1):18–33.
- [17] Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999 Oct;401(6755):788–791. doi: 10.1038/44565.
- [18] Bentley DR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008 Nov;456:53–59. doi: 10.1038/nature07517.
- [19] Hyvärinen A. Survey on independent component analysis. Neural Computing Surveys. 1999;2:94–128.
- [20] Lin C-J. Projected gradient methods for nonnegative matrix factorization. Neural Computation. 2007;19:2756–2779. doi: 10.1162/neco.2007.19.10.2756.
- [21] Li H, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009 Aug;25(16):2078–2079. doi: 10.1093/bioinformatics/btp352.
- [22] Magi A, Benelli M, Yoon S, Roviello F, Torricelli F. Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithm. Nucleic Acids Res. 2011 May;39(10):e65. doi: 10.1093/nar/gkr068.
- [23] Klambauer G, Schwarzbauer K, Mayr A, Clevert D-A, Mitterecker A, Bodenhofer U, Hochreiter S. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res. 2012 Feb. doi: 10.1093/nar/gks003.




