European Journal of Human Genetics. 2011 Jul 6;19(11):1167–1172. doi: 10.1038/ejhg.2011.103

Data-driven approach to detect common copy-number variations and frequency profiles in a population-based Korean cohort

Sanghoon Moon 1,2, Young Jin Kim 1,2, Chang Bum Hong 1, Dong-Joon Kim 1, Jong-Young Lee 1, Bong-Jo Kim 1,*
PMCID: PMC3198136  PMID: 21731056

Abstract

To date, hundreds of thousands of copy-number variations (CNVs) have been reported using various platforms. The proportion of Asians in these data is, however, relatively small compared with that of other ethnic groups, such as Caucasians and Yorubas. Because of limitations in platform resolution and the high noise level in signal intensity, in most CNV studies (particularly those using single nucleotide polymorphism arrays), the average number of CNVs detected in an individual is less than the number of known CNVs. In this study, we ascertained reliable, common CNV regions (CNVRs) and identified their actual frequency rates in the Korean population to provide more CNV information. We performed two-stage analyses for detecting structural variations with two platforms. We discovered 576 common CNVRs (88 CNV segments on average in an individual), and 87% (501 of 576) of these CNVRs overlapped by ≥1 bp with previously validated CNV events. Interestingly, the frequency analysis of CNV profiles showed that 52 of 576 CNVRs had a frequency rate of <1% in the 8842 individuals. Compared with other common CNV studies, this study found six common CNVRs that had not been reported previously. In conclusion, we propose a data-driven detection approach to discover common CNVRs, including those unreported in the previous Korean CNV study, while minimizing false positives. Through our approach, we discovered more common CNVRs than the previous Korean CNV study and conducted a frequency analysis. These results will be a valuable resource for assessing the effective level of CNVs in the Korean population.

Keywords: common copy-number variation, CNV profile, Asian CNV, structural variation

Introduction

A copy-number variation (CNV) is a duplication or deletion of a DNA segment ranging from a kilobase to several megabases and is relatively common and widespread in the human genome. CNVs are also major sources of genomic variation along with single nucleotide polymorphisms (SNPs).1, 2 In contrast to SNPs, certain CNVs encompass a single gene or a set of genes, and thus duplication or deletion of these CNVs is associated with disease susceptibility and gene dosage.1, 2 In addition, some CNV studies have reported different profiles of population-specific CNVs,3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 and these stratified CNVs have been used for further studies of natural selection and human adaptation to environmental pressures.15 To perform a study of genomic structure with CNVs, it is essential to accurately ascertain CNVs that consistently occur in the population.15

To date, hundreds of thousands of CNVs from large-scale analyses have been reported using various high-resolution genotyping platforms.3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 These data have been deposited in several public databases, such as the Database of Genomic Variants (DGV; http://projects.tcag.ca/variation/) for normal individuals and the Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources (https://decipher.sanger.ac.uk/) for affected individuals. Although certain CNVs differ among ethnic groups, in practice the proportion of Asians in these data has been relatively small compared with that of other ethnic groups, such as Caucasians and Yorubas. Hence, more Asian CNV studies are needed to understand the genomic loci that account for phenotypic variation and genetic etiology in populations with Asian ancestry.

Recent reports have suggested that common CNVs that can be typed on existing platforms are unlikely to associate with common human diseases.6 Despite several successes,16, 17, 18 common CNVs seem to be nominal genomic markers for studying complex phenotypes. The number of CNVs used for association analyses has been limited by low resolution, platform specificity and the type of CNV class. Most previous studies extracted CNV information from SNP genotyping chips, which are not optimally designed for CNV discovery. Even with CGH chips, which are designed for detecting structural variation, Park et al7 reported that, in comparison with CNVs identified in AK1, ∼40% of CNV calls from one specific platform (Agilent 24M aCGH, Santa Clara, CA, USA) were not detectable on the other (NimbleGen 42M aCGH, Indianapolis, IN, USA) because of platform specificity. Therefore, the discovery of common CNVs is still regarded as a challenging issue for research concerning complex phenotypes.

The purpose of our study was to discover reliable, common CNV regions (CNVRs) and to examine the frequency of CNVs in the Korean population. To achieve our aim, we performed two-stage analyses for detecting structural variations. Because of the low signal-to-noise ratio of SNP genotyping chips, the considerable number of falsely detected CNVs has made it challenging to construct CNVRs. In the first stage, we thus identified highly reliable CNVRs from 100 individuals with both a NimbleGen HD2 3 × 720K aCGH assay and the Affymetrix Genome-Wide Human SNP Array 5.0 (Santa Clara, CA, USA). In the second stage, we carried out frequency analyses of the identified CNVRs in 8842 Korean individuals genotyped with the Affymetrix Genome-Wide Human SNP Array 5.0. Our results indicate that this is a reliable approach for detecting common CNVRs and surveying ethnically specific CNV patterns.

Subjects and methods

Study subjects

To undertake a large-scale genome-wide analysis, 10 038 healthy individuals (aged 40–69 years) enrolled in the population-based cohort of the Korean Genome Epidemiology Study (KoGES) were genotyped with the Affymetrix Genome-Wide Human SNP Array 5.0. On the basis of the genotyping data, 1196 (11.9%) of the 10 038 individuals were excluded during the SNP quality control procedures (sample call rate ≥96%, heterozygosity ≥30%, gender inconsistency check, exclusion of patients with tumors and population stratification check). This data set was composed of 52.7% females.19 The raw intensity data files (CEL files) from the remaining 8842 samples were used to create normalized log2 intensity ratios.

Characteristics of CNV genotyping platforms

Two different types of genotyping platforms were used to detect CNVs: the NimbleGen HD2 3 × 720K aCGH (dual-channel array platform) and the Affymetrix Genome-Wide Human SNP Array 5.0 (single-channel array platform). The NimbleGen HD2 3 × 720K aCGH provides more than 720 000 probes for the detection of CNVs. The median inter-probe spacing of the backbone is <5 kb. Additional probes target about 38 000 DGV CNVRs and 8599 CNV events (CNVEs) validated by the Wellcome Trust Case Control Consortium.6 The Affymetrix Genome-Wide Human SNP Array 5.0 contains 500 568 SNP probes and an additional 420 000 CNV probes. Among the CNV probes, 100 000 were chosen to cover 2000 CNVs identified in the University of California Santa Cruz Genome Browser database, and the other 320 000 probes were evenly distributed across the genome.

Discovery of reliable and common CNVs

One source of false calls on SNP chips is spurious CNVs generated from noise. Most detection tools raise the threshold to favor true positives and consequently detect only a small number of CNVs. To reduce false positives, we instead used the opposite strategy: we lowered the threshold to detect many putative CNVs and then filtered the false calls using a genuine CNVR map. To build a reliable CNVR map and carry out a frequency analysis, we performed two-stage analyses for detecting structural variations. In the first stage, we adopted a data-driven CNV detection approach using highly reliable CNVRs discovered in 100 individuals with both the NimbleGen HD2 3 × 720K aCGH assay and the Affymetrix Genome-Wide Human SNP Array 5.0. The 100 individuals were randomly selected from among the 8842 individuals and were genotyped on the two platforms using the same DNA isolated from peripheral blood. Figure 1 and Supplementary Figure S1 show the overall scheme of our CNV analysis. CNV segments detected in each autosomal region from the Affymetrix platform were compared with those from the NimbleGen platform in the same individual. From the NimbleScan v2.5 results of the 100 samples, we compiled CNVRs and used them as the genuine CNVR map for further analysis (Supplementary Figure S1). This step was conducted independently for each of the 100 individuals. Next, we defined a reliable, common CNVR as a region that overlapped in more than two of the 100 samples and was detected on both platforms. In the second stage, we carried out a frequency analysis of the discovered CNVRs in the 8842 Korean individuals genotyped with the Affymetrix Genome-Wide Human SNP Array 5.0 (Figure 1 and Supplementary Figure S1). Supplementary Table S1 shows the actual frequency of each of the 576 CNV loci.
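The filtering and region-building logic described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `Segment` record, the function names, the inclusive coordinate convention and the toy coordinates are all assumptions made for illustration, and `min_samples=3` is one reading of "more than two samples".

```python
from collections import namedtuple

# Hypothetical record for one called segment (inclusive coordinates assumed).
Segment = namedtuple("Segment", "sample chrom start end")

def overlaps(a, b):
    """True if two segments on the same chromosome share at least 1 bp."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end

def concordant_segments(snp_calls, acgh_calls):
    """Keep SNP-array segments that overlap an aCGH segment in the same sample."""
    return [s for s in snp_calls
            if any(s.sample == g.sample and overlaps(s, g) for g in acgh_calls)]

def merge_into_cnvrs(segments, min_samples=3):
    """Merge overlapping segments per chromosome into regions and keep
    regions supported by at least min_samples distinct samples."""
    regions = []
    for seg in sorted(segments, key=lambda s: (s.chrom, s.start)):
        last = regions[-1] if regions else None
        if last and last["chrom"] == seg.chrom and seg.start <= last["end"]:
            last["end"] = max(last["end"], seg.end)
            last["samples"].add(seg.sample)
        else:
            regions.append({"chrom": seg.chrom, "start": seg.start,
                            "end": seg.end, "samples": {seg.sample}})
    return [r for r in regions if len(r["samples"]) >= min_samples]

# Toy calls: the chr2 segment has no aCGH support and is filtered out.
snp = [Segment("s1", "chr1", 100, 200), Segment("s2", "chr1", 150, 250),
       Segment("s3", "chr1", 180, 300), Segment("s1", "chr2", 500, 600)]
acgh = [Segment("s1", "chr1", 120, 210), Segment("s2", "chr1", 140, 260),
        Segment("s3", "chr1", 170, 310)]

kept = concordant_segments(snp, acgh)
cnvrs = merge_into_cnvrs(kept)
```

In this toy input the three concordant chr1 segments merge into a single region spanning 100 to 300 that is supported by three samples.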

Figure 1.


Overall scheme of our CNV analysis. We performed a two-stage analysis to detect structural variations. In the first stage, we discovered highly reliable CNVRs from 100 individuals using two different platforms. The 100 individuals were randomly selected from 8842 individuals from the KoGES and then genotyped on the NimbleGen HD2 3 × 720K aCGH platform and the Affymetrix Genome-Wide Human SNP Array 5.0 using the same DNA isolated from peripheral blood. In the following stage, we carried out frequency analysis on the discovered 576 CNVRs in 8842 Korean individuals.

Signal intensities were extracted from each platform. For the Affymetrix Genome-Wide Human SNP Array 5.0, pre-processing procedures, such as background subtraction, normalization and probe-set summarization, were performed with the apt-probeset-summarize application (Affymetrix Power Tools; http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx). After pre-processing, the signal intensity ratio between the test and reference samples (two replicates of NA10851 (Coriell), a HapMap cell line DNA) for each probe was transformed to the log2 scale and annotated with the chromosomal coordinates of the probes (University of California Santa Cruz version hg18/NCBI Build 36). CNV segments were called using the Genome Alteration Detection Analysis (GADA) segmentation algorithm.20 To define the threshold, T values from 3 to 8 were tested. Finally, we ran the GADA R-package on the 8842 individuals with T=3.5, α=0.2 and MinSegLen=6, a relaxed threshold chosen to maximize the number of matches with CNV segments from the NimbleGen platform. Of the CNV detection results from the two reference replicates (both generated from NA10851 cells), we selected the one with more matches. For the NimbleGen HD2 3 × 720K, all 100 samples passed experimental control metrics, such as chromosome X shift and mad.1dr, with NimbleScan v2.5. After quality control procedures, signal intensities for each probe were also extracted from the test and NA10851 cell line DNA. The threshold for defining CNV segments was set to an average log2 ratio of ±0.3.
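As a rough illustration of the log2 transformation and the ±0.3 threshold, the following Python sketch labels a segment from its average log2 ratio. The function names and intensity values are hypothetical, and treating a value exactly at the threshold as a call is an assumption; the sketch does not reproduce GADA or NimbleScan.

```python
import math

def log2_ratio(test_intensity, ref_intensity):
    """Log2 ratio of test to reference signal intensity for one probe."""
    return math.log2(test_intensity / ref_intensity)

def call_state(mean_log2_ratio, threshold=0.3):
    """Label a segment by its average log2 ratio.
    Assumption: values exactly at +/- threshold count as gain or loss."""
    if mean_log2_ratio >= threshold:
        return "gain"
    if mean_log2_ratio <= -threshold:
        return "loss"
    return "neutral"

# Hypothetical intensities (test, reference) for three segments.
states = [call_state(log2_ratio(t, r))
          for t, r in [(1500, 1000), (500, 1000), (1100, 1000)]]
```

With these made-up intensities the ratios are about 0.58, exactly -1 and about 0.14, so the three segments are labelled gain, loss and neutral.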

Validation of CNVRs by TaqMan copy-number assays

To assess the false-positive rate of our CNV calling, we randomly selected 20 CNV loci from the 576 CNVRs (13 gain and 7 loss loci) for validation. Table 1 shows characteristics of the 20 validated CNVRs. For each CNVR, we carried out TaqMan Copy Number Assays (Applied Biosystems, Foster City, CA, USA) with 20 pre-designed primers. All experiments were replicated three times to enhance the validation accuracy.

Table 1. Validated CNVRs from TaqMan copy number assays.

Chromosome Start Stop Length State TPa FPb PPVc
Chr1 88868282 89251066 383 kb Gain 100 50 0.667
Chr2 159668015 159669082 1 kb Gain 25 5 0.833
Chr3 163996134 164103516 107 kb Gain 29 1 0.967
Chr4 61621813 61624741 2.9 kb Gain 30 0 1.000
Chr4 186678748 186681053 2.3 kb Gain 28 2 0.933
Chr6 54037057 54041872 4.8 kb Loss 30 0 1.000
Chr6 79577799 79581568 3.8 kb Loss 30 0 1.000
Chr6 74647117 74656877 9.8 kb Loss 124 26 0.827
Chr7 22401397 22403203 1.8 kb Gain 22 8 0.733
Chr7 70058931 70063735 4.8 kb Gain 27 3 0.900
Chr8 112363280 112365400 2.1 kb Loss 25 5 0.833
Chr9 70927986 70933108 5.1 kb Gain 28 2 0.933
Chr9 23348479 23367652 19.2 kb Loss 117 33 0.780
Chr10 31483686 31484714 1 kb Gain 21 9 0.700
Chr13 37955433 37958105 2.7 kb Gain 21 9 0.700
Chr13 49967437 49970106 2.7 kb Loss 29 1 0.967
Chr15 60493430 60494943 1.5 kb Gain 28 2 0.933
Chr18 61917841 61920121 2.3 kb Loss 26 4 0.867
Chr20 1337680 1338783 1.1 kb Gain 26 4 0.867
Chr21 43794757 43797350 2.6 kb Gain 27 3 0.900

Abbreviations: a TP, true positive; b FP, false positive; c PPV, positive predictive value.

PPV = no. of TP/(no. of TP + no. of FP); values in the table are given as proportions.
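The PPV definition in the footnote can be checked against individual rows of Table 1 with a short Python sketch. The function name and dictionary keys are illustrative; the TP and FP counts are copied from three rows of the table.

```python
def ppv(tp, fp):
    """Positive predictive value as a proportion: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("PPV is undefined when there are no positive calls")
    return tp / (tp + fp)

# (TP, FP) counts taken from three rows of Table 1.
rows = {
    "Chr1:88868282": (100, 50),
    "Chr2:159668015": (25, 5),
    "Chr4:61621813": (30, 0),
}
values = {locus: round(ppv(tp, fp), 3) for locus, (tp, fp) in rows.items()}
```

The rounded results (0.667, 0.833 and 1.0) match the PPV column for those three loci.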

Results

Discovery of common CNVRs

Comparing autosomal CNV segments detected on the Affymetrix and NimbleGen platforms for 100 randomly selected samples, we found an average of 88 CNV segments in each individual. In total, 8779 segments detected in the 100 individuals were assigned to CNVRs. Among these segments, the data-driven approach identified 807 common CNVRs that recurred in more than two of the 100 samples. Finally, 576 autosomal CNVRs ranging in length from 1 kb to 4.56 Mb were selected (Supplementary Table S1 and Supplementary Figure S3). The mean and median lengths of these CNVRs were 12.6 and 113 kb, respectively (Supplementary Table S4). Supplementary Table S3 and Supplementary Figure S4 show the number of CNVs that affected genes. The median and mean numbers of probes included within break points were 33.5 and 69.5, respectively (Supplementary Figure S2). Furthermore, we examined the CNV state of each of the 576 regions to determine whether it was a copy-number gain, loss or complex (gain and loss). The CNV type of each of the 576 CNVRs was also classified as either common or rare using a 1% frequency rate cutoff.

Frequency analysis of CNVRs

To establish CNV profiles, we examined how often the detected CNVRs occurred in the total cohort of 8842 individuals screened on Affymetrix 5.0 arrays. Figure 2 shows the distribution of the 576 CNVRs by frequency rate. Most of the detected CNVRs (91%, 524 of 576) had a >1% frequency rate in the 8842 individuals. We suggest that these results support our detection strategy for common CNVs and that there may be additional common CNVRs in Koreans. The largest group of the 576 CNVRs (223 regions) had frequency rates of 1–5%. Interestingly, 47 regions (8% of all CNVRs detected) varied in more than 50% of all 100 cases.
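The common versus rare classification can be sketched in Python as follows. The carrier counts and region names are hypothetical, and treating a rate of exactly 1% as common is an assumption; only the cohort size and the counts 576 and 52 come from the text.

```python
COHORT_SIZE = 8842  # individuals in the frequency analysis

def frequency_rate(n_carriers, cohort_size=COHORT_SIZE):
    """Percentage of individuals carrying a CNV in the region."""
    return 100.0 * n_carriers / cohort_size

def classify(rate_percent, cutoff=1.0):
    """Label a CNVR as common or rare (exactly 1% treated as common: an assumption)."""
    return "common" if rate_percent >= cutoff else "rare"

# Hypothetical carrier counts for three regions.
examples = {"region_A": 50, "region_B": 442, "region_C": 5000}
labels = {name: classify(frequency_rate(n)) for name, n in examples.items()}

# Consistency check on the counts reported in the text.
n_total, n_rare = 576, 52
n_common = n_total - n_rare
share_common = round(100 * n_common / n_total)
```

Here `n_common` evaluates to 524 and `share_common` to 91, in line with the reported 91% (524 of 576).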

Figure 2.


CNV frequency rates for the 576 detected CNVRs in 8842 Korean individuals. (a) We surveyed the frequency rate of each region. Of the 576 CNVRs, 223 had frequency rates of 1–5%, and 52 CNVRs had frequency rates of <1% (rare variant). (b) Frequency rates of the detected CNVRs were divided into gains and losses.

Comparison with previously reported common CNVRs

Figure 3 shows results from the cascade approach comparing our CNVRs with previously reported common CNVRs. Most of the detected CNVRs (87%, 501 of 576) overlapped by ≥1 bp with 8343 autosomal CNVEs.6 Of the 501 CNVRs overlapping with the results of Conrad et al,6 51 had a frequency rate of <1%. Of the remainder (75 of 576), only 37 CNVRs overlapped with common Asian CNVRs from a study that included 10 Korean samples genotyped on the Agilent 24M platform.7 For the 38 non-overlapping CNVRs, we examined their concurrence with common CNVRs from 270 HapMap samples genotyped on the Affymetrix Genome-Wide Human SNP Array 6.0 by McCarroll et al.3 Most (33 of 38) did not concur with those results.

Figure 3.


Comparison of results with previously reported common CNVRs. The 576 detected CNVRs were compared with three well-defined CNVR studies using three kinds of platforms. Most of the detected CNVRs (87%, 501 of 576) overlapped with 8343 autosomal CNVEs identified with the NimbleGen platform.6 Of the remainder (75 of 576), 37 CNVRs overlapped with the results of a study examining common Asian CNVRs with the Agilent platform.7 For the 38 non-overlapping CNVRs, we examined their concurrence with the common CNVRs from 270 HapMap samples typed with the Affymetrix platform.3 Most (33 of 38) did not concur with the results of McCarroll et al.3

We compared these 33 regions with all CNVRs listed in the DGV in more detail (Table 2). Most CNVRs (26 of 33) overlapped with either CNVs or CNVRs from 23 different CNV studies.4, 5, 6, 9, 10, 11, 12, 13, 14, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 In particular, 12 of 33 CNVRs matched CNVEs that were identified by Conrad et al6 but were not listed as validated CNVEs. Consequently, seven of our CNVRs were not listed in the DGV. However, when the 33 CNVRs were compared with the CNVRs described by Yim et al,8 eight regions overlapped. Interestingly, one of these eight CNVRs was among the seven unlisted CNVRs. In total, we therefore found six common, previously unreported CNVRs. When we compared our CNVRs with the union set of CNVRs from the four different studies, we observed the same results as with the cascade approach.
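The cascade comparison can be sketched in Python as successive filtering against reference sets, followed by a bookkeeping check of the counts reported above. The function names, the tuple layout `(chrom, start, end)`, the inclusive ≥1 bp overlap rule and the toy intervals are assumptions for illustration, not the procedure actually run in the study.

```python
def overlaps_any(region, reference):
    """True if region shares at least 1 bp with any reference interval
    (tuples are (chrom, start, end) with inclusive coordinates)."""
    chrom, start, end = region
    return any(chrom == c and start <= e and s <= end for c, s, e in reference)

def cascade(regions, reference_sets):
    """Compare regions with each reference set in turn, passing only the
    non-overlapping regions on to the next comparison."""
    remaining = list(regions)
    matched_counts = []
    for reference in reference_sets:
        matched = [r for r in remaining if overlaps_any(r, reference)]
        remaining = [r for r in remaining if not overlaps_any(r, reference)]
        matched_counts.append(len(matched))
    return matched_counts, remaining

# Toy intervals standing in for our CNVRs and three reference studies.
regions = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80), ("chr3", 10, 20)]
ref_a = [("chr1", 150, 160)]
ref_b = [("chr2", 80, 90)]   # touches chr2:50-80 at exactly 1 bp
ref_c = [("chr1", 700, 800)]
counts, left = cascade(regions, [ref_a, ref_b, ref_c])

# Bookkeeping for the counts reported in the text.
after_conrad = 576 - 501                  # 75 CNVRs remain
after_park = after_conrad - 37            # 38 remain
after_mccarroll = after_park - (38 - 33)  # 33 remain
unlisted_in_dgv = after_mccarroll - 26    # 7 not listed in the DGV
previously_unreported = unlisted_in_dgv - 1  # one overlaps Yim et al
```

The bookkeeping lines end at six, the number of previously unreported CNVRs stated in the text.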

Table 2. The 33 common CNVRs not described as CNV loci in the studies by Conrad et al,6 Park et al7 and McCarroll et al.3

Chr Start End References for overlapping CNVs in DGV Gain frequency (%) Loss frequency (%)
Chr2a 3803336 3952526 McKernan et al21 79.9 0.1
Chr2 223468381 223471352 Bentley et al,22 Korbel et al,23 Levy et al,24 McKernan et al,21 Wang et al,11 Wheeler et al25 0.0 1.3
Chr3a 4234309 4252262 Itsara et al,10 Redon et al,4 Shaikh et al,9 Zogopoulos et al26 0.5 18.1
Chr4 190062985 190128452 Conrad et al,6 Zogopoulos et al26 1.0 0.7
Chr5 1785959 1822301 No overlapping CNVs described, indel, inversion 18.0 0.0
Chr7a 11278981 11314260 Conrad et al,27 Matsuzaki et al,28 Redon et al4 0.1 13.6
Chr8a 92185630 92249955 McKernan et al,21 Pinto et al,29 Redon et al4 0.4 21.4
Chr8 145820725 145821995 Wong et al14 3.1 0.0
Chr9 38189356 38215032 Redon et al,4 Wong et al14 26.5 0.0
Chr9 136496422 136511685 Conrad et al,6 Levy et al,24 Wang et al,30 Wong et al14 20.5 5.7
Chr9 136881046 136882910 Conrad et al6 3.6 0.0
Chr10 134015916 134026670 Conrad et al,6 Itsara et al10 4.1 0.1
Chr11 45080630 45089496 No overlapping CNVs described, indel 22.2 0.0
Chr12 131955341 131960337 McKernan et al,21 Matsuzaki et al28 1.8 0.1
Chr13a 54839168 54850687 No overlapping CNVs described, indel 0.1 11.5
Chr15 83262590 83272701 Itsara et al,10 Jakobsson et al31 1.6 0.1
Chr17a 113372 119301 Kidd et al5 7.1 0.0
Chr17 2054397 2149701 No overlapping CNVs described, indel 0.4 16.0
Chr17a 78371877 78373883 Conrad et al6 2.8 0.1
Chr18 75410941 75413108 Ahn et al,32 Conrad et al,6 Levy et al,24 Wang et al,11 Wong et al14 2.5 0.7
Chr18 75667308 75669900 Conrad et al6 2.6 0.3
Chr19 311044 312669 Conrad et al,6 Mills et al,33 Perry et al,12 Redon et al4 1.8 0.2
Chr19 355959 374597 Conrad et al,6 Jakobsson et al,31 Wang et al30 1.8 0.2
Chr19 985108 996171 Conrad et al,6 Locke et al,34 McKernan et al,21 Perry et al,12 Wong et al14 1.6 0.2
Chr19 1578740 1580117 Conrad et al,6 de Smith et al13 1.8 0.2
Chr19 5201759 5219582 No overlapping CNVs described, indel 1.6 0.0
Chr19 44503723 44527789 Conrad et al6 1.2 0.0
Chr19 48834740 48836125 Conrad et al6 0.9 0.0
Chr19a 58605551 58633953 Ahn et al,32 Itsara et al,10 Jakobsson et al,31 Matsuzaki et al,28 McKernan et al,21 Redon et al,4 Wang et al30 1.0 2.8
Chr20 6917697 6923002 No overlapping CNVs described, indel 0.0 2.8
Chr20 61310499 61325605 Itsara et al,10 Jakobsson et al,31 Kim et al,35 Perry et al,12 Shaikh et al,9 Wang et al,30 Wong et al14 2.7 0.4
Chr21 42147413 42153194 No overlapping CNVs described 1.0 0.2
Chr21 45224142 45226832 de Smith et al,13 Redon et al4 4.4 0.0
a CNVRs overlapping with those of Yim et al.8

Gene ontology analysis of CNVRs

We surveyed RefSeq genes that partially or entirely overlapped the 576 CNVRs and found 629 RefSeq genes. To assess the functional implications of these CNVRs, we conducted gene ontology analysis using the Database for Annotation, Visualization and Integrated Discovery (DAVID) functional annotation tool36 and the PANTHER classification system.37 Table 3 and Supplementary Table S2 show gene ontology results from these two analysis tools. Genes involved in sensory perception, cognition, neurological system processes, defense responses and immune responses were mainly included in the DAVID results. Similarly, PANTHER results showed genes involved in metabolic processes, cellular communication, immune system processes and responses to stimuli.

Table 3. Gene ontology analysis results of genes overlapping with 576 CNVRs using the DAVID functional annotation tool.

Annotated function % P-value FDR
Sensory perception of smell 7.01 3.38E-10 5.51E-07
Sensory perception of chemical stimulus 7.18 2.16E-09 1.76E-06
G-protein-coupled receptor protein signaling pathway 11.11 7.23E-07 3.93E-04
Sensory perception 8.55 3.37E-06 1.38E-03
Cognition 9.06 8.21E-06 2.68E-03
Neurological system process 11.11 8.78E-06 2.39E-03
Defense response 6.67 2.84E-05 6.61E-03
Cell surface receptor-linked signal transduction 14.53 1.06E-04 2.15E-02
Defense response to bacterium 2.22 1.57E-04 2.80E-02
Antigen processing and presentation 1.88 2.10E-04 3.37E-02
Antigen processing and presentation of peptide antigen 1.03 1.42E-03 1.91E-01
Antigen processing and presentation of peptide antigen by MHC class I 0.85 1.49E-03 1.83E-01
Digestion 1.71 1.81E-03 2.04E-01
Antigen processing and presentation of peptide or polysaccharide antigen by MHC class II 1.03 3.03E-03 2.98E-01
Response to bacterium 2.39 6.36E-03 5.01E-01
Homophilic cell adhesion 1.88 6.85E-03 5.04E-01
Epithelial cell differentiation 1.88 9.28E-03 5.91E-01
Keratinization 1.03 9.60E-03 5.83E-01
Immune response 5.64 1.22E-02 6.52E-01
Keratinocyte differentiation 1.20 1.52E-02 7.13E-01
Response to virus 1.54 1.84E-02 7.64E-01
Epidermal cell differentiation 1.20 2.24E-02 8.14E-01
Biological adhesion 5.47 2.43E-02 8.26E-01
Cell adhesion 5.47 2.47E-02 8.17E-01
Neurotransmitter transport 1.20 4.12E-02 9.36E-01
Cell–cell adhesion 2.56 4.41E-02 9.41E-01
Transcription initiation from RNA polymerase II promoter 1.03 5.64E-02 9.70E-01
Exocytosis 1.37 6.23E-02 9.76E-01
RNA elongation 0.85 6.97E-02 9.83E-01
Complement activation, alternative pathway 0.51 7.51E-02 9.86E-01
Endothelial cell differentiation 0.51 8.42E-02 9.90E-01
Epithelium development 2.05 8.85E-02 9.91E-01
Binding of sperm to zona pellucida 0.51 9.36E-02 9.92E-01
Sperm–egg recognition 0.51 9.36E-02 9.92E-01

Abbreviation: FDR, false discovery rate (Benjamini and Hochberg method).

FDRs that are <0.05 are shown in boldface.

Assessment of accuracy of detected CNVRs by experimental validation

To assess the accuracy of our CNV calling strategy, we carried out TaqMan Copy Number Assays (Applied Biosystems) on 20 randomly selected CNV loci among the 576 CNVRs. Table 1 shows characteristics of the 20 validated CNVRs. We defined the positive predictive value (PPV) as the proportion of positive CNV calls that were confirmed by the assay, and we used this statistic as the measure of accuracy. The average PPV of our validation test was 0.886 (Table 1).

Discussion

The high noise level of the signal intensity is one of the drawbacks of CNV studies, especially those that use SNP arrays. It is not easy to filter out noise during the detection process. To separate noise from signals, a parameter or a cutoff of the CNV calling algorithm is adjusted repeatedly to select an optimal value. During this process, many legitimate signals are discounted, and thus a significant number of legitimate CNVs are pruned. For example, although two previous studies using ultra-high-resolution arrays reported 670 and 1098 CNVs on average in an individual, in most CNV studies using SNP arrays,8, 9, 10, 30, 38 the average number of CNVs in each individual has not exceeded 50.

Recently, Yim et al8 discovered a set of CNVRs from 3578 of the 8842 KoGES samples. However, they reported only 40.3 CNVs per genome after analyzing the 3578 individuals. Although a subset of identical samples was used for the analysis, only a few common CNVs were detected because of the limited capability of SNP genotyping chips and the stringent threshold criteria for CNV calling. In practice, Yim et al8 reported 656 common CNVRs among 4003 CNVRs in their study.

In our current study, we used the data-driven detection approach to discover common CNVRs, including those not reported by Yim et al,8 and then confirmed the frequency rates of these regions using a larger sample than that of Yim et al.8

We also compared our results with previously reported common CNVRs. In all, 87% of detected CNVs (501 of 576) overlapped with CNVEs previously validated by Conrad et al.6 However, when we compared our results with previously reported common Asian CNVs7 to determine whether the discrepancies originated from Korean-specific CNVs, 64% of these CNVs (371 of 576) were shared between the studies. We believe that platform-dependent CNVs may explain why this match rate is lower than that with Conrad et al,6 because the NimbleGen HD2 3 × 720K aCGH assay contains targeted CNV probes for the 8599 CNVEs described by Conrad et al (see Characteristics of CNV genotyping platforms in the Subjects and methods).6 Consequently, we found six previously unreported CNVRs. However, we cannot conclude that these CNVRs are Korean specific because of the highly variable discrepancy among genotyping platforms.

We then surveyed the frequency rate of each region and determined whether the region was actually common in the population. As a result, 223 of 576 CNVRs had frequency rates of 1–5%, and 47 CNVRs occurred in >50% of individuals. Interestingly, 52 of 576 CNVRs had a frequency rate of <1% in 8842 individuals. Moreover, 51 of 52 regions overlapped with those reported by Conrad et al,6 as described above.

There are some limitations associated with our approach. First, as we mainly focused on common CNVRs, we could not address the presence of rare CNVs. Second, despite our efforts to solve the noise problem, we still ascertained only a part of the CNVs. Thus, further studies are required to complete the Korean CNV profile.

Nevertheless, we propose a reliable approach for detecting common CNVRs. As only 44% (251 of 576) of our CNVRs matched those of Yim et al,8 the CNVRs from the two studies are complementary to each other. Therefore, further analysis combining the data from Yim et al8 and the data in our study will be a substantial resource for mapping structural variants in the Korean population. Moreover, such a study will give an extended map of CNV markers and contribute to creating an association map of genomic loci that account for the variation of thousands of phenotypes available in the KoGES.

Acknowledgments

This work was supported by an intramural grant from the Korea National Institute of Health (2009-N00469-00, 2010-N73001-00) and grants from Korea Centers for Disease Control and Prevention (4845-301, 4851-302, 4851-307). We are grateful to Professor Yeun-Jun Chung of the Integrated Research Center for Genome Polymorphism, Catholic University of Korea, for making two replicates of reference genotype data (NA10851) available.

The authors declare no conflict of interest.

Footnotes

Supplementary Information accompanies the paper on European Journal of Human Genetics website (http://www.nature.com/ejhg)


References

  1. Freeman JL, Perry GH, Feuk L, et al. Copy number variation: new insights in genome diversity. Genome Res. 2006;16:949–961. doi: 10.1101/gr.3677206.
  2. Choy KW, Setlur SR, Lee C, et al. The impact of human copy number variation on a new era of genetic testing. BJOG. 2010;117:391–398. doi: 10.1111/j.1471-0528.2009.02470.x.
  3. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet. 2008;40:1166–1174. doi: 10.1038/ng.238.
  4. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature. 2006;444:444–454. doi: 10.1038/nature05329.
  5. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature. 2008;453:56–64. doi: 10.1038/nature06862.
  6. Conrad DF, Pinto D, Redon R, et al. Origins and functional impact of copy number variation in the human genome. Nature. 2009;464:704–712. doi: 10.1038/nature08516.
  7. Park H, Kim J, Ju YS, et al. Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing. Nat Genet. 2010;42:400–405. doi: 10.1038/ng.555.
  8. Yim SH, Kim TM, Hu HJ, et al. Copy number variations in East-Asian population and their evolutionary and functional implications. Hum Mol Genet. 2010;19:1001–1008. doi: 10.1093/hmg/ddp564.
  9. Shaikh TH, Gai X, Perin JC, et al. High-resolution mapping and analysis of copy number variations in the human genome: a data resource for clinical and research applications. Genome Res. 2009;19:1682–1690. doi: 10.1101/gr.083501.108.
  10. Itsara A, Cooper GM, Baker C, et al. Population analysis of large copy number variants and hotspots of human genetic disease. Am J Hum Genet. 2009;84:148–161. doi: 10.1016/j.ajhg.2008.12.014.
  11. Wang J, Wang W, Li R, et al. The diploid genome sequence of an Asian individual. Nature. 2008;456:60–65. doi: 10.1038/nature07484.
  12. Perry GH, Ben-Dor A, Tsalenko A, et al. The fine-scale and complex architecture of human copy-number variation. Am J Hum Genet. 2008;82:685–695. doi: 10.1016/j.ajhg.2007.12.010.
  13. de Smith AJ, Tsalenko A, Sampras N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex disease. Hum Mol Genet. 2007;16:2783–2794. doi: 10.1093/hmg/ddm208.
  14. Wong KK, deLeeuw RJ, Dosanjh NS, et al. A comprehensive analysis of common copy-number variations in the human genome. Am J Hum Genet. 2007;80:91–104. doi: 10.1086/510560.
  15. Wu LY, Chipman HA, Bull SB, et al. A Bayesian segmentation approach to ascertain copy number variations at the population level. Bioinformatics. 2009;25:1669–1679. doi: 10.1093/bioinformatics/btp270.
  16. Gonzalez E, Kulkarni H, Bolivar H, et al. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science. 2005;307:1434–1440. doi: 10.1126/science.1101160.
  17. Fanciulli M, Norsworthy PJ, Petretto E, et al. FCGR3B copy number variation is associated with susceptibility to systemic, but not organ-specific autoimmunity. Nat Genet. 2007;39:721–723. doi: 10.1038/ng2046.
  18. Perry GH, Dominy NJ, Claw KG, et al. Diet and the evolution of human amylase gene copy number variation. Nat Genet. 2007;39:1256–1260. doi: 10.1038/ng2123.
  19. Cho YS, Go MJ, Kim YJ, et al. A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits. Nat Genet. 2009;41:527–534. doi: 10.1038/ng.357.
  20. Pique-Regi R, Monso-Varona J, Ortega A, Seeger RC, Triche TJ, Asgharzadeh S. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics. 2008;24:309–318. doi: 10.1093/bioinformatics/btm601.
  21. McKernan KJ, Peckham HE, Costa GL, et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 2009;19:1527–1541. doi: 10.1101/gr.091868.109.
  22. Bentley DR, Balasubramanian S, Swerdlow HP, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59. doi: 10.1038/nature07517.
  23. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007;318:420–426. doi: 10.1126/science.1149504.
  24. Levy S, Sutton G, Ng PC, et al. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e254. doi: 10.1371/journal.pbio.0050254.
  25. Wheeler DA, Srinivasan M, Egholm M, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452:872–876. doi: 10.1038/nature06884.
  26. Zogopoulos G, Ha KC, Naqib F, et al. Germ-line DNA copy number frequencies in a large North American population. Hum Genet. 2007;122:345–353. doi: 10.1007/s00439-007-0404-5.
  27. Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet. 2005;38:75–81. doi: 10.1038/ng1697.
  28. Matsuzaki H, Wang PH, Hu J, Rava R, Fu GK. High resolution discovery and confirmation of copy number variants in 90 Yoruba Nigerians. Genome Biol. 2009;10:R125. doi: 10.1186/gb-2009-10-11-r125.
  29. Pinto D, Marshall C, Feuk L, Scherer SW. Copy-number variation in control population cohorts. Hum Mol Genet. 2007;16:R168–R173. doi: 10.1093/hmg/ddm241.
  30. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007;17:1665–1674. doi: 10.1101/gr.6861907.
  31. Jakobsson M, Scholz SW, Scheet P, et al. Genotype, haplotype and copy-number variation in worldwide human populations. Nature. 2008;451:998–1003. doi: 10.1038/nature06742.
  32. Ahn SM, Kim TH, Lee S, et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res. 2009;19:1622–1629. doi: 10.1101/gr.092197.109.
  33. Mills RE, Luttig CT, Larkins, et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res. 2006;16:1182–1190. doi: 10.1101/gr.4565806.
  34. Locke DP, Sharp AJ, McCarroll SA, et al. Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am J Hum Genet. 2006;79:275–290. doi: 10.1086/505653.
  35. Kim J, Ju YS, Park H, et al. A highly annotated whole-genome sequence of a Korean individual. Nature. 2009;460:1011–1015. doi: 10.1038/nature08211.
  36. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nat Protoc. 2009;4:44–57. doi: 10.1038/nprot.2008.211.
  37. Mi H, Dong Q, Muruganujan A, Gaudet P, Lewis S, Thomas PD. PANTHER version 7: improved phylogenetic trees, orthologs and collaboration with the Gene Ontology Consortium. Nucleic Acids Res. 2010;38:D204–D210. doi: 10.1093/nar/gkp1019.
  38. Glessner JT, Wang K, Cai G, et al. Autism genome-wide copy number variation reveals ubiquitin and neuronal genes. Nature. 2009;459:569–573. doi: 10.1038/nature07953.

Articles from European Journal of Human Genetics are provided here courtesy of Nature Publishing Group
