Author manuscript; available in PMC: 2014 Mar 26.
Published in final edited form as: Mol Psychiatry. 2013 Aug 13;18(11):1178–1184. doi: 10.1038/mp.2013.98

Detecting Large Copy Number Variants Using Exome Genotyping Arrays In a Large Swedish Schizophrenia Sample

Jin P Szatkiewicz 1, Benjamin M Neale 2,3, Colm O'Dushlaine 3, Menachem Fromer 3,5, Jacqueline I Goldstein 2, Jennifer L Moran 3, Kimberly Chambert 3, Anna Kähler 4, Patrik KE Magnusson 4, Christina M Hultman 4, Pamela Sklar 5, Shaun Purcell 2,5, Steven A McCarroll 3,6, Patrick F Sullivan 1,4,*
PMCID: PMC3966073  NIHMSID: NIHMS546255  PMID: 23938935

Abstract

Although copy number variants (CNVs) are important in genomic medicine, CNVs have not been systematically assessed for many complex traits. Several large rare CNVs increase risk for schizophrenia (SCZ) and autism and often demonstrate pleiotropic effects; however, their frequencies in the general population and in other complex traits are unknown. Genotyping large numbers of samples is essential for progress. Large cohorts from many different diseases are being genotyped using exome-focused arrays designed to detect uncommon or rare protein-altering sequence variation. Although these arrays were not designed for CNV detection, the hybridization intensity data generated in each experiment could, in principle, be used for gene-focused CNV analysis. Our goal was to evaluate the extent to which CNVs can be detected using data from one particular exome array (the Illumina Human Exome BeadChip). We genotyped 9,100 Swedish subjects (3,962 cases with SCZ and 5,138 controls) using both standard GWAS arrays and exome arrays. In comparison to CNVs detected using GWAS arrays, we observed high sensitivity and specificity for detecting genic CNVs ≥400 kb, including known pathogenic CNVs, and we replicated the literature finding that cases with SCZ had greater enrichment for genic CNVs. Our data confirm the association of SCZ with 16p11.2 duplications and 22q11.2 deletions and suggest a novel association with deletions at 11q12.2. Our results suggest the utility of exome-focused arrays in surveying large genic CNVs in very large samples and thereby open the door to new opportunities such as conducting well-powered CNV assessment and comparisons between different diseases. The use of a single platform also minimizes potential confounding factors that could impact accurate detection.

Keywords: schizophrenia, copy number variation, structural variation, genotyping, Illumina, exome array

Introduction

The assessment of structural variation, particularly copy number variation (CNV), has led to important insights into the causes of many genomic disorders (1). CNVs are known to play a role in the etiology of many complex traits, particularly neuropsychiatric disorders like schizophrenia, autism, and developmental delay (2–5). Indeed, CNV evaluation has become a routine diagnostic test in medical genetics, including the evaluation of childhood-onset neurobehavioral disorders.

Although CNVs are of clear importance in genomic medicine, our knowledge base has important deficiencies. For most rare CNVs, large numbers of samples have not been assessed in order to establish population frequencies and thereby enable precise effect size estimation. The impact of many CNVs is pleiotropic; quantifying their phenotypic impact requires evaluation of multiple large case groups. For example, deletion or duplication of 16p11.2 was initially associated with autism (6), and subsequent studies expanded the phenotype to include mental retardation, epilepsy, developmental delay, schizophrenia, and alterations in brain and body size (2, 7–9). In some instances, large samples were evaluated with different technologies, resulting in uncertainty in combining results. Most importantly, CNVs have not been systematically assessed for many complex traits, and their role is unknown for many biomedical diseases.

As many CNVs important in genomic medicine are uncommon or rare, genotyping large numbers of subjects is essential. Exome-focused SNP arrays (“exome chips”) are being used to analyze the genomes of more than one million research subjects to type the protein-altering sequence variation that is segregating at low allele frequencies. Although such arrays were designed to interrogate sequence variation, they could in principle also be used as a gene-centric CNV array. If so, this would enable CNV studies in large sample sizes along with well-powered comparisons between different diseases. Importantly, it would also take advantage of a moment in time at which many large cohorts are being scanned on a common platform within a relatively short time window, thereby minimizing platform and batch variation that can complicate CNV meta-analysis. Thus, the goal of this report was to evaluate the capacity of the Illumina Human Exome BeadChips to detect CNVs.

Subjects and Methods

Subjects

The parent study (10) is described in detail in the Supplemental Methods. All procedures were approved by ethical committees at the Karolinska Institutet in Sweden and at the University of North Carolina at Chapel Hill in the US, and all subjects provided written informed consent (or legal guardian consent and subject assent). Cases with schizophrenia (N = 5,351) were identified via the Swedish Hospital Discharge Register, which captures all public and private inpatient hospitalizations. The validity of the case definition of schizophrenia is strongly supported. Controls (N = 6,509) were selected at random from Swedish population registers. DNA was extracted from peripheral blood samples of all 11,850 Swedish subjects using Qiagen technologies at the Karolinska Institutet Biobank.

Exome array genotyping

DNA samples were sent to the Broad Institute Genetic Analysis Platform for genotyping and were placed on 96-well plates for processing using the Illumina Infinium HumanExome BeadChip v1.0. The majority of exome genotypes were called using GenomeStudio v2010.3 with the calling algorithm/genotyping module version 1.8.4 and the custom cluster file StanCtrExChp_CEPH.egt; subsequent processing of genotype calls was done with zCall (11). The Broad Institute did not filter any SNPs based on technical quality control metrics. Only samples passing an overall call-rate criterion of 98% and a standard identity check were released.

GWAS array genotyping

DNA samples were genotyped in six batches (Sw1-6) at the Broad Institute using Affymetrix 5.0 (3.9%), Affymetrix 6.0 (38.6%), and Illumina OmniExpress (57.4%) arrays according to the manufacturers’ protocols (Table S1). Genotype calling and quality control were done in four sets corresponding to data from Affymetrix 5.0 (Sw1), Affymetrix 6.0 (Sw2-4), and the OmniExpress batches (Sw5, Sw6). Genotypes were called using Birdsuite (Affymetrix) or BeadStudio (Illumina). SNP-based quality control filters (subject QC I) were applied to exclude subjects using the following criteria: SNP missingness ≥0.05 (before sample removal); subject missingness ≥0.02; autosomal heterozygosity deviation; and a randomly selected member of any pair of subjects with high relatedness. The relatedness testing was done with PLINK (12) using 39,239 suitable SNPs (Supplemental Methods), and high relatedness was defined as π̂ > 0.2. After quality control, a total of 11,224 subjects (5,001 cases with SCZ and 6,243 controls) remained and were used for subsequent CNV calling and analysis.
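The relatedness step of subject QC I can be sketched as follows. This is a minimal illustration, assuming pairwise π̂ estimates are already available (for example from PLINK); the function name `drop_related`, the input layout, and the greedy pair-by-pair handling are hypothetical choices and are not taken from the study pipeline.

```python
import random

def drop_related(pairs, threshold=0.2, seed=0):
    """Return the set of subject IDs to exclude for high relatedness.

    pairs: iterable of (id_a, id_b, pi_hat) tuples (hypothetical layout).
    For each pair with pi_hat > threshold in which both members are still
    retained, one member is chosen at random and excluded.
    """
    rng = random.Random(seed)
    excluded = set()
    for id_a, id_b, pi_hat in pairs:
        if pi_hat <= threshold:
            continue  # not considered highly related
        if id_a in excluded or id_b in excluded:
            continue  # pair already broken by an earlier exclusion
        excluded.add(rng.choice([id_a, id_b]))
    return excluded

example_pairs = [("s1", "s2", 0.50), ("s2", "s3", 0.25), ("s4", "s5", 0.10)]
to_exclude = drop_related(example_pairs)
```

Processing pairs greedily keeps the sketch short; it does not attempt to minimize the number of excluded subjects in larger related clusters.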

CNV calling

All genomic locations are given in NCBI build 37/UCSC hg19 coordinates. To detect CNVs from Illumina exome arrays, we applied PennCNV (June 2011 version) (13) with customized parameters. PennCNV is a robust method designed for Illumina arrays (14) and was used here to facilitate use of our pipeline by other investigators. Step-by-step instructions on how to use PennCNV for exome arrays, along with the parameters used in this study, are provided in the Supplemental Methods. Key information on the preparation of input parameters is summarized below. PennCNV applies a hidden Markov model (HMM) to segment the normalized and transformed probe intensities (i.e., log R ratios (LRR) and B allele frequencies (BAF)) and incorporates SNP allele frequencies (i.e., population frequency of the B allele (PFB)). Our general approach to identifying customized PennCNV parameters is based on HMM theory and computational experiments designed for optimization. For baseline HMM transition probabilities for probes within 5 kb, we used parameters designed for Illumina GWAS arrays. To account for the skewed distribution of exome array probes, the transition probabilities were set to depend on inter-probe distance (i.e., transitioning to a different copy number state is penalized less when probes are farther apart). For HMM emission probabilities, the initial values were estimated empirically from 12 known large CNVs, through manual examination of corresponding probe intensities on exome arrays (Figure S3) and simple linear interpolation (13). For optimization of emission probabilities, we tested a series of parameter values using a large set of training samples independent of the final subjects used in this study. The PFB values for all SNPs are compiled from a large number of control individuals using the “compile_pfb.pl” script in PennCNV (13). SNPs with a PFB value of 0 are treated as non-polymorphic markers by setting their PFB to 2.
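One way to make HMM transition probabilities depend on inter-probe distance is sketched below. The functional form (an exponential saturation scaled so that probes 5 kb apart reproduce the baseline matrix) and the decay constant `D` are assumptions chosen for illustration only; the parameters actually used in this study are those given in the Supplemental Methods, and `scale_transitions` is a hypothetical helper, not a PennCNV function.

```python
import math

def scale_transitions(baseline, distance_bp, ref_bp=5000.0, D=100000.0):
    """Rescale off-diagonal entries of a baseline transition matrix.

    baseline: square list of lists whose rows sum to 1, taken to apply
        to probes ref_bp apart.
    distance_bp: distance to the next probe in base pairs.
    D: assumed decay constant (illustrative value, not from the paper).
    """
    factor = (1.0 - math.exp(-distance_bp / D)) / (1.0 - math.exp(-ref_bp / D))
    scaled = []
    for i, row in enumerate(baseline):
        # Larger gaps give a larger factor, so state changes cost less.
        off = [p * factor if j != i else 0.0 for j, p in enumerate(row)]
        total_off = sum(off)
        if total_off >= 1.0:
            # Cap the row so it remains a valid probability distribution.
            off = [p * 0.999 / total_off for p in off]
            total_off = sum(off)
        new_row = list(off)
        new_row[i] = 1.0 - total_off
        scaled.append(new_row)
    return scaled

baseline = [[0.98, 0.01, 0.01],
            [0.01, 0.98, 0.01],
            [0.01, 0.01, 0.98]]
far_apart = scale_transitions(baseline, 200000)   # sparse region
close_by = scale_transitions(baseline, 500)       # dense exonic cluster
```

With this form, probes packed into a single exon barely allow a state change, whereas probes separated by long intergenic gaps allow one more readily, which matches the qualitative behavior described above.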

To detect CNVs from GWAS arrays, we applied two established methods, the Birdseye tool in Birdsuite (15) and PennCNV (June 2011 version) (13). Birdseye was initially developed for the Affymetrix platform and was then extended to handle Illumina GWAS arrays (14, 16). PennCNV was initially developed for the Illumina platform and was then extended to handle Affymetrix GWAS arrays (14, 16). Birdseye applies an HMM to normalized probe intensities for each allele, using model priors (i.e., allele-specific probe responses) generated separately for each type of GWAS array. PennCNV applies an HMM to LRR and BAF, using default program parameters recommended for each type of GWAS array. Additional computational details such as input file preparation can be found in Supplementary Methods.

CNV quality control

Quality control steps for initial CNV call sets are described in detail in Supplemental Methods, and were used to remove low confidence CNVs or CNVs overlapping large genomic gaps. Additional subject quality control (subject QC II) was conducted to exclude subjects with high probe intensity variance or excessive CNVs (>95th percentile). As the CNV detection algorithms (PennCNV and Birdseye) were optimized to identify rare CNVs, we extracted autosomal CNVs present in <0.01 of all post-QC subjects.
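The two filters in this paragraph can be illustrated with a short sketch. It assumes that each call has already been assigned a region label that identifies "the same CNV" across subjects, which is a simplification (overlap-based frequency definitions are more usual in practice); `filter_cnv_calls` and its inputs are hypothetical names, and the probe intensity variance filter is not shown.

```python
import math
from collections import Counter

def filter_cnv_calls(subjects, calls, count_percentile=95, max_freq=0.01):
    """Apply two illustrative filters to a CNV call set.

    subjects: list of all subject IDs that entered CNV calling.
    calls: list of (subject_id, region_label) tuples; region_label is a
        hypothetical stand-in for "the same CNV" across subjects.
    """
    if not subjects:
        return []
    per_subject = Counter(subject for subject, _ in calls)
    counts = sorted(per_subject.get(s, 0) for s in subjects)
    # Nearest-rank percentile of the number of CNVs per subject.
    rank = max(1, math.ceil(count_percentile * len(counts) / 100))
    cutoff = counts[rank - 1]
    kept_subjects = {s for s in subjects if per_subject.get(s, 0) <= cutoff}

    # Carrier frequency of each region among retained subjects.
    carriers = {}
    for subject, region in calls:
        if subject in kept_subjects:
            carriers.setdefault(region, set()).add(subject)
    n = len(kept_subjects)
    return [(s, r) for s, r in calls
            if s in kept_subjects and len(carriers[r]) / n < max_freq]
```

Subjects are filtered first so that carrier frequencies are computed among post-QC subjects, as in the text.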

In silico sex-mixing experiments for exome array

For initial evaluation of our CNV calling protocol for the exome array, we simulated heterozygous deletions using a sex-mixing approach (15). After excluding the pseudo-autosomal regions (in order to remove natural copy number variation), we used a total of 5,041 chromosome X probes from the exome array. Among the 9,100 Swedish samples that passed QC, we selected samples with a standard deviation of LRR <0.35 for the chrX probe intensities to form the simulation pool. To estimate specificity, we created 1000 simulated independent samples of normal copy number. In each simulation, a female sample was randomly chosen and the intensities of all its chrX probes were randomly permuted to create a “CNV-free” chromosome. After applying the CNV calling protocol, any CNV detected on such a chromosome was regarded as a “false positive”. To estimate sensitivity, we created 1000 simulated independent samples, each containing heterozygous deletions spanning variable numbers (N) of consecutive probes at known locations. Each simulation was generated as follows: (1) randomly choose a female sample; (2) randomly choose a male sample; (3) randomly permute intensities of all chrX probes to create a “CNV-free” chromosome; (4) replace N consecutive probes from the female sample with the intensities of the corresponding probes from the male sample to create a virtual sample with pseudo-deletions at known locations. After applying the CNV calling protocol, sensitivity was computed as the proportion of “true positive” deletions among all predicted deletions.
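Steps (3) and (4) of the simulation can be sketched as below. This is a minimal illustration that treats per-probe LRR values as the "intensities" and assumes the female and male vectors list the same chrX probes in the same order; `simulate_pseudo_deletion` is a hypothetical helper and a single deletion per sample is a simplification.

```python
import random

def simulate_pseudo_deletion(female_lrr, male_lrr, start, n_probes, seed=0):
    """Build one virtual chrX sample carrying a pseudo-deletion.

    female_lrr, male_lrr: per-probe values for the same ordered probes.
    start, n_probes: index of the first replaced probe and run length N.
    Returns the simulated values and the (start, end) index interval.
    """
    if len(female_lrr) != len(male_lrr):
        raise ValueError("probe vectors must have equal length")
    if start < 0 or n_probes < 1 or start + n_probes > len(female_lrr):
        raise ValueError("deletion interval outside the probe range")
    rng = random.Random(seed)
    background = list(female_lrr)
    rng.shuffle(background)  # permuted "CNV-free" female chromosome
    simulated = background[:]
    # One X copy in the male mimics a heterozygous deletion in a female.
    simulated[start:start + n_probes] = male_lrr[start:start + n_probes]
    return simulated, (start, start + n_probes)

female = [0.01 * i for i in range(100)]
male = [-0.5] * 100
sample, interval = simulate_pseudo_deletion(female, male, start=40, n_probes=10)
```

Running the calling protocol on `sample` and checking whether a deletion is reported within `interval` would give one trial of the sensitivity estimate.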

Analyses of human CNV data

Autosomal CNVs were used for all analyses. Large CNVs with previously reported associations with schizophrenia were specifically examined (16, 18). To explore novel loci from exome arrays, we scanned the genome with single-marker analysis as implemented in PLINK (12). The single-marker analysis examined the breakpoints of all events ≥70 kb in length (an approximation to the size threshold of ≥100 kb commonly used for GWAS datasets (16)), for duplications and deletions separately and combined. Association tests (both region and single-marker) were one-sided, with point-wise and multiple-testing adjusted empirical p values computed using 10,000 permutations of case-control status. Genic CNVs were defined as CNVs that intersect any gene, where the gene model is the maximal transcript for RefSeq mRNA genes, resulting in a total of 23,101 genes. Burden tests were performed on genic CNVs using PLINK (12) for duplications and deletions separately and combined, and for a range of CNV size classes. The rate of genic CNVs was reported by PLINK (12); from this, fold change was computed as the case-control ratio of the genic CNV rate, and it indicates increased risk in disease per CNV rate.
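The logic of a one-sided permutation burden test and of the fold change can be sketched as follows. This is an illustration of the general technique under simple assumptions (one CNV count per subject, label shuffling, add-one empirical p value) and is not PLINK's implementation; `burden_permutation_test` is a hypothetical name.

```python
import random

def burden_permutation_test(case_counts, control_counts, n_perm=10000, seed=0):
    """Fold change and one-sided empirical p for a higher CNV rate in cases.

    case_counts, control_counts: genic CNV count for each subject.
    """
    rng = random.Random(seed)
    n_cases = len(case_counts)
    pooled = list(case_counts) + list(control_counts)
    # With the pooled total fixed, the case total orders the rate ratio.
    observed = sum(case_counts)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute case-control status
        if sum(pooled[:n_cases]) >= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    case_rate = sum(case_counts) / n_cases
    control_rate = sum(control_counts) / len(control_counts)
    fold = case_rate / control_rate if control_rate > 0 else float("inf")
    return fold, p_value

cases = [1] * 30 + [0] * 70
controls = [1] * 10 + [0] * 90
fold, p = burden_permutation_test(cases, controls, n_perm=2000)
```

Multiple-testing adjusted p values would additionally compare each statistic with the maximum statistic across all tests within each permutation; that step is omitted here.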

Results

Human CNV data

Figure 1 summarizes our experimental workflow and CNV datasets. After quality control, there were 3,962 cases with SCZ and 5,138 controls genotyped using exome and GWAS arrays (Affymetrix 5.0 (3.4%), Affymetrix 6.0 (33.3%), and Illumina OmniExpress (63.3%)) (Table S2). The 250K SNPs on Illumina Human Exome BeadChips were derived from exome sequencing of 12,028 European subjects (including ~500 subjects from this study), and met the following criteria: exonic or splice site variant of predicted functionality, minor allele observed a total of ≥3 times, minor allele observed in ≥2 different cohorts, passed sequencing quality control, and high Illumina SNP design scores.

Figure 1. Experimental workflow and CNV datasets.

We considered CNV calls from genome-wide SNP arrays (i.e., GWAS arrays) as a best-practice reference dataset to which we compared CNV calls from exome arrays. CNVs from genome-wide SNP arrays are arguably the most prevalent and impactful source of CNV data to date. CNVs estimated from GWAS data generally have high sensitivity and specificity for large CNVs with substantial probe coverage (14). Three types of GWAS genotyping platforms were used. CNV calls may differ depending on the analytic tools employed (14). If not accounted for, these factors could confound the comparison of exome array CNVs to GWAS array CNVs. Therefore, to control for platform differences, we conducted the comparison using all subjects and using subsets of subjects stratified by GWAS array type. To control for analytic differences, we employed two established methods, Birdseye and PennCNV, for GWAS arrays.

Coverage

We evaluated the coverage of the exome and GWAS arrays across the genome, in genes, and in regions of interest. As expected, the exome array has inferior genome-wide coverage although its genic probe density is considerable (Table S3, Figure 2, Figure S2). With 96% of its probes in genes, the exome array includes at least one SNP in 79% of all genes, comparable to GWAS arrays (81% Affymetrix 6.0, 82% Illumina OmniExpress). Among these 79% of genes, the number of probes per gene ranges from 1 to 849 (median 9 probes), with a mean genic probe density of 3.9 probes/20 kb.

Figure 2. Probe content comparison genome-wide and in genes.

Figure (2a) shows a barplot comparing the number of probes (blue, Y-axis on the left) and the proportion of probes in genes (cyan, Y-axis on the right) between exome and GWAS arrays. Figure (2b) shows a barplot comparing probe density genome-wide (green) and within genes (gold). Within genes, the exome array has a higher mean probe density (3.85 probes/20 kb) than Affymetrix 5.0 (2.93 probes/20 kb) but lower than high-density GWAS chips (12.47 probes/20 kb for Affymetrix 6.0 and 5.23 probes/20 kb for OmniExpress).

The ability to call a CNV using the exome array depends on CNV size and number of probes, and CNVs ≥400 kb with a mean of ≥1 exome probe/20 kb should be detectable (Figure 4). We evaluated exome array coverage in regions of known importance (Table S4) (4, 5): 33/37 autism regions are ≥400 kb (median 338 exome probes); 81/86 developmental delay regions are ≥400 kb (median 268 exome probes); 21/21 CNVs important for psychiatric disorders are ≥400 kb (median 186 exome probes); and 38/42 DECIPHER CNVs are ≥400 kb (median 175 exome probes). Thus, in principle, the exome array should allow detection of many large CNVs of importance in genomic medicine.
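The size and coverage rule of thumb stated above can be written as a small check. This is a sketch under the assumption that probe density is simply the probe count divided by the region length in 20 kb units; the function names and the example inputs are hypothetical.

```python
def probe_density_per_20kb(n_probes, length_bp):
    """Mean number of probes per 20 kb across a region."""
    return n_probes / (length_bp / 20000.0)

def likely_detectable(n_probes, length_bp,
                      min_length_bp=400000, min_density=1.0):
    """True if a region meets both the size and the density threshold."""
    return (length_bp >= min_length_bp
            and probe_density_per_20kb(n_probes, length_bp) >= min_density)

# Hypothetical regions: (exome probes, length in bp)
large_dense = likely_detectable(30, 500000)    # passes both thresholds
large_sparse = likely_detectable(10, 500000)   # 0.4 probes/20 kb, fails
small_dense = likely_detectable(30, 300000)    # below 400 kb, fails
```

A mean density does not guarantee even spacing, so a region that passes this check may still contain long stretches without probes.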

Figure 4. Summary results for comparing CNV calls between exome and GWAS arrays. GWAS array CNVs were used as the reference for all comparisons. Comparisons were stratified by CNV type (all CNVs in black, deletions in red, and duplications in blue) and by size (x-axis). Sensitivity and specificity. (4a) Sensitivity to detect any GWAS CNVs. (4b) Specificity of the exome array CNV dataset to detect any GWAS CNV, estimated by computing the proportion of exome array CNVs overlapping any GWAS CNVs for each size bin of the exome array CNVs. (4c) Sensitivity to detect GWAS CNVs limited to genic CNVs and accounting for probe coverage (intersect ≥1 gene and ≥1 exome array probe/20 kb of its length). (4d) Specificity of the exome array CNV dataset compared to genic CNVs from GWAS arrays. Burden tests. The y-axis shows fold changes for CNV burden of cases versus controls, and the x-axes indicate CNV size bins (total numbers of CNVs per bin in parentheses). (4e) Burden test using genic CNVs from the GWAS dataset. (4f) Burden test using genic CNVs from the exome array dataset. Note that the x-axes stop at the particular bin when the total numbers of CNVs per bin (in parentheses) are comparable between (4e) and (4f), and hence the total numbers of bins displayed in (4e) and (4f) are different.

Initial evaluation of exome array CNVs

Sensitivity and specificity were estimated based on 1000 simulations via in silico sex-mixing experiments. Full results are displayed in Table S5. In summary, sensitivity was 0.998 for deletions including ≥10 exome array probes. As expected, sensitivity was lower for CNVs with poor probe coverage (<1 probe/20 kb) or from noisier arrays (e.g., standard deviation of log-R ratio >0.35). Specificity was 0.997 for CNVs with confidence scores ≥10. These in silico simulation results suggested appropriate quality control thresholds that were implemented in subsequent analyses.

Exome versus GWAS arrays

We created stringent CNV callsets on the same subjects using exome array (using PennCNV) and GWAS genotyping (using Birdseye and PennCNV). The CNV callsets are compared in Table S6, Figure S4, and Figure S5. CNVs predicted from exome arrays are smaller than those from GWAS arrays. The total length of CNVs per subject from the exome array was ~60% of that from genic GWAS CNVs. This result was expected based on our analyses of exome array coverage (see Figure 3, Figure S3 for CNV examples). Although GWAS and exome arrays both detected more duplications than deletions, the relative proportion of duplications was even greater from the exome array. This could result from technical differences (e.g., in probe design, CNV calling algorithms, or quality control), but it is possible that genic duplications reduce fitness less than genic deletions. As shown in Table S6, the relative proportion of duplications was more comparable after controlling for differences in probe design. In Figure S5, we compared CNV size distributions for exome arrays and for GWAS arrays stratified by genotyping batch (Sw1 through Sw6), array type (Affymetrix 5.0, Affymetrix 6.0, Illumina OmniExpress), and CNV calling method (Birdseye, PennCNV). For genome-wide CNVs, GWAS array datasets showed notable variability between batches, which was primarily due to differences in array type and secondarily due to differences in calling method. After restricting to genic CNVs, GWAS array datasets were more comparable across strata. In contrast, the exome array dataset showed consistency across strata.

Figure 3. Example CNVs detected by the exome array in 16p11.2 and 22q11.2.

In each sub-figure (3a:16p11.2; 3b:22q11.2), the X-axis indicates genomic position of exome array probes and the Y-axis indicates the values of LRR (top panel) or BAF (bottom panel). Red vertical lines: CNV boundaries predicted from GWAS chips. Blue dots: exome array probes involved in 16p11.2 duplication (3a). Red dots: exome array probes involved in 22q11.2 deletion (3b). Black dots: probes of normal copy. Additional examples are provided in Figure S3.

We contrasted GWAS to exome array CNVs. As discussed above, we considered GWAS array CNVs as the best-practice reference for estimating sensitivity and specificity (computational details are described in Supplementary Methods, Figure S1). We estimated sensitivity by computing the proportion of GWAS CNVs captured by the exome array and specificity by the proportion of exome array CNVs overlapping with any GWAS CNV. We did this for all CNVs and after stratifying by size and deletion/duplication type. First we conducted a full-sample analysis using the 9,100 subjects that passed all QC steps and had both GWAS and exome array CNV calls. We then stratified the CNV data for 307 (3.4%) subjects genotyped using Affymetrix 5.0, 3,030 (33.3%) using Affymetrix 6.0, and 5,736 (63.3%) subjects using Illumina OmniExpress. In sum, a total of eight comparisons (listed in Table S7) were conducted to control for differences in array type and CNV calling method.
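The two proportions defined in this paragraph can be sketched with a simple overlap rule. The sketch assumes that calls are tuples of subject, chromosome, start, and end (half-open coordinates) and that any overlap of at least 1 bp in the same subject counts as a match; the matching criteria actually used are those in the Supplementary Methods, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two calls share subject, chromosome and at least 1 bp."""
    return a[0] == b[0] and a[1] == b[1] and a[2] < b[3] and b[2] < a[3]

def sensitivity_specificity(reference_calls, test_calls):
    """Proportions as defined in the text, with GWAS calls as reference.

    sensitivity: share of reference calls overlapped by any test call.
    specificity: share of test calls overlapping any reference call.
    """
    captured = sum(any(overlaps(r, t) for t in test_calls)
                   for r in reference_calls)
    confirmed = sum(any(overlaps(t, r) for r in reference_calls)
                    for t in test_calls)
    sens = captured / len(reference_calls) if reference_calls else float("nan")
    spec = confirmed / len(test_calls) if test_calls else float("nan")
    return sens, spec

gwas = [("s1", "16", 100, 500), ("s2", "22", 100, 900)]
exome = [("s1", "16", 200, 450), ("s3", "1", 10, 20), ("s1", "16", 600, 700)]
sens, spec = sensitivity_specificity(gwas, exome)
```

Stratified estimates follow by first subsetting both call sets by size bin or by deletion/duplication type and then applying the same function.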

Figure 4 displays representative results based on the full sample and comparing exome array CNVs to GWAS CNVs produced by Birdseye. Figure 4a shows the relatively poor sensitivity of the exome array for detecting GWAS CNVs genome-wide, a result anticipated from probe coverage. However, as expected, detection of genic CNVs is more sensitive. About 60% of GWAS CNVs intersect at least one gene and, of these, 46% had good exome probe coverage (i.e., ≥1 probe/20 kb). Figures 4c and 4d compare genic CNVs between GWAS and exome arrays, and show considerably higher sensitivity and higher specificity in comparison to the corresponding genome-wide estimates. For genic CNVs ≥400 kb with good probe coverage, the sensitivity to detect deletions was ~0.95 and specificity 0.68; for duplications, sensitivity and specificity were ~0.8. For genic CNVs ≥900 kb, sensitivity and specificity for deletions and duplications were 0.90–0.95. To evaluate the impact of the CNV frequency filter (i.e., selecting CNVs present in <0.01 of all subjects), we repeated the analysis by filtering in one dataset and comparing to all calls in the second dataset, and observed identical results for CNVs ≥400 kb and a 1% increase in sensitivity and specificity for CNVs <400 kb.

Table S7 summarizes sensitivity and specificity estimated for genic CNVs ≥400 kb from all eight comparisons. Figures S6-S9 display sensitivity and specificity estimated for all CNV sizes using combinations of 2 subsamples (Sw2-4 genotyped on Affymetrix 6.0 or Sw5-6 genotyped on Illumina OmniExpress) and 2 calling methods for GWAS arrays (Birdseye or PennCNV). Sw1 was excluded because of its insufficient sample size (N = 307, 3.4%). Between types of GWAS arrays, we observed ~3–5% variability in sensitivity estimates and ~3–14% variability in specificity estimates. Between CNV calling methods within array type, we observed 2–4% variability in sensitivity estimates and 1–7% variability in specificity estimates. The best estimates were obtained using GWAS CNVs derived from Sw5-6 (Illumina) and PennCNV, where both platform and calling method were matched most closely between the exome array and the GWAS reference. In summary, the capability of the exome array for CNV calling is robust. It detects large genic CNVs with at least ~80% sensitivity and specificity for CNVs ≥400 kb and ~90% sensitivity and specificity for CNVs ≥900 kb.

Exome array versus exome sequencing CNVs

Fromer et al. recently developed XHMM, a method to call CNVs from exome sequencing read-depth data (17). As XHMM was developed using sequencing from ~1,000 Swedish subjects on whom we also had exome array CNV calls, we repeated our analyses to compare CNVs from exome arrays to exome sequencing. We observed high agreement for genic CNVs ≥300 kb (~81% sensitivity and ~71% specificity).

Known CNVs

Several large CNVs increase risk for schizophrenia (5), and are known to be present in these samples (16, 18). We therefore studied the ability of the exome array to detect these CNVs (Table 1). Exome array CNVs had very high agreement with GWAS CNVs, as 35/36 were correctly identified. One 16p11.2 duplication was missed due to sample-specific noise as measured by probe-intensity variance, and the post facto use of more liberal emission parameters rescued this CNV while compromising specificity. Given their reliability, exome array CNVs were used to perform regional association tests with statistical significance established empirically. Two loci were significantly enriched in SCZ subjects: duplications at the 16p11.2 locus (9 cases, 1 control; P = 0.0033, multiple-testing adjusted P = 0.0033); and deletions at the 22q11.2 locus (7 cases, 1 control with partial deletion; P = 0.015, multiple-testing adjusted P = 0.034).
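As a rough cross-check on the scale of the point-wise P values, the carrier counts can be evaluated with an exact hypergeometric tail, which is the analytic counterpart of shuffling case-control labels for a single locus. This is not the procedure used in the study (the empirical values come from permutation in PLINK); it is a sketch that assumes the 3,962 cases and 5,138 controls of the full sample and counts the control with a partial 22q11.2 deletion as a carrier.

```python
from math import comb

def one_sided_carrier_p(case_carriers, control_carriers, n_cases, n_controls):
    """P(at least case_carriers of all carriers are cases) under random labels."""
    carriers = case_carriers + control_carriers
    total = n_cases + n_controls
    denom = comb(total, carriers)
    tail = sum(comb(n_cases, k) * comb(n_controls, carriers - k)
               for k in range(case_carriers, min(carriers, n_cases) + 1))
    return tail / denom

# Carrier counts quoted in the text; sample sizes from the full sample.
p_16p11 = one_sided_carrier_p(9, 1, 3962, 5138)
p_22q11 = one_sided_carrier_p(7, 1, 3962, 5138)
print(f"16p11.2 duplications: {p_16p11:.4f}")
print(f"22q11.2 deletions:    {p_22q11:.4f}")
```

Both tails come out on the same order as the reported point-wise values, which is what would be expected if the permutation test is driven mainly by the carrier counts at each locus.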

Table 1. Exome versus GWAS array calls for CNVs of known importance in schizophrenia.

| Ref. | CNV | Location (Mb) | Kbp | Exome probes | % length covered | Kbp per probe | GWAS: total CNVs | Exome array: total CNVs | Exome array: true discovery | Exome array: false discovery |
|---|---|---|---|---|---|---|---|---|---|---|
| (2, 16, 21) | DEL 1q21.1-q21.2 | chr1:145.8-147.8 | 1,945 | 108 | 47% | 18.01 | 8 | 8 | 8 | 0 |
| (2) | DEL 3q29 | chr3:195.7-197.3 | 1,630 | 253 | 91% | 6.44 | 4 | 4 | 4 | 0 |
| (2, 16) | DEL 15q13.2-q13.3 | chr15:30.9-32.5 | 1,600 | 161 | 97% | 18.63 | 5 | 5 | 5 | 0 |
| (2, 22) | DUP 16p11.2 | chr16:29.5-30.3 | 460 | 163 | 77% | 3.68 | 11 | 10 | 10 | 0 |
| (2, 16) | DEL 22q11.2 | chr22:18.7-21.8 | 3,150 | 498 | 92% | 6.33 | 8 | 8 | 8 | 0 |

Note: A total of 9 loci were examined. Four loci are not included in Table 1: 17q12 and 22q12.2 (no CNV was detected from either GWAS or exome arrays (100% agreement)); 2p16.3 and 7q36.3 (insufficient exome array coverage).

Novel loci

A novel nominal association was detected using the exome arrays at 11q12.2 (P = 0.0069, multiple-testing adjusted P = 0.18). Six deletions in cases with SCZ were detected with the smallest common region spanning chr11:60531180-60620982 (Table S8). All six deletions were also detected by GWAS arrays with the smallest common region spanning chr11:60547604-60624496 (Table S8). Probe intensities are displayed in Figures S11-S12, confirming the accuracy of these deletions. No deletion was detected in controls.

Burden tests

Several studies reported a greater burden of rare CNVs in schizophrenia cases versus controls, particularly for genic CNVs (16, 19, 20). Burden tests were performed with respect to type and size of the genic CNVs. Figure 4e depicts the results for GWAS CNVs that were based on 9,100 subjects and the Birdseye algorithm: for genic CNVs ≥100 kb, cases with schizophrenia had 1.12x more CNVs than controls (1.13x for deletions and 1.11x for duplications). As in our prior report (16), deletions and duplications had somewhat different profiles: the association of deletion burden increased more noticeably with respect to CNV size while duplications showed a more gradual increase. Figure 4f shows burden results for exome array CNVs. Overall, these results also showed greater CNV burden in cases compared to controls, particularly for larger deletions. Intriguingly, the fold change estimates for the exome array CNVs were larger than for GWAS arrays (e.g., for exome array CNVs ≥400 kb, cases had a 1.53x increased burden overall, 2.22x for deletions, and 1.36x for duplications). The burden estimates for GWAS CNVs ≥400 kb were considerably lower and did not reach similar fold change values until ≥900 kb (1.54x, 2.19x, and 1.31x for overall, deletions, and duplications). Similar results were obtained using subsamples and different calling methods for GWAS arrays (Figures S6(e, f) – S9(e, f)).
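The fold change by size class can be illustrated with a short sketch. It assumes a flat list of genic CNVs, each tagged with case status and length in kb, and defines the rate as CNVs per subject; `fold_change_by_size` and the example values are hypothetical and do not reproduce the study data.

```python
def fold_change_by_size(cnvs, n_cases, n_controls, thresholds_kb):
    """Case-control ratio of genic CNV rate for each minimum size.

    cnvs: iterable of (is_case, length_kb) tuples, one per genic CNV.
    Returns {threshold_kb: fold change}, with inf if no control CNV remains.
    """
    cnvs = list(cnvs)
    result = {}
    for threshold in thresholds_kb:
        case_n = sum(1 for is_case, kb in cnvs if is_case and kb >= threshold)
        ctrl_n = sum(1 for is_case, kb in cnvs
                     if not is_case and kb >= threshold)
        case_rate = case_n / n_cases
        ctrl_rate = ctrl_n / n_controls
        result[threshold] = (case_rate / ctrl_rate
                             if ctrl_rate > 0 else float("inf"))
    return result

example = [(True, 450), (True, 950), (True, 120),
           (False, 500), (False, 150), (False, 130)]
folds = fold_change_by_size(example, n_cases=100, n_controls=100,
                            thresholds_kb=[100, 400, 900])
```

Because the bins are cumulative (≥ threshold), estimates at larger thresholds rest on fewer events and are correspondingly less stable.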

The higher fold change in genic CNV rate observed from exome arrays can be attributed to several technical reasons. The primary reason was that relatively uniform sample processing reduced experimental variability. Figure S10 compares the distribution of the standard deviation of normalized and transformed probe intensities (i.e., LRR_SD) between GWAS arrays and exome arrays. LRR_SD was used as a surrogate measure of experimental noise. GWAS arrays showed experimental variability between genotyping batches and platforms. In contrast, exome arrays showed consistency for the entire cohort, as all samples were scanned on a common platform in the same facility within a short time window. An additional explanation is the difference in probe design between types of GWAS arrays and between GWAS and exome arrays. We conducted burden tests using the rate of genic CNVs and adjusted for the shorter CNV length from exome arrays. These strategies accounted for most of the difference in probe design between GWAS and exome arrays, but a residual effect remains.
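The noise comparison described here can be sketched by computing LRR_SD per sample and summarizing it per batch. The input layout and the use of the median as the batch summary are assumptions made for illustration; `lrr_sd_by_batch` is a hypothetical helper.

```python
from statistics import median, stdev

def lrr_sd_by_batch(samples):
    """Median per-sample LRR standard deviation for each batch.

    samples: iterable of (batch_label, lrr_values) with at least two
        LRR values per sample.
    """
    per_batch = {}
    for batch, lrr_values in samples:
        per_batch.setdefault(batch, []).append(stdev(lrr_values))
    return {batch: median(sds) for batch, sds in per_batch.items()}

# Toy values only; batch labels follow the naming used in the text.
toy = [("Sw1", [0.0, 0.4, -0.4, 0.2, -0.2]),
       ("Sw1", [0.1, 0.5, -0.3, 0.3, -0.1]),
       ("Sw5", [0.0, 0.1, -0.1, 0.05, -0.05])]
summary = lrr_sd_by_batch(toy)
```

A narrow spread of these summaries across batches would correspond to the consistency reported for the exome arrays.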

Discussion

In a large Swedish schizophrenia case/control sample, we demonstrated that the Illumina Human Exome BeadChip can be used to detect large genic CNVs with ~80% sensitivity and specificity for CNVs ≥400 kb and ~90% sensitivity and specificity for CNVs ≥900 kb. CNVs known to be important to schizophrenia were genotyped with high accuracy. The data strongly confirm the association of schizophrenia with 16p11.2 duplications and 22q11.2 deletions. A greater burden of genic CNVs in schizophrenia cases was detected, and the exome array yielded higher burden estimates and greater significance levels than the GWAS arrays.

The relative homogeneity of this Swedish sample is one of its main strengths. The quality of a CNV call set depends on multiple factors including coverage, resolution, the realized signal:noise ratios, and the performance of the detection algorithms (14, 16). Our comparative results are robust and generalizable given the use of two different CNV calling algorithms and two mainstay GWAS genotyping platforms. Other computational methods should be explored for CNV detection from the exome array, and the use of multiple computational methods may be beneficial (14, 16). An important consideration is that many exome array SNPs are monomorphic, so allelic ratios or relative intensities do not contribute to CNV detection. Thus, the power to detect CNVs mainly depends on the total intensities, and primarily reflects CNV size.

We suggest that future work should evaluate smaller exome array CNVs carefully, as some proportion might be true CNVs potentially important to schizophrenia but undetected using GWAS arrays. The smallest CNVs (7.8% <1 kb, 19% 1–5 kb, and 11% 5–10 kb) often span a large number of exome array probes. Based on our simulation study, some small deletions spanning ≥10 probes might be real but missed by the GWAS arrays. However, some of these small CNVs might result from aberrant probe intensities due to technical bias such as rare alleles on the exome array. These results could be combined with high-throughput sequencing data to identify CNVs and establish precise CNV boundaries.

Novel, nominally associated 90 kb deletions were observed at 11q12.2 using exome arrays. This result was based on an exploratory analysis using the 9,100 subjects that had both exome and GWAS array data. Comprehensive CNV-disease association analyses using this Swedish sample are reported elsewhere.

In conclusion, our results suggest the utility of exome-focused arrays in surveying large genic CNVs in very large samples. Many large cohorts for psychiatric and non-psychiatric diseases are being scanned on this common platform. The widespread use of exome arrays suggests new opportunities such as conducting well-powered CNV assessment and comparisons between different diseases. Such cross-disorder CNV meta-analysis will facilitate our understanding of the contributions of large genic CNVs to psychiatric diseases by establishing precise estimates of their effect size and quantifying their phenotypic impact. The use of a single platform minimizes potential confounding factors that could impact accurate detection.

Supplementary Material

Additional Supplemental info

Acknowledgments

Funding was from K01 MH093517 (JPS), R01 MH077139 (PFS), the Stanley Center for Psychiatric Research, the Karolinska Institutet, Karolinska University Hospital, the Swedish Research Council, an ALF grant from Swedish County Council, the Söderström Königska Foundation, and the Sylvan Herman Foundation. The funders had no role in study design, execution, data collection and analysis, decision to publish, or preparation of the manuscript.

This study makes use of data generated by the DECIPHER Consortium. A full list of centers who contributed to the generation of the data is available from http://decipher.sanger.ac.uk or from decipher@sanger.ac.uk. Funding for the project was provided by the Wellcome Trust.

The authors thank two anonymous reviewers for their helpful comments. All authors reviewed and approved the final version of the manuscript. The corresponding authors had access to the full dataset.

Footnotes

Conflict of Interest

Dr Sullivan was on the scientific advisory board (SAB) of Expression Analysis (Durham, NC, USA). The other authors report no financial conflicts of interest.

Supplementary Information

Supplementary Information accompanies the paper on the Molecular Psychiatry website including Supplemental methods, results, and references. The supplemental methods include a detailed description of how to obtain CNV calls from exome arrays.

References

  • 1. Stankiewicz P, Lupski JR. Structural variation in the human genome and its role in disease. Annu Rev Med. 2010;61:437–455. doi: 10.1146/annurev-med-100708-204735.
  • 2. Levinson DF, Duan J, Oh S, Wang K, Sanders AR, Shi J, et al. Copy number variants in schizophrenia: Confirmation of five previous findings and new evidence for 3q29 microdeletions and VIPR2 duplications. Am J Psychiatry. 2011 Feb 1;168:302–316. doi: 10.1176/appi.ajp.2010.10060876.
  • 3. Sanders SJ, Ercan-Sencicek AG, Hus V, Luo R, Murtha MT, Moreno-De-Luca D, et al. Multiple Recurrent De Novo CNVs, Including Duplications of the 7q11.23 Williams Syndrome Region, Are Strongly Associated with Autism. Neuron. 2011 Jun 9;70(5):863–885. doi: 10.1016/j.neuron.2011.05.002.
  • 4. Cooper GM, Coe BP, Girirajan S, Rosenfeld JA, Vu TH, Baker C, et al. A copy number variation morbidity map of developmental delay. Nature Genetics. 2011 Sep;43(9):838–846. doi: 10.1038/ng.909.
  • 5. Sullivan PF, Daly MJ, O'Donovan M. Genetic architectures of psychiatric disorders: the emerging picture and its implications. Nature Reviews Genetics. 2012;13:537–551. doi: 10.1038/nrg3240.
  • 6. Weiss LA, Shen Y, Korn JM, Arking DE, Miller DT, Fossdal R, et al. Association between microdeletion and microduplication at 16p11.2 and autism. N Engl J Med. 2008 Feb 14;358(7):667–675. doi: 10.1056/NEJMoa075974.
  • 7. McCarthy SE, Makarov V, Kirov G, Addington AM, McClellan J, Yoon S, et al. Microduplications of 16p11.2 are associated with schizophrenia. Nat Genet. 2009 Nov;41(11):1223–1227. doi: 10.1038/ng.474.
  • 8. Jacquemont S, Reymond A, Zufferey F, Harewood L, Walters RG, Kutalik Z, et al. Mirror extreme BMI phenotypes associated with gene dosage at the chromosome 16p11.2 locus. Nature. 2011 Oct 6;478(7367):97–102. doi: 10.1038/nature10406.
  • 9. Walters RG, Jacquemont S, Valsesia A, de Smith AJ, Martinet D, Andersson J, et al. A new highly penetrant form of obesity due to deletions on chromosome 16p11.2. Nature. 2010 Feb 4;463(7281):671–675. doi: 10.1038/nature08727.
  • 10. O'Dushlaine C, Ripke S, Chambert K, Moran JL, Kähler A, Akterin S, et al. Genome-Wide Association Of Schizophrenia. Sweden: Submitted.
  • 11. Goldstein JI, Crenshaw A, Carey J, Grant G, Maguire J, Fromer M, et al. zCall: a rare variant caller for array-based genotyping. Bioinformatics. 2012;28:2543–2545. doi: 10.1093/bioinformatics/bts479.
  • 12. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D, et al. PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics. 2007;81:559–575. doi: 10.1086/519795.
  • 13. Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007 Nov;17(11):1665–1674. doi: 10.1101/gr.6861907.
  • 14. Pinto D, Darvishi K, Shi X, Rajan D, Rigler D, Fitzgerald T, et al. Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nature Biotechnology. 2011 Jun;29(6):512–520. doi: 10.1038/nbt.1852.
  • 15. Korn JM, Kuruvilla FG, McCarroll SA, Wysoker A, Nemesh J, Cawley S, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet. 2008 Oct;40(10):1253–1260. doi: 10.1038/ng.237.
  • 16. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature. 2008;455:237–241. doi: 10.1038/nature07239.
  • 17. Fromer M, Moran JL, Chambert K, Banks E, Bergen SE, Ruderfer DM, et al. Discovery and statistical genotyping of copy number variation from whole-exome sequencing depth. American Journal of Human Genetics. 2012;91:597–607. doi: 10.1016/j.ajhg.2012.08.005.
  • 18. Bergen SE, O'Dushlaine CT, Ripke S, Lee PH, Ruderfer D, Akterin S, et al. Genome-wide association study in a Swedish population yields support for greater CNV and MHC involvement in schizophrenia compared to bipolar disorder. Molecular Psychiatry. 2012;17:880–886. doi: 10.1038/mp.2012.73.
  • 19. Xu B, Roos JL, Levy S, van Rensburg EJ, Gogos JA, Karayiorgou M. Strong association of de novo copy number mutations with sporadic schizophrenia. Nat Genet. 2008 May 30; doi: 10.1038/ng.162.
  • 20. Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper GM, et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science. 2008 Mar 27;320:539–543. doi: 10.1126/science.1155174.
  • 21. Stefansson H, Rujescu D, Cichon S, Pietilainen OP, Ingason A, Steinberg S, et al. Large recurrent microdeletions associated with schizophrenia. Nature. 2008;455:232–236. doi: 10.1038/nature07229.
  • 22. Vacic V, McCarthy S, Malhotra D, Murray F, Chou HH, Peoples A, et al. Duplications of the neuropeptide receptor gene VIPR2 confer significant risk for schizophrenia. Nature. 2011 Mar 24;471(7339):499–503. doi: 10.1038/nature09884.
