Abstract
Purpose
Chromosomal microarray analysis to assess copy number variation (CNV) has become a first-tier genetic diagnostic test for individuals with unexplained neurodevelopmental disorders (NDD) or multiple congenital anomalies (MCA). Over 100 cytogenetic labs worldwide use the new ultra-high resolution Affymetrix CytoScan-HD array to genotype hundreds of thousands of samples per year. Our aim was to develop a CNV resource from a new population sample, which would enable more accurate interpretation of clinical genetics data on this microarray platform and others.
Methods
Genotyping of 1,000 adult volunteers who are broadly representative of the Ontario population (as obtained from the Ontario Population Genomics Platform) was performed with the CytoScan-HD microarray system, which has 2.7 million probes. Four independent algorithms were applied to detect CNVs. Reproducibility and validation metrics were quantified using sample replicates and quantitative-PCR, respectively.
Results
DNA from 873 individuals passed quality control and we identified 71,178 CNVs (81 CNVs/individual); 9.8% (6,984) of these CNVs were previously unreported. After applying three layers of filtering criteria, from our highest confidence CNVs dataset we obtained >95% reproducibility and >90% validation rate (73% of these CNVs overlapped at least one gene).
Conclusion
The genotype data and annotated CNVs for this largely Caucasian population will represent a valuable public resource enabling clinical genetics research and diagnostics.
Keywords: Copy Number Variation (CNV), CytoScan-HD, Microarray, Neurodevelopmental Disorders (NDD), Congenital Abnormalities (MCA)
INTRODUCTION
Copy number variations (CNVs) constitute an abundant form of genetic variation and are increasingly being linked to genetic and phenotypic diversity, as well as disease1-4. A wealth of literature supports a significant role for CNVs in neurodevelopmental disorders (NDDs) and multiple congenital abnormalities (MCA)5,6. For example, a review of 33 published studies by The International Standard Cytogenomic Array Consortium (ISCA) showed that ~12% of NDD cases can be explained by a CNV7. Recent studies of autism spectrum disorders (ASD) show a clinical yield in which at least 5-15% of cases can be explained by CNVs that are either de novo or rare and inherited8-10. Since most characterized penetrant CNVs are inherently rare, population-scale analyses are often required to assess relative disease risk and to elucidate the potential etiologic role of genetic events currently classified as “variants of unknown significance” (or VOUS)7.
The detection of CNVs in the clinical diagnostic setting is now largely based on an initial scan of the genome using microarrays to search for unbalanced alterations7,9,10. Locus, gene, and even exon-specific quantitative assays are also now used when a specific hypothesis is being pursued (e.g. when clinical assessment suggests a particular disease gene/mutation). In both instances, knowing the full spectrum of allelic architecture is necessary to make accurate clinical interpretations11. For these reasons, newer microarrays are being developed that contain dense probe content to allow robust testing for single nucleotide polymorphism (SNP) genotypes and CNV detection. Dense SNP coverage allows zygosity testing, including assessment of uniparental disomy, and also sub-population structure analysis.
Recently, Affymetrix Corporation developed an array (CytoScan-HD) that consists of 2.7 million (M) probes. While these cover the entire genome, the densest representation is within genes and even denser in known OMIM genes. In a recent study, high resolution array assays in a small cohort of ASD and intellectual disability (ID) samples showed higher diagnostic yields and the capability to detect clinically relevant, smaller CNVs12. Moreover, in North America alone, over 100 cytogenetic labs are now using the CytoScan-HD platform for both constitutional and cancer DNA testing. Recently, the CytoScan-Dx assay (the clinical name for the equivalent CytoScan-HD array) obtained FDA clearance for use as a post-natal test for NDD or MCA cases.
A large control series that is broadly representative of the underlying population and genotyped on the identical technology platform provides the ideal situation for CNV calling13. Surprisingly, in the Database of Genomic Variants (DGV)14, which is the standard resource used for CNV comparisons, only 44 population datasets from 55 studies are represented. Moreover, for these important studies, 41 different technology platforms have been used, and none of the data are yet derived from the CytoScan-HD array. Here we genotype population-based samples from adult volunteers in the Ontario Population Genomics Platform (OPGP), using the CytoScan-HD array to generate the first such publicly available population dataset. DNA and cell lines from this unique biological resource are also available for additional studies.
MATERIALS AND METHODS
The study was performed with direct participant consent and the approval of the Research Ethics Boards at The Hospital for Sick Children and Mount Sinai Hospital, Toronto (studies 1000008876 and 06-0014-E, respectively).
OPGP Sample Collection
The OPGP consists of data and biospecimens collected from 2,690 adult volunteers from across Ontario, for whom recruitment was done in two phases (see details for overall OPGP in Supplemental Section 1 and Tables S1-3). Participants were first recruited through collaborations with the Ontario Familial Breast Cancer Registry (OFBCR)15 and the Ontario Familial Colorectal Cancer Registry (OFCCR),16 which are research resources used by international consortia in large studies of familial breast (OFBCR) and colorectal (OFCCR) cancers. These registries contain previously collected data and biospecimens from cancer patients, family members, plus individuals from the general population who serve as controls. Population controls from these two registries were contacted and invited to participate in the OPGP and to re-consent so their previously collected data and biospecimens could be accessed by the OPGP. Re-consent was requested from a total of 1,886 controls from these registries, resulting in 1,462 controls being included in the OPGP (903 from the colorectal cancer registry and 559 from the breast cancer registry, for a 78% re-consent rate – See Supplementary Table S1).
In the second phase, adult (ages 20-79 years) volunteers residing across Ontario were recruited through a survey research process that involved random sampling from telephone directories, mailed introductions, and a combination of telephone interview and mailed questionnaires. Consenting individuals were mailed a package containing an explanatory letter, consent forms with a pre-paid return envelope, and a blood kit for collection at a clinical laboratory in their community. The blood sample was sent via courier to the biospecimen repository at The Centre for Applied Genomics (TCAG) at The Hospital for Sick Children for transformation and DNA preparation (see details in Supplemental section 2). Of the 3,519 who completed the initial survey research process, blood sample collection kits were sent to all who consented (n=2,074), among whom 1,228 (overall participation rate = 35%) returned both the specimen and signed consent form, and were included in the OPGP.
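The recruitment figures reported for the two phases can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the counts are taken from the text above, while the variable names are illustrative and not part of the study.

```python
# Recruitment counts as reported in the text (variable names are illustrative).
phase1_invited = 1886      # registry controls asked to re-consent
phase1_colorectal = 903    # re-consented, colorectal cancer registry
phase1_breast = 559        # re-consented, breast cancer registry
phase2_surveyed = 3519     # completed the initial survey research process
phase2_enrolled = 1228     # returned both specimen and signed consent form

phase1_enrolled = phase1_colorectal + phase1_breast
reconsent_rate = phase1_enrolled / phase1_invited
participation_rate = phase2_enrolled / phase2_surveyed
total_opgp = phase1_enrolled + phase2_enrolled

print(f"Phase 1 enrolled: {phase1_enrolled} ({reconsent_rate:.1%} re-consent)")
print(f"Phase 2 enrolled: {phase2_enrolled} ({participation_rate:.1%} participation)")
print(f"Total OPGP participants: {total_opgp}")
```

To the nearest percent, the two rates agree with the reported 78% and 35%, and the two phases sum to the 2,690 participants stated for the OPGP.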
DNA Genotyping, CNV Analysis and Quality Control
DNA was genotyped (see Supplementary section 2) using the CytoScan-HD array following the manufacturer’s protocol. The array consists of 2,696,550 probes that include 743,304 SNPs and 1,953,246 non-polymorphic probes. The average probe spacing for RefSeq genes is 880bp and 96% of genes are represented. For this analysis, we have genotyped 1,000 samples and after extensive quality control (see Supplementary section 3) the OPGP subset for whom genotyping results are reported consists of 873 individuals and 22 sample replicates.
To achieve comprehensive CNV detection, we used four separate algorithms: Affymetrix Chromosome Analysis Suite (ChAS), iPattern, Nexus and Partek. ChAS is the algorithm designed for use in clinical cytogenetic laboratories. Details of the other programs are found in Supplemental section 4. Our primary analysis was based on ChAS CNV calls, which were then supported using the remaining three algorithms to construct a set of high-confidence CNVs13. For all algorithms, we used 8 probes and >1kb as a minimum cutoff. Raw data from CNV genotyping are available in the NCBI Gene Expression Omnibus (GEO) database under accession GSE59150 and the CNV calls can be downloaded from: http://www.tcag.ca/documents/projects/opgp873_chas.8p_1kb_one_replicates.txt.
Ancestry Inference
To infer ancestry of the OPGP samples, we used 1,257 HapMap III samples (547,362 common SNPs with >95% call rates) as reference for 11 ethnically diverse populations17 (see details in Supplemental section 5).
Experimental CNV Validation and Reproducibility
To examine the accuracy of our CNV calls we used two different approaches. First, we used a CNV data set from 345 OPGP samples previously genotyped18 using the lower-resolution Affymetrix Genome-Wide Human SNP 6.0 array. Second, we randomly selected 12 CNV regions in different size bins (ranging from 1.5kb to 2.8Mb) and experimentally validated them using quantitative-PCR (qPCR). Each qPCR assay was performed in triplicate, both for the test region and for control regions well established to be diploid. The ratio of the average value for the test region to that for the control region had to be >1.4 or <0.7 in order for the CNV to be confirmed as a copy number gain (duplication) or copy number loss (deletion), respectively. In addition, the standard error of the ratio had to be <1.0 on the same scale in order for the assay to be considered reliable. To measure reproducibility of the microarray assay we compared CNV calls from 22 randomly selected samples tested in replicate.
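The qPCR confirmation rule described above can be expressed as a short function. This is a minimal sketch, assuming that a relative quantity is already available for each replicate well; the function name `classify_qpcr`, the returned labels, and the `ratio_se` argument are illustrative. The text does not state how the standard error of the ratio was calculated, so the sketch takes it as a value supplied by the caller rather than computing it.

```python
from statistics import mean

GAIN_RATIO = 1.4     # test/control ratio above this confirms a gain
LOSS_RATIO = 0.7     # test/control ratio below this confirms a loss
MAX_RATIO_SE = 1.0   # standard error must be below this for a reliable assay


def classify_qpcr(test_values, control_values, ratio_se=None):
    """Classify one qPCR assay from replicate measurements.

    test_values / control_values: relative quantities from the replicate
    wells (triplicates in the study). ratio_se: optional, precomputed
    standard error of the ratio (its derivation is not given in the text).
    Returns (ratio, label).
    """
    if not test_values or not control_values:
        raise ValueError("replicate measurements are required")
    ratio = mean(test_values) / mean(control_values)
    if ratio_se is not None and ratio_se >= MAX_RATIO_SE:
        return ratio, "unreliable"
    if ratio > GAIN_RATIO:
        return ratio, "gain"
    if ratio < LOSS_RATIO:
        return ratio, "loss"
    return ratio, "not confirmed"


# Hypothetical triplicate values for a heterozygous deletion.
ratio, call = classify_qpcr([0.52, 0.49, 0.55], [1.01, 0.98, 1.03])
print(f"ratio={ratio:.2f} call={call}")
```

Ratios between 0.7 and 1.4 are reported as "not confirmed" rather than as evidence of a diploid state, which mirrors the way the text treats the thresholds as confirmation criteria only.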
RESULTS
Ancestry Determination
The majority of individuals in the OPGP cohort (95%) self-reported as being of European descent (Figure 1A), with the remaining 5% coming from African, Chinese, First Nations, Middle Eastern, South Asian, and South American backgrounds. Our inferred ancestry analysis using SNP genotypes shows strong concordance with the self-reported ancestry. The MDS plot (Figure 1B) shows that the majority of the genotyped OPGP subset was highly clustered with the HapMap CEU population. The detected ancestry from PLINK analysis (Figure 1C) is also highly correlated: 94% Caucasian, 3.11% South American, 1.91% Asian and 1% from an admixed population.
CNV Distribution and Reproducibility
After strict QC, our final dataset consisted of CNVs from 873 unrelated individuals (477 male and 396 female; mean age 58 years) (Supplementary Figure S1 and see demographics in Table S3). Since ChAS is the typical CNV detection program used for this array, we used it as our primary algorithm for CNV identification. Overall, we did not observe any difference between males and females in the frequency distribution of common or rare CNVs (Supplementary Table S4 and Figure S2). As we have shown elsewhere13, using additional programs increases the sensitivity and specificity of CNV detection, so we also used three other programs. CNV calls were stratified into three groups with increasingly stringent cutoffs: a) “basic filter” - representing the entire CNV set exceeding at least 1kb in length and having a minimum of eight consecutive probes; b) “research set” - a subset of the basic filter where all the CNVs require the support of at least two algorithms (ChAS plus a second algorithm); and c) the “clinically stringent set”, which includes CNVs with a size and probe threshold of 25kb and 25 probes for losses, and 50kb and 50 probes for gains (Figure 2A).
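The three stringency tiers can be sketched as simple predicates over a CNV call. This is an illustrative sketch rather than the pipeline used in the study: the `CnvCall` record and the function names are hypothetical, the 1kb boundary is treated as inclusive, and the clinically stringent thresholds are applied on top of the basic filter because the text does not say whether that tier also requires support from a second algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CnvCall:
    kind: str              # "gain" or "loss"
    size_bp: int           # CNV length in base pairs
    n_probes: int          # consecutive probes supporting the call
    algorithms: frozenset  # names of the algorithms that reported the call


def passes_basic(call):
    # Basic filter: at least 1 kb long with at least eight consecutive probes.
    return call.size_bp >= 1_000 and call.n_probes >= 8


def passes_research(call):
    # Research set: basic filter plus ChAS and at least one other algorithm.
    return (passes_basic(call)
            and "ChAS" in call.algorithms
            and len(call.algorithms) >= 2)


def passes_clinical(call):
    # Clinically stringent set: 25 kb / 25 probes for losses and
    # 50 kb / 50 probes for gains (assumed to build on the basic filter).
    if call.kind not in ("gain", "loss"):
        raise ValueError(f"unknown CNV kind: {call.kind!r}")
    min_size, min_probes = (25_000, 25) if call.kind == "loss" else (50_000, 50)
    return (passes_basic(call)
            and call.size_bp >= min_size
            and call.n_probes >= min_probes)


def tiers(call):
    """Return the names of every tier that the call satisfies."""
    checks = (("basic", passes_basic),
              ("research", passes_research),
              ("clinical", passes_clinical))
    return [name for name, check in checks if check(call)]


# Hypothetical calls: a 30 kb loss seen by two algorithms, a 30 kb gain seen by one.
loss_call = CnvCall("loss", 30_000, 40, frozenset({"ChAS", "iPattern"}))
gain_call = CnvCall("gain", 30_000, 40, frozenset({"ChAS"}))
print(tiers(loss_call))
print(tiers(gain_call))
```

Keeping each tier as a separate predicate makes it straightforward to change a single threshold, or to make the clinical tier depend on the research set, without touching the others.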
Applying the basic filter, we detected 71,178 CNVs with the majority being losses (56,442) compared to gains (14,736). CNV sizes ranged from 1kb to 4.3Mb, with a median size of 9.95kb (Figure 2B and Supplementary Table S4). Rare (<1% population frequency) large CNVs (>100kb) comprised 5.2% of the CNVs detected. Male and female samples possessed 38,427 (54%) and 32,751 (46%) of CNVs, respectively. Importantly, 6,984 of the variants (mean size 7.7kb) within the OPGP cohort are novel, having not been reported in any other studies (Supplementary Table S5) within the Database of Genomic Variants (DGV)14. This array is characterized by a high probe density for genic regions and, therefore, 62% of the detected CNVs overlapped with at least one gene. The reproducibility computed from 22 replicates (with at least 50% reciprocal overlap) for the “basic filter” shows that >77% of CNVs (both losses and gains) are reproducible (Figure 2C).
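The reciprocal-overlap criterion used for the replicate comparison can be illustrated as follows. This is a minimal sketch under stated assumptions: intervals are half-open `(chrom, start, end)` tuples, a call counts as reproduced only if the replicate contains a CNV of the same type meeting the threshold, and the function names are hypothetical. The exact matching procedure used in the study is not described in the text.

```python
def reciprocal_overlap(a, b):
    """Return the smaller of the two overlap fractions for two intervals.

    Each interval is (chrom, start, end) with start < end (half-open).
    """
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return 0.0
    shared = min(end_a, end_b) - max(start_a, start_b)
    if shared <= 0:
        return 0.0
    return min(shared / (end_a - start_a), shared / (end_b - start_b))


def reproducible_fraction(calls, replicate_calls, threshold=0.5):
    """Fraction of calls matched in the replicate by a same-type CNV.

    calls / replicate_calls: lists of (kind, interval) pairs.
    """
    if not calls:
        return 0.0
    matched = 0
    for kind, interval in calls:
        if any(kind == rep_kind
               and reciprocal_overlap(interval, rep_interval) >= threshold
               for rep_kind, rep_interval in replicate_calls):
            matched += 1
    return matched / len(calls)


# Hypothetical calls from one sample and its replicate.
sample = [("loss", ("chr1", 1_000, 11_000)), ("gain", ("chr2", 5_000, 9_000))]
replicate = [("loss", ("chr1", 2_000, 12_000)), ("gain", ("chr2", 8_000, 20_000))]
print(reproducible_fraction(sample, replicate))
```

Taking the smaller of the two fractions means a small call nested inside a much larger one does not count as a match, which is the purpose of requiring the overlap to be reciprocal.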
After applying the “research set” filter, we obtained 34,502 CNVs (10,271 gains, and 24,231 losses) with a median size of 13kb (Fig. 2A). The genic CNV rate remained unchanged (~62.7%), but the proportion of large (>100kb) CNVs increased to 9.2% and reproducibility to 85% for both losses and gains.
In contrast, the “clinically stringent set” contained 6,965 high confidence CNVs (2,576 gains, 4,389 losses) with a median size of 79kb; 73% of CNVs within this specific tier are genic and reproducibility is >96% for both losses and gains (Fig. 2A-C). Comparison with the Affymetrix SNP array 6.0 data set showed that 81% of “research set” and 90% of “clinically stringent set” CNV calls were concordant between microarrays. Our qPCR validation set included 12 randomly chosen CNVs of different lengths from the “basic filter” CNV set and 11 of 12 (92%) were validated by this method.
DISCUSSION
We present a new CNV resource derived from a North American population originating from Ontario, Canada. This is the first such public resource of data available for CNVs genotyped on the CytoScan-HD array. The resulting data should be of considerable value in guiding diagnostic labs that are increasingly using the CytoScan-HD array or the FDA-cleared CytoScan-Dx array to detect and assess the relevance of chromosomal abnormalities. In this study, we analyzed the CNV data in different stringency tiers (basic, research, and clinical filters) to facilitate investigation of research questions as well as the appropriate clinical interpretation and prioritization of variants.
Our analysis found 6,984 CNVs not described previously. Many of these are small CNVs in the 1-15kb range, which have previously been incompletely characterized19. This higher-resolution analysis allows detection of novel CNVs affecting only small regions within a gene (e.g. ADD2; Supplementary Figure S2), and helps to better define breakpoints of existing CNV calls (e.g. PRIME2; Supplementary Figure S4).
By comparison with the DECIPHER database (>70% reciprocal overlap), we found 10 OPGP samples harboring pathogenic gains and losses for five distinct genomic disorders (Supplementary Table S7). For example, CNV genotyping of the OPGP samples detected variants overlapping the 16p13.11 region associated with male-biased neurodevelopmental disorders (Supplementary Figure S5), as well as known disease-causing or risk genes (e.g. PARK2; Supplementary Figure S6). In one example from our recent work20, isoform-specific small deletions within the ASTN2/TRIM32 genes in males were implicated in NDD with diverse phenotypes. This segment of the genome is well represented on the CytoScan-HD array and, in fact, a smaller isoform-specific deletion was also detected in a male individual within the OPGP cohort.
Ultimately, high-resolution CNV calls using microarrays and sequencing will enable the construction of a chromosome imbalance map of the human genome. To best facilitate application in the clinical genetics setting, the data used for this map should be as accurate as possible and incorporate all geographic populations. In this work, we add valuable data and accompanying biospecimens to support such future clinical genetic research studies.
Supplementary Material
Acknowledgments
The Ontario Population Genomics Platform was established with funding from The Centre for Applied Genomics (TCAG) and infrastructure support from the Canada Foundation for Innovation (CFI). Participant recruitment was made possible in collaboration with the Institute for Social Research at York University, the Ontario Familial Breast Cancer Registry (also supported by grant UM1 CA164920 from the USA National Cancer Institute (NCI)) and the Ontario Familial Colorectal Cancer Registry (also supported by grant UM1 CA167551 from the NCI and through a cooperative agreement U01/U24 CA074783), for which we thank Irene Andrulis, Michelle Cotterchio, Steve Gallinger and Teresa Selander. The content of this manuscript does not necessarily reflect the views or policies of the NCI, the host organizations or collaborating centers, nor does mention of trade names, commercial products, or organizations imply endorsement by the government, funding agencies, host organizations, or collaborating centres. We thank Ting Wang, Kozue Otaka, and Guillermo Casallo for their help with sample preparation. We also thank Alan H. Roter and Sam Dougaparsad from Affymetrix for their technical help. We thank the TCAG Science and Technology Innovation Centre (STIC), which is funded by Genome Canada and the Ontario Genomics Institute, the CFI, and the Ontario Research Fund of the Government of Ontario. The project was also supported by funds from the University of Toronto McLaughlin Centre and Genome Canada. S.W.S. holds the GlaxoSmithKline-CIHR Chair in Genome Sciences at the University of Toronto and The Hospital for Sick Children.
References
- 1. Beckmann JS, Estivill X, Antonarakis SE. Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature reviews. Genetics. 2007 Aug;8(8):639–646. doi: 10.1038/nrg2149.
- 2. Buchanan JA, Scherer SW. Contemplating effects of genomic structural variation. Genetics in medicine : official journal of the American College of Medical Genetics. 2008 Sep;10(9):639–647. doi: 10.1097/gim.0b013e318183f848.
- 3. Conrad DF, Pinto D, Redon R, et al. Origins and functional impact of copy number variation in the human genome. Nature. 2010 Apr 1;464(7289):704–712. doi: 10.1038/nature08516.
- 4. Lee C, Scherer SW. The clinical context of copy number variation in the human genome. Expert reviews in molecular medicine. 2010;12:e8. doi: 10.1017/S1462399410001390.
- 5. Cook EH, Jr., Scherer SW. Copy-number variations associated with neuropsychiatric conditions. Nature. 2008 Oct 16;455(7215):919–923. doi: 10.1038/nature07458.
- 6. Pinto D, Pagnamenta AT, Klei L, et al. Functional impact of global rare copy number variation in autism spectrum disorders. Nature. 2010 Jul 15;466(7304):368–372. doi: 10.1038/nature09146.
- 7. Miller DT, Adam MP, Aradhya S, et al. Consensus statement: chromosomal microarray is a first-tier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. American journal of human genetics. 2010 May 14;86(5):749–764. doi: 10.1016/j.ajhg.2010.04.006.
- 8. Devlin B, Scherer SW. Genetic architecture in autism spectrum disorder. Current opinion in genetics & development. 2012 Jun;22(3):229–237. doi: 10.1016/j.gde.2012.03.002.
- 9. Shen Y, Dies KA, Holm IA, et al. Clinical genetic testing for patients with autism spectrum disorders. Pediatrics. 2010 Apr;125(4):e727–735. doi: 10.1542/peds.2009-1684.
- 10. Stobbe G, Liu Y, Wu R, Hudgings LH, Thompson O, Hisama FM. Diagnostic yield of array comparative genomic hybridization in adults with autism spectrum disorders. Genetics in medicine : official journal of the American College of Medical Genetics. 2014 Jan;16(1):70–77. doi: 10.1038/gim.2013.78.
- 11. Uddin M, Tammimies K, Pellecchia G, et al. Brain-expressed exons under purifying selection are enriched for de novo mutations in autism spectrum disorder. Nature genetics. 2014 Jul;46(7):742–747. doi: 10.1038/ng.2980.
- 12. Qiao Y, Tyson C, Hrynchak M, et al. Clinical application of 2.7M Cytogenetics array for CNV detection in subjects with idiopathic autism and/or intellectual disability. Clinical genetics. 2013 Feb;83(2):145–154. doi: 10.1111/j.1399-0004.2012.01860.x.
- 13. Pinto D, Darvishi K, Shi X, et al. Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nature biotechnology. 2011 Jun;29(6):512–520. doi: 10.1038/nbt.1852.
- 14. MacDonald JR, Ziman R, Yuen RK, Feuk L, Scherer SW. The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic acids research. 2014 Jan;42(Database issue):D986–992. doi: 10.1093/nar/gkt958.
- 15. Figueiredo JC, Knight JA, Briollais L, Andrulis IL, Ozcelik H. Polymorphisms XRCC1-R399Q and XRCC3-T241M and the risk of breast cancer at the Ontario site of the Breast Cancer Family Registry. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology. 2004 Apr;13(4):583–591.
- 16. Cotterchio M, McKeown-Eyssen G, Sutherland H, et al. Ontario familial colon cancer registry: methods and first-year response rates. Chronic diseases in Canada. 2000;21(2):81–86.
- 17. International HapMap Consortium, Altshuler DM, Gibbs RA, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010 Sep 2;467(7311):52–58. doi: 10.1038/nature09298.
- 18. Costain G, Lionel AC, Merico D, et al. Pathogenic rare copy number variants in community-based schizophrenia suggest a potential role for clinical microarrays. Human molecular genetics. 2013 Nov 15;22(22):4485–4501. doi: 10.1093/hmg/ddt297.
- 19. Pang AW, Macdonald JR, Yuen RK, Hayes VM, Scherer SW. Performance of high-throughput sequencing for the discovery of genetic variation across the complete size spectrum. G3. 2014;4(1):63–65. doi: 10.1534/g3.113.008797.
- 20. Lionel AC, Tammimies K, Vaags AK, et al. Disruption of the ASTN2/TRIM32 locus at 9q33.1 is a risk factor in males for autism spectrum disorders, ADHD and other neurodevelopmental phenotypes. Human molecular genetics. 2014 Jan 14; doi: 10.1093/hmg/ddt669.