Our objective was to design a genotyping platform that would allow rapid genetic characterization of samples in the context of genetic mutations and risk factors associated with common neurodegenerative diseases. The platform needed to be relatively affordable, rapid to deploy, and use a common and accessible technology. Central to this project, we wanted to make the content of the platform open to any investigator without restriction. In designing this array we prioritised a number of types of genetic variability for inclusion, such as known risk alleles, disease causing mutations, putative risk alleles, and other functionally important variants. The array was primarily designed to allow rapid screening of samples for disease causing mutations, and large population studies of risk factors. Notably, an explicit aim was to make this array widely available to facilitate data sharing across and within diseases.
The resulting array, NeuroX, is a remarkably cost and time effective solution for high quality genotyping. NeuroX comprises a backbone of standard Illumina exome content of approximately 240,000 variants, and over 24,000 custom content variants focusing on neurological diseases. Data is generated at ~$50–$60 per sample using a 12-sample format chip and regular Infinium infrastructure; thus genotyping is rapid, and accessible to many investigators. Here, we describe the design of NeuroX, discuss the utility of NeuroX in the analyses of rare and common risk variants, and present quality control metrics and a brief primer for the analysis of NeuroX derived data.
The availability of economical custom content additions to genome-wide or exome-wide genotyping arrays has permitted the development of tailored arrays for both genetic discovery and replication efforts. In the last few years it has become evident that, in the second wave of GWA studies, investigators sought to investigate variants below the threshold of genome-wide significance and to fine map extant signals. Such work requires large-scale replication involving the assay of large numbers of samples and the interrogation of a very large number of candidate variants. A fairly inefficient approach to this problem was being used, where investigators representing single disease research groups pursued replication in isolation from other efforts both within and across diseases. In 2011 the National Institute of Neurological Disorders and Stroke (NINDS) convened a meeting that included investigators researching myriad common neurodegenerative diseases with the intent of identifying a more efficient solution. This meeting involved representatives from genetics groups leading GWA studies in Alzheimer’s disease (AD), Parkinson’s disease (PD), amyotrophic lateral sclerosis (ALS), multiple sclerosis (MS), and frontotemporal dementia (FTD), among others. There was a broad consensus that the design of an accessible array that could type variants of interest for all major neurodegenerative diseases would be of great utility; such an array had the potential to benefit from an economy of scale, to reduce cost by allowing easy sharing of controls, and to allow direct comparison of genetic data across diseases. In response to that consensus, we modified the design of an array originally intended to serve as a replication assay for a large PD meta-analysis to include a wide variety of content relevant to the broader neurodegenerative disease research community. Here we describe the content and use of this array, called NeuroX.
In this effort, summary statistics for the largest available genome-wide association studies (GWAS) were mined to nominate known and candidate loci tagging risk for AD, FTD, multiple system atrophy (MSA), myasthenia gravis (MG), Charcot-Marie-Tooth (CMT), progressive supranuclear palsy (PSP), ALS, and PD. Where available, putative risk variants identified by exome sequencing of familial and population based samples, as well as those derived from literature review for the above diseases, were also included on the array. We also performed a systematic literature and database search for all mutations known to cause neurological disease. Technical redundancies and reliable proxies were used for priority SNPs to guarantee quality genotyping calls produced by the array. This custom Neuro content includes over 24,000 variants focused on neurodegenerative disease. This custom library can be added to many off-the-shelf Illumina Infinium products; here, however, we describe the use of this library when added to Illumina’s Infinium HumanExome BeadChip, a product we have named NeuroX. Thus NeuroX includes the exome sequencing based variants standard to the Illumina HumanExome array v1.1 (242,901 variants) and neurological and neurodegenerative disease focused content (24,706 variants). In addition to the ability to add the custom Neuro library to other Illumina genotyping arrays, it is also relatively easy to add new custom variants should the need arise. In this paper, however, we describe the initial version of the NeuroX array, comprising the base exome and existing custom content.
From its inception the NeuroX array was designed to be a rapid and cost effective solution for high quality genotype data. The current cost of the array is ~US$57 per sample, an estimate that includes reagents but excludes labor and existing Illumina infrastructure costs. Genotyping of thousands of samples per week is achievable in most core laboratories. It is also notable that we are in the process of making a large amount of NeuroX data publicly available (dbGaP address pending).
Array Design
The custom content available on the NeuroX array was taken from three primary sources: large-scale GWAS, high throughput sequencing of families and cohort studies, and literature searches to identify risk factors and disease-causing mutations.
For GWAS based datasets we mined participant level data, when available, for diseases such as PD, ALS, FTD, and MG, including both published and unpublished datasets (ALSGEN Consortium et al., 2013; Chiò et al., 2009; Do et al., 2011; International Parkinson Disease Genomics Consortium et al., 2011; International Parkinson’s Disease Genomics Consortium (IPDGC) and Wellcome Trust Case Control Consortium 2 (WTCCC2), 2011; Lill et al., 2012; Mok et al., 2012). Participant level GWAS data for AD and PSP were not available to our group at the time of chip design, so publicly available GWA loci for these diseases were included (1000 Genomes Project Consortium et al., 2012; Höglinger et al., 2011; Hollingworth et al., 2011; Lambert et al., 2009). Genome-wide significant loci from diseases of interest were included with either multiple proxies for the top SNP at every locus or, if proxies were not available, technical replicates. We included up to 5 variants per significant locus. Loci were defined as any SNP reaching a genome-wide significant p-value and correlated at r2 < 0.50 with any other significant SNPs within 250 kilobases for each disease of interest. All analyses were derived from at least 1000 Genomes level SNP coverage and used participant level data from the 1000 Genomes project to nominate proxies when possible. In addition, locus tagging SNPs were included to allow for the identification of new loci in larger sample series. For all SNPs associated in GWAS with diseases of interest that reached candidate p-values of 1E-4 or stronger, additional haplotype tagging SNPs were placed on the NeuroX array, in an attempt to facilitate future genotype imputation efforts. Tagging SNPs were selected to have an r2 of less than 0.50 in 1000 Genomes samples with any other SNP meeting the same p-value threshold within a 250 kilobase window, allowing for regional assessments of genetic variability.
Whenever possible, GWAS based SNPs that were not the most significant within the locus were replaced by a proxy meeting the above criteria if the array design for the probe associated with that SNP failed (design quality score less than 0.80 and no array validation), so that only higher quality SNPs were used on the NeuroX array. This led to the successful inclusion of almost 16,000 GWAS-derived or GWAS-related variants across multiple disease sources.
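The locus definition used for the GWAS-derived content (significant SNPs that are correlated at r2 < 0.50 with any other significant SNP within 250 kilobases) can be illustrated with a greedy selection. The sketch below is not the pipeline used to build the array: the `Snp` record, the `define_loci` function, the `r2` lookup callback, and the 5E-8 genome-wide significance threshold are all assumptions introduced for illustration, since the text does not state the threshold or the software used.

```python
from typing import Callable, Iterable, List, NamedTuple


class Snp(NamedTuple):
    snp_id: str
    chrom: str
    pos: int        # base-pair position
    p_value: float  # GWAS association p-value


def define_loci(
    snps: Iterable[Snp],
    r2: Callable[[str, str], float],  # hypothetical pairwise r2 lookup
    p_threshold: float = 5e-8,        # assumed genome-wide threshold
    window_bp: int = 250_000,
    r2_max: float = 0.50,
) -> List[Snp]:
    """Return one lead SNP per locus.

    SNPs are visited from most to least significant. A SNP starts a new
    locus unless an already selected lead on the same chromosome lies
    within `window_bp` and is correlated with it at r2 >= `r2_max`.
    """
    significant = sorted(
        (s for s in snps if s.p_value <= p_threshold),
        key=lambda s: s.p_value,
    )
    leads: List[Snp] = []
    for snp in significant:
        linked = any(
            lead.chrom == snp.chrom
            and abs(lead.pos - snp.pos) <= window_bp
            and r2(lead.snp_id, snp.snp_id) >= r2_max
            for lead in leads
        )
        if not linked:
            leads.append(snp)
    return leads


# Toy data: rsB is in strong LD with rsA, rsC is not, rsD is not significant.
toy_r2 = {frozenset(("rsA", "rsB")): 0.9, frozenset(("rsA", "rsC")): 0.1}


def lookup_r2(a: str, b: str) -> float:
    return toy_r2.get(frozenset((a, b)), 0.0)


toy_snps = [
    Snp("rsA", "1", 1_000_000, 1e-12),
    Snp("rsB", "1", 1_050_000, 1e-9),
    Snp("rsC", "1", 1_100_000, 1e-10),
    Snp("rsD", "1", 1_150_000, 1e-3),
]
lead_ids = [s.snp_id for s in define_loci(toy_snps, lookup_r2)]
print(lead_ids)
```

Calling the same function with `p_threshold=1e-4` would give an analogous selection for the candidate-level tagging SNPs described above, under the same assumptions.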
Sequence based data generated by pilot studies within our consortia (both exome and genome sequencing) was mined to nominate rare and coding variants for inclusion on the NeuroX array. This data comes from familial and cohort studies looking into AD, MSA, FTD, CMT, PSP, ALS, and PD. Cohort-derived sequence-based data included any rare and coding variants at a frequency of less than 5% in the population from which the pilot data was collected. For data extracted from family-based sequencing studies, variants were filtered and only those not appearing in the 1000 Genomes Project and the NHLBI’s Exome Sequencing Project database were included (1000 Genomes Project Consortium et al., 2012; NHLBI GO Exome Sequencing Project, n.d.). This led to the successful inclusion of 7485 rare sequence-based variants.
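The two nomination rules for sequence-based variants can be summarized in a short sketch. It assumes one reading of the text, namely that cohort variants must be coding and below 5% frequency, and that family variants must be absent from both reference databases. The `SeqVariant` record, the `nominate` function, and the `known_ids` set (standing in for the combined 1000 Genomes and Exome Sequencing Project variant lists) are hypothetical names, not part of the original workflow.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class SeqVariant:
    variant_id: str
    source: str                         # "cohort" or "family"
    is_coding: bool
    cohort_maf: Optional[float] = None  # frequency in the pilot cohort


def nominate(
    variants: Iterable[SeqVariant],
    known_ids: Set[str],    # IDs present in the reference databases
    maf_max: float = 0.05,
) -> List[SeqVariant]:
    """Apply the cohort and family filters described in the text."""
    nominated = []
    for v in variants:
        if v.source == "cohort":
            # Cohort studies: coding variants below the frequency cutoff.
            keep = (
                v.is_coding
                and v.cohort_maf is not None
                and v.cohort_maf < maf_max
            )
        elif v.source == "family":
            # Family studies: only variants unseen in the reference databases.
            keep = v.variant_id not in known_ids
        else:
            keep = False
        if keep:
            nominated.append(v)
    return nominated


toy_variants = [
    SeqVariant("c1", "cohort", True, 0.01),
    SeqVariant("c2", "cohort", True, 0.20),   # too common
    SeqVariant("c3", "cohort", False, 0.01),  # not coding
    SeqVariant("f1", "family", True),
    SeqVariant("f2", "family", True),         # already in a reference database
]
nominated_ids = [v.variant_id for v in nominate(toy_variants, known_ids={"f2"})]
print(nominated_ids)
```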
An extensive systematic review of published literature was performed to nominate variants known to be involved in neurological or neurodegenerative diseases for inclusion on the array. Briefly, we performed PubMed searches using the gene name and the word “mutation” as search parameters to identify articles describing mutations. The genes searched for were: ABCA7, ACE, APOE, APP, ATP13A2, BACE1, CHMP2B, CLCN6, CLN3, CLN5, CLN6, CLN8, CSF1R, CST3, CTSD, DNAJC5, ECE2, FBXO7, FUS, GBA, GLA, GLB1, GRN, GUSB, HEXA, HEXB, LRRK2, MAPT, MFSD8, NEU1, NOTCH3, NPC1, NPC2, PANK2, PARK2, PARK7, PINK1, PLA2G6, PPT1, PSAP, PSEN1, PSEN2, SGSH, SNCA, SORL1, SPTLC1, TARDBP, TPP1, TREM2, TYROBP, VCP and VPS35. We complemented this search by including all of the variants in the Parkinson Disease Mutation Database and the Alzheimer Disease & Frontotemporal Dementia Mutation Database (Cruts et al., 2012). In addition, updated GWAS loci for any traits meeting p-values less than 1E-8 in the NHGRI GWAS catalog that were not already on the basic exome content were added to the array if the probe design score for that SNP was > 0.8 (Hindorff et al., n.d., 2009). Also, as part of this phase of array design, special attention was paid to the APOE region, with 34 variants being dedicated to genotyping of the canonical epsilon-4 compound genotype. This led to the successful inclusion of 1322 variants.
For ALS we also mined variants from a number of databases with several aims. To identify new mutation carriers, we collected from HGMD and ALSOD all mutations in common (C9orf72 excluding repeat expansions, FUS, MATR3, OPTN, SOD1, SPG11, TARDBP, UBQLN2, VCP) and rare ALS genes (ALS2, ANG, CHMP2B, DCTN1, FIG4, SETX, TAF15, VAPB). To identify association signals in and around known ALS genes, we mined 1000 Genomes data to identify all multi-ethnic variants with MAF > 0.01 located in common ALS gene bodies +/− 100 kb. We then used PLINK to identify haplotype-tagging SNPs (r2 > 0.50). For the ALS/FTD linked gene C9orf72, we mined variants located within the 242 kb Finnish 42-SNP haplotype and +/− 20 kb (Laaksovirta et al., 2010). To fine map exonic variation in known ALS genes, we mined 1000 Genomes data to identify all multi-ethnic exonic variants with MAF > 0 in common ALS genes.
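The region-based selection described for the ALS genes (variants above a frequency cutoff inside a gene body plus a flanking window) amounts to an interval filter. The following is a minimal sketch with hypothetical record types and toy coordinates; `GENE_A` and its positions are placeholders rather than real gene coordinates, and the haplotype-tagging step performed in PLINK is not reproduced here.

```python
from typing import Iterable, List, NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # gene body start (bp)
    end: int    # gene body end (bp)


class Variant(NamedTuple):
    variant_id: str
    chrom: str
    pos: int
    maf: float


def variants_in_gene_windows(
    variants: Iterable[Variant],
    genes: Iterable[Gene],
    flank_bp: int = 100_000,
    min_maf: float = 0.01,
) -> List[Variant]:
    """Keep variants with MAF > min_maf inside any gene body +/- flank_bp."""
    gene_list = list(genes)
    selected = []
    for v in variants:
        if v.maf <= min_maf:
            continue
        for g in gene_list:
            if v.chrom == g.chrom and g.start - flank_bp <= v.pos <= g.end + flank_bp:
                selected.append(v)
                break  # count each variant once even if windows overlap
    return selected


toy_genes = [Gene("GENE_A", "1", 1_000_000, 1_050_000)]
toy_variants = [
    Variant("v1", "1", 950_000, 0.20),    # inside the 100 kb flank
    Variant("v2", "1", 1_200_000, 0.20),  # outside the window
    Variant("v3", "1", 1_010_000, 0.005), # in the gene but too rare
    Variant("v4", "2", 1_010_000, 0.30),  # wrong chromosome
]
common_ids = [v.variant_id for v in variants_in_gene_windows(toy_variants, toy_genes)]
print(common_ids)
```

Setting `flank_bp=20_000` or `min_maf=0.0` adapts the same filter to the C9orf72 haplotype window and to the MAF > 0 criterion, respectively; restricting to exonic variants would additionally require annotation that this sketch does not model.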
Array Genotyping
For the pilot analysis used to generate the data presented here, approximately 14,000 samples were genotyped and multiple calling methods were tested. Samples tested were derived from a number of sources including DNA from whole blood, EBV transformed lymphocytes, and brain tissue. Genotyping was executed as per the manufacturer’s protocol (Illumina, Inc.). Our genotype calling workflow used a publicly available cluster file for the exome array standard content, which we modified to maximize variant calling for the NeuroX custom content. This was accomplished using a combination of the Illumina GenomeStudio automated clustering algorithm, with manual inspection and modification for the subset of the clusters not included in Grove et al. (2013). As part of the array design process, we excluded a number of variants based on low design quality scores, which allowed us to retain diverse content and maximize the number of successfully typed variants.
In addition, we imputed a random subset of 1,000 unrelated individuals of European ancestry from a larger Parkinson’s disease GWAS that passed quality control after being genotyped on the NeuroX array (Nalls et al., 2014b) using the default settings of MiniMac (Howie et al., 2012). Nonpalindromic SNPs passing quality control and overlapping with those included in the reference haplotypes (1000 Genomes Phase 1 Alpha Freeze version 3, multiethnic panel) were used for imputation (1000 Genomes Project Consortium et al., 2012). This allowed us to densely impute higher variant coverage into regions of interest related to neurological disease GWAS based on the currently available content on the array. In addition, we show that the NeuroX array can be used for basic quality control similar to standard GWAS, such as gender checking (evaluating concordance between self-reported and genetically determined genders as part of quality control) or estimating continental ancestry by applying principal components analyses to common tagging SNPs (Supplemental Figures S1 and S2).
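Palindromic SNPs (A/T and C/G) are excluded before imputation because their alleles read the same on both strands, so strand mismatches against the reference panel cannot be detected from the alleles alone. A minimal sketch of that filter is given below; the function names and the dictionary layout are assumptions for illustration, and the input is assumed to contain only SNPs that already passed quality control.

```python
from typing import Dict, List, Set, Tuple

# Allele pairs that are their own reverse complement.
PALINDROMIC_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_palindromic(allele1: str, allele2: str) -> bool:
    """True for A/T and C/G SNPs, whose strand cannot be inferred from alleles."""
    return frozenset((allele1.upper(), allele2.upper())) in PALINDROMIC_PAIRS


def imputation_input(
    array_snps: Dict[str, Tuple[str, str]],  # SNP ID -> (allele1, allele2)
    reference_ids: Set[str],                 # SNP IDs in the reference panel
) -> List[str]:
    """Keep nonpalindromic array SNPs that are also in the reference panel."""
    return [
        snp_id
        for snp_id, (a1, a2) in array_snps.items()
        if snp_id in reference_ids and not is_palindromic(a1, a2)
    ]


toy_array = {
    "rs1": ("A", "G"),
    "rs2": ("A", "T"),  # palindromic
    "rs3": ("C", "G"),  # palindromic
    "rs4": ("C", "T"),  # not in the reference panel
}
kept = imputation_input(toy_array, reference_ids={"rs1", "rs2", "rs3"})
print(kept)
```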
For both the NeuroX custom content and the standard HumanExome based content, a majority of SNPs across the minor allele frequency (MAF) spectrum have GenTrain scores > 0.7, suggesting that high quality genotype clusters are readily available (Figure 1). As expected, lower MAF is associated with slightly lower genotype cluster quality (p-values < 0.001 from linear regression models across MAF strata in Figure 1 comparing trends in GenTrain as MAF changes). GenTrain scores tend to be only marginally lower for the custom content, which is not entirely surprising, given that genotypes for these variants were clustered and called on a reference of ~14,000 samples as opposed to the ~60,000 samples used to generate the reference cluster file used to call genotypes for the standard exome content.
Figure 1.
For both NeuroX custom content and the standard content included on the array, a majority of SNPs across the minor allele frequency spectrum have GenTrain scores > 0.7, suggesting quality genotype clusters are readily available. Discrepancies across content type are partially due to genotype cluster method differences between the two sets of variants (custom and standard content), but also due to the inclusion of rare and difficult to genotype loci in the custom content of the array.
Custom content on the NeuroX array spans 2,236 megabases of the autosome, only slightly less than the ~2,600 megabases covered by early GWAS arrays on which many previous studies of neurodegenerative disease were based (Nalls et al., 2009). Mean per megabase coverage of the custom content is 10.754 variants per megabase, with a maximum of over 600 variants of interest for fine mapping of particular regions, with a comparative bias towards non-exonic and GWAS derived variants (Table 1, Figure 2). The maximum coverage for the NeuroX custom content occurs in regions of interest, where it reaches over five-fold the depth of the standard content in the same region. In comparison, the standard content covers 2,703 autosomal megabases at an average of 87.842 variants per megabase, with maximum coverage of certain exonic regions surpassing 1,000 variants per megabase. The inclusion of tag SNPs within the GWAS-derived custom content, in conjunction with standard content variants, has facilitated the successful imputation of over 1.2 million SNPs (imputation quality > 0.30). Imputed variants from the NeuroX array cover 2,703 megabases and average 478.400 variants per megabase, with a maximum coverage of over 16,500 variants per megabase in some regions of interest.
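Per-megabase coverage statistics of this kind can be computed by counting variants in fixed genomic bins. The sketch below assumes non-overlapping 1 Mb bins and takes the mean over bins that contain at least one variant; the text does not specify the exact binning used for the reported figures, so this is one plausible definition rather than the authors' procedure, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

MEGABASE = 1_000_000


def coverage_per_megabase(positions: Iterable[Tuple[str, int]]) -> Counter:
    """Count variants per (chromosome, 1 Mb bin) from (chrom, bp) pairs."""
    return Counter((chrom, pos // MEGABASE) for chrom, pos in positions)


def summarize(bins: Counter) -> Dict[str, float]:
    """Megabases with at least one variant, plus mean and max density."""
    if not bins:
        raise ValueError("no variants supplied")
    total_variants = sum(bins.values())
    return {
        "megabases_covered": len(bins),
        "mean_per_mb": total_variants / len(bins),
        "max_per_mb": max(bins.values()),
    }


toy_positions = [
    ("1", 100),
    ("1", 999_999),    # same bin as the variant above
    ("1", 1_000_000),  # first base of the next bin
    ("2", 5_500_000),
]
toy_bins = coverage_per_megabase(toy_positions)
toy_summary = summarize(toy_bins)
print(toy_summary)
```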
Table 1.
Comparison of content across standard and custom content. Data is based on clustering of over 14,000 Parkinson’s disease cases and controls as described in (Nalls et al., 2014). All annotations from ANNOVAR (Wang et al., 2010).
Content type | Custom content | Standard content |
Number of variants | 24706 | 242901 |
Variants less than MAF 0.01 (%) | 31.531 | 82.277 |
Variants less than MAF 0.05 (%) | 40.047 | 86.078 |
Variants at MAF 0.05 to 0.50 (%) | 59.953 | 13.922 |
Mean MAF | 0.148 | 0.031 |
Exonic variants (%) | 36.151 | 96.504 |
Nonsynonymous coding variants (% of exonic) | 33.934 | 91.332 |
Figure 2.
Autosomal variant coverage per megabase for different content classes. Panel A, custom content coverage; Panel B, standard content coverage; Panel C, coverage for successfully imputed variants (imputation quality > 0.30).
As a proof of concept, we accurately tag known rare variants in neurodegenerative disease. For example, genotypes for the p.G2019S mutation in LRRK2 (rs34637584) were completely concordant across more than one thousand samples genotyped using the NeuroX array that were also assayed via TaqMan genotyping (Paisán-Ruíz et al., 2004). APOE genotypes were extracted for a subset of over 2,500 NeuroX assayed samples overlapping with a previous study based on targeted genotyping (Federoff et al., 2012); in these samples NeuroX tagged the APOE epsilon-4 haplotype associated with Alzheimer’s risk with only 93% accuracy. This haplotype is defined by two SNPs, rs7412 and rs429358. The discordance of APOE haplotypes between NeuroX and TaqMan genotyping was entirely driven by discordance at rs429358, with complete concordance at rs7412. Notably, we have identified rs429358 as a low quality variant on NeuroX. In contrast, rs7412 is of acceptable quality, with greater than 99% genotype concordance across 5 technical replicates, a success rate mirrored at a majority of redundant sites across the array.
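The APOE comparison rests on deriving the epsilon genotype from the two SNPs and then measuring agreement between platforms. A minimal sketch follows, using the standard allele definitions (epsilon-2: T at rs429358 and T at rs7412; epsilon-3: T and C; epsilon-4: C and C). It assumes unphased forward-strand genotypes written as two-letter strings and resolves the double heterozygote as e2/e4, which is the conventional assumption; the function names are hypothetical and this is not the extraction code used in the study.

```python
from typing import Dict


def apoe_genotype(rs429358: str, rs7412: str) -> str:
    """Derive the APOE epsilon genotype from unphased SNP genotypes.

    Each C at rs429358 is counted as one e4 allele and each T at rs7412
    as one e2 allele; the remaining alleles are e3.
    """
    g1, g2 = rs429358.upper(), rs7412.upper()
    for genotype in (g1, g2):
        if len(genotype) != 2 or set(genotype) - {"C", "T"}:
            raise ValueError(f"unexpected genotype: {genotype!r}")
    n_e4 = g1.count("C")
    n_e2 = g2.count("T")
    if n_e4 + n_e2 > 2:
        # Would require the rare haplotype with C at rs429358 and T at rs7412.
        raise ValueError("genotype combination not explained by e2/e3/e4")
    n_e3 = 2 - n_e4 - n_e2
    return "/".join(["e2"] * n_e2 + ["e3"] * n_e3 + ["e4"] * n_e4)


def concordance(calls_a: Dict[str, str], calls_b: Dict[str, str]) -> float:
    """Fraction of shared samples with identical calls on both platforms."""
    shared = [sample for sample in calls_a if sample in calls_b]
    if not shared:
        raise ValueError("no overlapping samples")
    agree = sum(calls_a[s] == calls_b[s] for s in shared)
    return agree / len(shared)


# Toy comparison between two hypothetical call sets.
array_calls = {"s1": apoe_genotype("TT", "CC"), "s2": apoe_genotype("TC", "CC")}
taqman_calls = {"s1": "e3/e3", "s2": "e4/e4"}
toy_concordance = concordance(array_calls, taqman_calls)
print(array_calls, toy_concordance)
```

Because every epsilon-4 call depends on rs429358, errors at that SNP alone are enough to produce the haplotype-level discordance reported above even when rs7412 is called perfectly.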
The data presented in this paper shows that the NeuroX array is a powerful and reliable tool for the investigation of genetic factors associated with neurodegenerative disorders. While not designed with clinical diagnosis in mind, we believe it will serve as a powerful analytic tool for research purposes and investigation of disease mechanisms. We have shown that the content of the array is not only useful in assaying both rare risk variants and common variability for use in future studies, but also highly valuable in investigating known risk loci in more detail. We fully expect this array to become a starting point for the genetic analysis of neurodegenerative disorders, given its relevant and up-to-date genotyping content as well as its low cost. This custom array is being treated as an on-going venture and is currently being adapted to newer genotyping platforms outside of the standard exome array content described here and tuned for better accuracy and higher quality content, while still maintaining compatibility with current offerings. Additionally, the fact that virtually all samples derived from subjects with these disorders may be screened on the same platform will provide researchers with tremendous power not only to analyse a single phenotype, but also to compare different disease entities for overlaps or significant differences.
This work was supported in part by the Intramural Research Program of the National Institute on Aging, of the National Institutes of Health, Department of Health and Human Services (project numbers Z01-AG000949-02, under human subjects protocol 2003-077) and by the Wellcome Trust/MRC Joint Call in Neurodegeneration award (WT089698) to the UK Parkinson's Disease Consortium (UKPDC) whose members are from the UCL/Institute of Neurology, the University of Sheffield and the MRC Protein Phosphorylation Unit at the University of Dundee.
Figure S1: Gender differences based on X chromosome genotype distributions for 200 random samples with females in red and males in blue.
Figure S2: Common polymorphisms show accurate estimates of continental ancestry in a subset of PD samples when compared at overlapping variants to samples from HapMap Phase 3 via principal components analysis (International HapMap 3 Consortium et al., 2010).
