Abstract
Genome-wide association (GWAS) analyses have identified susceptibility loci for many diseases, but most risk for any complex disorder remains un-attributed. There is therefore scope for complementary approaches to these datasets. Gene-wide approaches potentially offer additional insights. They might identify association to genes through multiple signals. Also, by providing support for genes rather than SNPs, they offer an additional opportunity to compare the results across datasets. We have undertaken gene-wide analysis of two GWAS datasets: schizophrenia and bipolar disorder. We performed two forms of analysis, one based upon the smallest p-value per gene, the other upon a truncated product of p method. For each dataset and at a range of statistical thresholds, we observed significantly more SNPs within genes (pmin for excess<0.001) showing evidence for association than expected whereas this was not true for extra-genic SNPs (pmin for excess>0.1). At a range of thresholds of significance, we also observed substantially more associated genes than expected (pmin for excess in schizophrenia =1.8×10−8, in bipolar = 2.4×10−6). Moreover, an excess of genes showed evidence for association across disorders. Among those genes surpassing thresholds highly enriched for true association, we observed evidence for association to genes reported in other GWAS datasets (CACNA1C) or to closely related family members of those genes including CSF2RB, CACNA1B, and DGKI. Our analyses show that association signals are enriched in and around genes, that large numbers of genes contribute to both disorders, and that gene-wide analyses offer useful complementary approaches to more standard methods.
Keywords: genetics, association, bipolar, schizophrenia, psychosis
Introduction
Until recently, there have been few undisputed genetic associations to non-mendelian forms of common human diseases, but for many diseases, the advent of genome-wide association (GWAS) technology has recently transformed this position1. The attainment of highly significant associations through GWAS reflects in some cases, the availability of large sample sizes2, for others, for example the HLA locus in rheumatoid arthritis and T1D, the existence of at least some common alleles with a greater than average effect size3. These conditions may not readily be satisfied for most complex disorders, for example psychotic disorders, where the extremely large sample sizes used for some disorders2 are difficult to obtain since diagnosis is laborious and expensive. Moreover, the necessary use of phenotypes entirely defined by symptoms will very likely increase aetiological heterogeneity, and thus weaken the observed correlations between genotypes and phenotypes. Therefore, far from having a few common risk genes with higher than expected effect sizes, the observed effect sizes in psychosis might be even smaller than typical for other complex disorders. Moreover, even for those disorders where successes have been legion, the majority of genetic risk remains un-attributed2,4. There is therefore a pressing need for alternative methods for extracting information from GWAS datasets.
GWAS studies to date have focused on single locus tests which are the simplest to generate and to interpret. There are, however, situations where they might not provide most power. Examples include where there are multiple common variants at a locus5, and, for replication studies or meta-analysis, where there are differences in the association signals between populations6. Either of these can give rise to complex patterns of association that might not be reflected by association to the same SNPs in different samples despite apparently reasonably powered samples7,8. Another consideration is that power to detect association might be enhanced by exploiting information from multiple (quasi) independent signals within genes. Although it is unknown to what extent any of the above concerns occur in practice, arguments such as these have led some to advocate the use of gene-wide statistical approaches5.
A final crucial point for GWAS studies is that the likely architecture of genetic risk for the psychoses is a matter of considerable debate. Based upon epidemiology, in most cases, risk likely reflects the co-action of several loci but the approximate numbers of loci involved at the individual or the population levels are unknown, as is the spectrum of allele frequencies and effect sizes9,10. At present there is limited direct molecular genetic evidence that favours the existence of common risk alleles. The observations of multiple genome-wide significant or suggestive linkage signals for both disorders that do not readily replicate between studies but which are not randomly distributed across the genome11,12 are compatible with the existence of multiple risk alleles of small-moderate effect. They are not, however, informative with respect to their allele frequencies. Recent papers describing an enrichment of copy number variants in schizophrenia13,14,15 and an excess of de novo CNV events16 in that disorder have raised the possibility of a significant contribution from rare events, some of apparently high penetrance. Whilst it is not yet clear whether the contribution from CNVs is small or substantial, these findings can be interpreted as supporting the hypothesis that common variation may be less important than has generally been assumed.
The current findings from the few published GWAS studies of schizophrenia and bipolar disorder, along with the data from leading candidate genes that predate the era, are supportive of the hypothesis that common variants contribute. However, no locus has yet been reported for schizophrenia that in any single or combined study reaches genome-wide levels of significance17. For bipolar disorder, there are 3 such loci18,19 but these have yet to receive support in independent studies. The issue of whether there is wide-scale involvement of common variants is not moot; if the vast majority of genetic risk is conferred almost exclusively by rare alleles, approaches based upon indirect genetic association may not be very informative. Given the challenges in obtaining appropriate sized samples to conduct GWAS studies, it would be preferable to have strong evidence as to whether doing so is likely to be rewarded with a significant degree of success.
In the present study, we have investigated the potential advantages of assessing gene-wide significance in two GWAS datasets, one of schizophrenia20 the other of bipolar disorder3. Specifically, we aimed to determine if gene-wide analyses resulted in evidence for a marked excess of genes surpassing various thresholds of evidence for association, a finding that would be compatible with a common disease-common variant hypothesis and supportive of more intensive GWAS endeavours. Our secondary aim was to identify at least suggestive evidence for multiple susceptibility genes for each disorder that might support earlier findings, and inform follow up genetic studies. Also, given the hypothesis of overlap in genetic risk between the two disorders21 we wished to determine if there is overlap in the identity of associated genes.
Materials and methods
GWAS datasets
The Bipolar dataset was reported by the Wellcome Trust Case Control Consortium3 and consists of 1868 cases and 2938 controls typed with the GeneChip® 500K Mapping Array Set. The UK schizophrenia cases (n=479) were not part of that study but were typed contemporaneously with the WTCCC samples using the same pipeline20. The full details of the samples and methods for conduct of the GWAS studies are provided in the respective manuscripts. To make our analysis as conservative as possible, we only included autosomal SNPs which passed more stringent quality control criteria than used by the WTCCC, and, unlike that study, additionally corrected all p-values for inflation in the test statistics (see statistical section). Thus we excluded SNPs with Hardy-Weinberg equilibrium p< 0.001 in controls or p< 0.00001 in cases, with minor allele frequencies (MAF) < 0.01 in each of cases and controls, or with call rates<0.97.
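The exclusion criteria above can be written as a single predicate. The following is a minimal sketch only: the `SnpQC` record and its field names are hypothetical, and the MAF rule is read here as excluding a SNP when the minor allele frequency is below 0.01 in both cases and controls, which is one interpretation of "in each of cases and controls".

```python
from dataclasses import dataclass


@dataclass
class SnpQC:
    """Hypothetical per-SNP quality control summary (field names are illustrative)."""
    snp_id: str
    hwe_p_controls: float
    hwe_p_cases: float
    maf_cases: float
    maf_controls: float
    call_rate: float
    autosomal: bool


def passes_qc(s: SnpQC) -> bool:
    """Return True if the SNP survives the thresholds quoted in the text."""
    if not s.autosomal:
        return False
    # Hardy-Weinberg equilibrium: p < 0.001 in controls or p < 0.00001 in cases
    if s.hwe_p_controls < 0.001 or s.hwe_p_cases < 0.00001:
        return False
    # Assumption: exclude only when MAF < 0.01 in both cases and controls
    if s.maf_cases < 0.01 and s.maf_controls < 0.01:
        return False
    # Call rate below 0.97
    if s.call_rate < 0.97:
        return False
    return True
```

Under the alternative reading (exclude when MAF is below 0.01 in either group), the `and` in the MAF test would become `or`.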
SNP assignment
SNPs were assigned to genes if they were located within the genomic sequence corresponding to the start of the first and the end of the last exon of any transcript corresponding to that gene. Functional elements are not restricted to this region but we used this since any other definition is arbitrary. The chromosome and location for all currently known human SNPs and genes and their identifiers was taken from the human genome assembly build 36.2 of the National Center for Biotechnology Information (NCBI) database. All known SNPs and their corresponding chromosomal locations were obtained from the Chromosome Reports data for Taxonomic ID 9606 (i.e. humans) available from NCBI’s dbSNP. These data were downloaded for chromosomes 1 to 22 providing information on RefSNPs and chromosome coordinates. The second data source (seq_gene.md) was also downloaded from the NCBI’s Genome database giving information on Gene ID, gene names, and their start and end position on a chromosome. For the purpose of identification of SNPs in genes we mapped all the SNPs to genes defined by the start and end positions using database techniques. The resulting output file provided information on SNPs for chromosomes 1 to 22 and the genes in which they are placed. From the chromosome reports data, only reference sequence entries were used. The entries for ‘Celera’ sequence were ignored. In the seq_gene.md file also, only reference sequence entries for genes with Taxonomic ID of 9606 were used. The entries for ‘Celera’ sequence and entries of gene types such as ‘PSEUDO’, ‘CDS’, ‘RNA’ and ‘UTR’ were also ignored from this file.
Where a SNP lay within the boundaries of more than one gene, that SNP was arbitrarily assigned to the gene which included the lowest base number on the chromosome assembly. Arbitrary decisions of this nature and errors in SNP/gene assignment present in the databases have no biological validity, and therefore it is anticipated they will generally generate noise and reduce the power. In total, we retained 145,344 (38.5% of the total) SNPs which annotated 13,098 unique genes (1 - 799 SNPs per gene).
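The assignment rule can be illustrated with a short sketch. This is not the database procedure actually used; the `Gene` record and the function name are hypothetical, and "the gene which included the lowest base number" is taken to mean the overlapping gene with the smallest start coordinate.

```python
from typing import List, NamedTuple, Optional


class Gene(NamedTuple):
    """Hypothetical gene record: start of first exon to end of last exon."""
    gene_id: str
    chrom: str
    start: int
    end: int


def assign_snp(chrom: str, pos: int, genes: List[Gene]) -> Optional[str]:
    """Return the gene a SNP is assigned to, or None if it is extra-genic."""
    hits = [g for g in genes if g.chrom == chrom and g.start <= pos <= g.end]
    if not hits:
        return None
    # Overlapping genes: arbitrarily keep the one with the lowest start position
    return min(hits, key=lambda g: g.start).gene_id
```

A linear scan is used for clarity; for genome-wide data an interval index or a sorted sweep would be the more practical choice.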
Statistics
We used the Armitage trend test (1df) to generate SNP association p-values. In the bipolar and schizophrenia datasets, the association test statistics for the markers we used here are respectively inflated as estimated by the genomic control22 metric λ by 1.11 and 1.08. In neither dataset does this appear attributable to population structure since analyses conditional on the principal components derived from multidimensional scaling had negligible impact3,20. Nevertheless, to address our aims conservatively, all SNP p-values in the true datasets were corrected for λ. The corrected p-values were then in turn used to generate three types of gene-wide tests. The first was based on the smallest SNP p-value per gene, and the second and third were respectively the threshold truncated23 product of all SNP p-values within a gene where the truncation thresholds were p≤0.01 and p≤0.001. Product analyses were restricted to those genes which had more than 1 SNP.
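As an illustration of these statistics, the sketch below applies the usual genomic control adjustment (dividing a 1 df chi-square statistic by λ) and computes the two gene-wide summaries. Function names are hypothetical, p-values are assumed to be strictly positive, and the truncated product is returned on the log scale to avoid numerical underflow in genes with many SNPs.

```python
import math
from typing import Sequence


def gc_corrected_p(chi2_1df: float, lam: float) -> float:
    """p-value of a 1 df chi-square statistic after division by lambda."""
    # For 1 df, P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt((chi2_1df / lam) / 2.0))


def smallest_p(pvals: Sequence[float]) -> float:
    """Gene-wide statistic 1: the smallest SNP p-value in the gene."""
    return min(pvals)


def log_truncated_product(pvals: Sequence[float], tau: float) -> float:
    """Gene-wide statistic 2/3: log of the product of all p-values <= tau.

    Returns 0.0 (a product of 1) when no p-value passes the truncation threshold.
    """
    return sum(math.log(p) for p in pvals if p <= tau)
```

Neither statistic is itself a p-value; both are compared with their permutation distributions to obtain gene-wide significance.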
To calculate empirical gene-wide significance for each gene, we performed 1000 genome-wide permutations for each GWAS data set. For each gene in each permutation we obtained the smallest p and the product of p values as for the original dataset. We then calculated the three empirical p-values for each gene in the observed data by determining the proportion of permuted datasets where the corresponding p-value obtained for each gene was equal to or smaller than was observed in the true dataset.
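The empirical p-value described here is a simple proportion. A minimal sketch, with a hypothetical function name, is given below; the same function applies to the smallest-p statistic and to the truncated product, since in both cases smaller values are more extreme.

```python
from typing import Sequence


def empirical_p(observed: float, permuted: Sequence[float]) -> float:
    """Proportion of permuted statistics equal to or smaller than the observed one."""
    return sum(1 for v in permuted if v <= observed) / len(permuted)
```

With 1000 permutations the smallest non-zero value this can return is 0.001, which is the resolution limit of the empirical approach.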
Obvious disadvantages to a gene-wide approach are that we do not know the boundaries of genes, that the presence of linkage disequilibrium (LD) means that association to a physical location may not point to the particular co-terminous functional element, and many important signals will be contained within functional elements that do not correspond to genes. Nevertheless, it seems a reasonable first assumption that SNPs assigned to known functional elements (genes) would have a higher probability of being associated with disease than the remaining SNPs, even though some of the latter will span unknown functional elements. In order to test this, we counted the total number of SNPs designated by us as within genes surpassing nominal thresholds of p≤0.05, p≤0.01, and p≤0.001 in the observed dataset and also the total number of those not designated by us as within genes. We then compared the observed numbers with the null distributions for each as determined from the permutation datasets.
We calculated the significance of the excess number of genes attaining the specified thresholds in two main ways. The first was empirical; the second was based upon the assumption that, under the null hypothesis of no association, the number of significant genes in a scan is a normally distributed random variable whose mean and standard deviation can be obtained from the permutations. Given the computational restrictions imposed by genome-wide permutation, we could only perform 1000 permutations and therefore, based on the first method, significance can be reported only down to p=10−3. The null distribution of the number of significant genes was obtained by bootstrapping, as follows. From 1000 permutations, we chose a replicate at random as a ‘true study’. We then calculated the significance for each gene, and therefore the total number of significant genes, by comparing the ‘true study’ with 999 replicates obtained by sampling at random (with replacement) from the remaining 999 permutations. This process was repeated 1000 times. The empirical test of the number of significant genes being higher than expected under the null hypothesis of no association was carried out by comparing the observed number of significant genes to the empirical distribution. The calculated level of significance was based upon the distributions of the number of significant genes under the null hypothesis which formally passed the test for normality (in all instances both the skewness and kurtosis coefficients were between −0.5 and 0.5).
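The calculated (normal approximation) significance can be sketched as follows. The function name is hypothetical; the inputs are the observed count of significant genes and the mean and standard deviation of that count under the null.

```python
import math


def upper_tail_normal_p(observed: float, null_mean: float, null_sd: float) -> float:
    """One-tailed p-value for an excess, assuming a normal null distribution."""
    z = (observed - null_mean) / null_sd
    # Upper-tail probability of a standard normal: 1 - Phi(z)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Schizophrenia, smallest p-value per gene, alpha = 0.05 (values from Table 2)
p = upper_tail_normal_p(703, 652.4, 31.7)
```

For this row the result is roughly 0.055, close to the calculated value of 0.056 given in Table 2; small differences are to be expected because the tabulated mean and SD are rounded.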
The bipolar and schizophrenia datasets share the same controls, and therefore the test statistics are correlated. This means that we cannot calculate whether genes are associated to both disorders at a rate greater than chance simply based upon assumptions of independence, for example the hyper-geometric distribution. Instead we used permutations. We pooled together the three groups of individuals (bipolar and schizophrenia cases, and controls) and randomly assigned diagnostic category keeping the numbers of individuals in each group equal to that in the observed data. We assessed genes by comparing one randomly selected permutation (of 1000) with the 999 permutations randomly drawn (with replacement) from the pool of remaining permutations. From this we generated lists of the top associated genes that were equal in length to the corresponding list of genes for the observed datasets and then recorded the number of overlapping genes. This process was repeated 1000 times. We then calculated empirical p-values by counting how many times the simulated number of overlapping genes was greater than, or equal to, the observed number of overlapping genes in the true datasets.
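The two permutation steps in this procedure can be sketched as below. Function names are hypothetical, and the gene-wide analysis that sits between relabelling and counting overlaps is omitted. Shuffling the pooled label vector keeps the number of individuals in each diagnostic group fixed, as the text requires.

```python
import random
from typing import List, Sequence


def permute_labels(labels: Sequence[str], rng: random.Random) -> List[str]:
    """Randomly reassign diagnostic labels while preserving group sizes."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def overlap_empirical_p(observed_overlap: int, simulated_overlaps: Sequence[int]) -> float:
    """Proportion of simulated overlaps greater than or equal to the observed overlap."""
    hits = sum(1 for s in simulated_overlaps if s >= observed_overlap)
    return hits / len(simulated_overlaps)
```

Because cases from both disorders and the shared controls are relabelled together, the correlation induced by the common control group is present in every simulated pair of scans.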
Results
In table 1, we present the numbers of SNPs in the observed data surpassing various nominal thresholds for α within and external to genes, the mean and the standard deviations of the same from the permuted data, and the empirical p values (2-tailed) for the null distribution in the observed dataset. For schizophrenia, there is a highly significant excess of SNPs within genes at the two more stringent thresholds whereas this was the case for bipolar disorder only at the intermediate threshold (P<0.01). For neither disorder did we observe a significant excess of associated SNPs located beyond the boundaries of genes. These results suggest that focussing on genes enriches for detectable association signals (in samples of the size we have utilized) providing support for a gene-centric approach.
Table 1. Comparison of SNPs by genic location in schizophrenia (SZ) and bipolar (BD) datasets.
Dataset | α | SNPs in genes: observed | SNPs in genes: permuted mean | SNPs in genes: permuted SD | SNPs in genes: p | SNPs outside genes: observed | SNPs outside genes: permuted mean | SNPs outside genes: permuted SD | SNPs outside genes: p |
---|---|---|---|---|---|---|---|---|---|
SZ | 0.05 | 7480 | 7243.4 | 180.8 | 0.098 | 11534 | 11594.0 | 244.4 | 0.604 |
SZ | 0.01 | 1688 | 1447.7 | 75.5 | <0.001 | 2381 | 2313.9 | 101.5 | 0.256 |
SZ | 0.001 | 222 | 147.5 | 22.9 | 0.002 | 266 | 234.7 | 29.2 | 0.143 |
BD | 0.05 | 7647 | 7261.18 | 190.73 | 0.19 | 11863 | 11602.26 | 246.86 | 0.135 |
BD | 0.01 | 1722 | 1451.03 | 78.47 | <0.001 | 2354 | 2314.51 | 104.39 | 0.352 |
BD | 0.001 | 188 | 144.01 | 22.94 | 0.055 | 227 | 231.60 | 29.19 | 0.548 |
The results of the gene-wide analyses for schizophrenia and bipolar disorder are shown in tables 2 and 3. We provide the observed numbers of genes surpassing various thresholds of α in the observed data. The expectations under the Null hypothesis (based upon permutation) are described by the mean and the SD. For the product of p, the thresholds at which the data were truncated are given and the resultant product p-value significance threshold in the α column. In the final two columns, we present p-values (empirical and calculated) for the observed data under the 1-tailed null hypothesis of no excess of genes surpassing the thresholds. We use 1-tailed tests since the observation of fewer significant genes than chance has no biological meaning. In each disorder, there is a significant, or highly significant, excess in the observed number of genes surpassing virtually all thresholds regardless of the method for calculating gene-wide significance. The data from these tables are depicted graphically in Figure S1.
Table 2. Over-representation of significant genes in schizophrenia.
Method of gene assessment | α | Genes surpassing α: permuted mean | Genes surpassing α: permuted SD | Genes surpassing α: observed | Empirical p (1-tailed) | Calculated p (1-tailed) |
---|---|---|---|---|---|---|
Smallest p-value per gene | 0.05 | 652.4 | 31.7 | 703 | 0.059 | 0.056 |
Smallest p-value per gene | 0.01 | 135.1 | 13.3 | 175 | 0.002 | 0.001 |
Smallest p-value per gene | 0.001 | 20.2 | 4.8 | 41 | <0.001 | 1.4×10−6 |
Product of p truncation ≤ 0.01 | 0.05 | 362.4 | 20.6 | 425 | <0.001 | 0.0012 |
Product of p truncation ≤ 0.01 | 0.01 | 106.1 | 11.1 | 140 | 0.002 | 0.0011 |
Product of p truncation ≤ 0.01 | 0.001 | 15.6 | 4.2 | 39 | <0.001 | 1.8×10−8 |
Product of p truncation ≤ 0.001 | 0.05 | 77.9 | 9.2 | 98 | 0.023 | 0.014 |
Product of p truncation ≤ 0.001 | 0.01 | 50.4 | 7.3 | 75 | 0.001 | 0.0004 |
Product of p truncation ≤ 0.001 | 0.001 | 13.9 | 4.1 | 30 | 0.001 | 3.9×10−5 |
Table 3. Over-representation of significant genes in bipolar disorder.
Method of gene assessment | α | Genes surpassing α: permuted mean | Genes surpassing α: permuted SD | Genes surpassing α: observed | Empirical p (1-tailed) | Calculated p (1-tailed) |
---|---|---|---|---|---|---|
Smallest p-value per gene | 0.05 | 660.2 | 30.8 | 790 | <0.001 | 1.3×10−5 |
Smallest p-value per gene | 0.01 | 136.8 | 13.2 | 197 | <0.001 | 2.4×10−6 |
Smallest p-value per gene | 0.001 | 20.5 | 5.1 | 41 | <0.001 | 2.9×10−5 |
Product of p truncation ≤ 0.01 | 0.05 | 365.2 | 21.2 | 445 | <0.001 | 8.6×10−5 |
Product of p truncation ≤ 0.01 | 0.01 | 106.8 | 11.8 | 159 | <0.001 | 4.5×10−6 |
Product of p truncation ≤ 0.01 | 0.001 | 16.1 | 4.2 | 30 | 0.005 | 0.0005 |
Product of p truncation ≤ 0.001 | 0.05 | 76.8 | 9.7 | 94 | 0.039 | 0.038 |
Product of p truncation ≤ 0.001 | 0.01 | 50.0 | 8.0 | 63 | 0.064 | 0.053 |
Product of p truncation ≤ 0.001 | 0.001 | 14.1 | 4.3 | 24 | 0.018 | 0.010 |
In Table 4, we present the results of our analysis of genes that simultaneously exceed various thresholds of α in the schizophrenia and bipolar datasets. The first two data rows labelled schizophrenia (SZ) and bipolar (BD) give the numbers of genes observed in the respective GWAS datasets at thresholds of α depicted at the top of each row (extracted from the previous two tables). The third row gives the number of genes common to both datasets at these thresholds. The bottom section of the table summarizes the null expectations based on the permutation tests. The distributions of the number of significant genes common to schizophrenia and bipolar disorder in the permuted data are not normal and therefore to characterise the distribution, we present the median, minimum, and maximum of the numbers of overlapping genes in the permuted data. We also present empirical p-values for observing the true dataset under the null. P-values are two-tailed since observing fewer genes across disorders than expected by chance would have a biological meaning as risk for one disorder could in principle reduce risk for the other.
Table 4. Number of significant genes observed to overlap in schizophrenia (SZ) and bipolar (BD) datasets.
 | Smallest p-value per gene, α=0.05 | Smallest p-value per gene, α=0.01 | Smallest p-value per gene, α=0.001 | Product of p truncation ≤ 0.01, α=0.05 | Product of p truncation ≤ 0.01, α=0.01 | Product of p truncation ≤ 0.01, α=0.001 | Product of p truncation ≤ 0.001, α=0.05 | Product of p truncation ≤ 0.001, α=0.01 | Product of p truncation ≤ 0.001, α=0.001 |
---|---|---|---|---|---|---|---|---|---|
SZ | 703 | 175 | 41 | 425 | 140 | 39 | 98 | 75 | 30 |
BD | 790 | 197 | 41 | 445 | 159 | 30 | 94 | 63 | 24 |
Overlap | 85 | 14 | 1 | 37 | 11 | 2 | 5 | 2 | 0 |
Number of overlapping genes (1000 permutations with shared controls) | | | | | | | | | |
Median | 60 | 5 | 0 | 33 | 4 | 0 | 2 | 1 | 0 |
Min | 27 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 |
Max | 93 | 16 | 5 | 60 | 18 | 5 | 8 | 7 | 3 |
Empirical p-value* | 0.014 | 0.007 | 0.402 | 0.305 | 0.017 | 0.09 | 0.063 | 0.415 | NA |
* Comparing overlap in observed and simulated data.
At the most stringent thresholds (p<0.001), none of the tests was significant, although there were a total of three genes in common to both disorders compared with a median of none in the permuted data. At other thresholds, there was a significant excess or a trend to excess of genes in common, with strongest support coming from the smallest p method.
For each of the schizophrenia and bipolar datasets, in tables S1 and S2, we present information about the identity of genes surpassing any of the thresholds corresponding to (best or product) p<0.001. For this purpose, we took genes which met any of the thresholds corresponding to p≤0.01 in tables 2 and 3 and undertook 100,000 permutations. This is because specific genes assigned p>0.01 should be fairly accurately specified by 1000 permutations whereas those where the gene-wide p<0.01 may not be. We chose to present the identity of genes surpassing the thresholds p<0.001 as at this threshold, the ratio of the number of observed genes to that expected under the null is substantially greater than 1, suggesting that any one gene is more likely to represent a true than a false observation. It is nevertheless important to note that the evidence for any one gene is not compelling and the specific findings require confirmation in additional samples. Also, the close proximity of a number of the genes suggests that there are instances where more than one gene derives from the same association signal (note this phenomenon will inflate the estimated numbers of signals in both observed and permuted datasets and therefore does not impact on the main findings). Table S3 shows those genes which in each of schizophrenia and bipolar disorder surpassed thresholds of smallest p<0.01 or where for either truncation threshold, the product of p was ≤0.01 as these thresholds also correspond to a considerable degree of enrichment in the observed compared with the simulated data.
Discussion
GWAS analyses have identified susceptibility loci for many complex diseases, but the majority of risk for any disorder remains un-attributed. Obtaining sufficient samples to extract a high proportion of that component of disease risk attributable to common variants is unlikely to be realized in the near future, and this is especially true for psychiatric disorders where sample collection is particularly challenging. There is therefore scope for complementary approaches to GWAS datasets.
In the present study, we sought to apply one such approach based upon gene-wide analysis which offers a number of possible advantages over single locus tests5. First, if there is more than one independent source of an association signal within a gene, for example where there is more than one functional variant, combining these into a single statistic might offer enhanced power over single SNP analysis. This is the main rationale underpinning our use of the truncated product approach23. Second, where there are true differences in the associated SNPs between studies, as might occur as a result of allelic heterogeneity, LD heterogeneity, or where this occurs simply as a result of the sometimes unpredictable nature of LD7,8, consistency of associations across studies may be easier to achieve at the gene-wide than at the SNP level. While such association may not provide a firm basis for implicating specific susceptibility variants, the identification of replicated associations can be of value in identifying candidate genes for further intensive genetic investigation and also for generating hypotheses concerning aetiology and pathophysiology. Both of the methods we have applied here, best corrected p-value and truncated product, confer this potential advantage. They are also both applicable not just to genes but also to other definitions of functional units, for example groups of genes with related functions.
One caveat is that a gene-wide approach requires at least some of the true association signals to be located within the (arbitrarily) derived boundaries of genes. Our analysis showing an excess in observed association signals within genes but not in sequences beyond gene boundaries supports the view that such a gene-centric enrichment does occur, thereby providing a rationale for targeting the immediate vicinity of genes for analysis. This may be intuitively obvious by analogy with simpler genetic disorders, but with respect to complex diseases, there has been little empirical support for such a strategy. Analogous analyses in multiple phenotypes with genes defined incrementally by adding additional flanking sequence might better define optimal (on average) locations for such endeavours.
Concerning the primary hypothesis, for each disorder, we obtained significant or highly significant evidence for an excess of associated genes at most thresholds. This supports the general validity of the gene-wide approach for detecting true association signals. However, it does not suggest that gene-wide tests make single locus tests redundant; rather that they are useful additional approaches. By way of illustration, we plot the rank order achieved by genes in the schizophrenia dataset (the figures for bipolar are similar) based upon uncorrected single SNP tests against that achieved with each of the gene-wide tests (Figure 1). As expected, the rank order of genes based upon the most significant uncorrected SNP correlates (Spearman r=0.8) with that obtained after correction for the number of independent SNPs in that gene (Figure 1a), but the correlation is much less (r=0.37) between the rankings achieved by the single SNP test and that obtained by the truncated-product approach (Figure 1b). Thus, the approaches are complementary in highlighting the likely involvement of specific genes although the relative merits of each in doing so necessarily awaits the reporting of many more confirmed associations.
That we found a significant excess of associated genes at most thresholds provides molecular genetic evidence that has been lacking for the existence of substantial numbers of detectable common alleles of small effect in each of the disorders we tested. Given an assumption of relatively low power to detect genes (which is difficult to estimate given the variation in numbers of SNPs/gene and the number of independent signals/gene, each of different effect size and allele frequency) the excess of associated genes we have observed is likely a small proportion of the total. These data therefore support an important polygenic contribution to liability to both schizophrenia and bipolar disorder. Such a contribution to schizophrenia has previously been suggested by attempts to model the expected prevalence of the disorder among relatives of probands based upon various modes of transmission24,25. We also note our data are incompatible with the hypotheses recently advanced elsewhere that there is only a single schizophrenia risk gene or indeed that all risk is epigenetic26. The existence of multiple genes of small effect suggests that the collection and application of larger samples should deliver many more associations, with the caveat that casting widely for such samples is done in such a way as to avoid the negating effect on power of increasing the (as yet unknown) degree of genetic heterogeneity within diagnostic categories27.
Our secondary objectives were to use the data to identify a number of putative susceptibility genes and also to determine if there is overlap between schizophrenia and bipolar susceptibility. Taking the latter first, although the findings are less robust than the main results, at several thresholds, we observed an excess of associated genes common to schizophrenia and bipolar disorder. These findings support the specific hypothesis that some genes influence risk beyond traditional diagnostic boundaries. That only a small proportion of genes overlapped should not be taken as indicative that this provides an estimate of the extent of shared risk; the small effect sizes inevitably lead to low power for one sample to ‘replicate’ findings observed in the other. Similar considerations may explain why the excess in overlap was observed at the weaker thresholds of significance; any one gene detected in one study is unlikely to be replicated at similar levels of significance in another2,6. Much larger studies will be required to explore and characterize the extent of the overlap.
With respect to the identity of specific genes, there are important caveats that have to be borne in mind. First the strong statistical evidence in this study is for an excess representation of genes rather than for any individual gene per se. Second, association formally implicates regions, not genes, and in some cases, a single ‘true’ signal may, by LD, result in multiple genes being implicated. However, given the ratio of observed associated genes surpassing the thresholds to those expected, many of those doing so and reported in Supplementary Tables S1 and S2 are likely to represent true associations. A third caveat concerns the joint liability between schizophrenia and bipolar disorder. Our simulation procedures explicitly allow for the use of the same set of controls in the schizophrenia and bipolar GWAS analyses, and therefore our conclusion regarding an excess number of genes showing evidence for association across the two datasets is valid. However, the impact of using shared controls on the association evidence for any specific gene cannot be assessed by those procedures. Additional caution is therefore required with respect to the identities of the specific genes that appear to operate across both disorders (table S3).
It is worth pointing out that the failure of a gene to appear in any of the association lists should also be viewed with caution. As is the case for single SNP analysis, power and gene coverage mean that in a moderate sized GWAS study, however it is analysed, failure to find evidence for association is not strong evidence for absence of involvement of that gene27. In that context, we simply note that the present analyses provide no support for dysbindin (DTNBP1), neuregulin1 (NRG1), D-amino acid oxidase activator (DAOA), disrupted in schizophrenia 1 (DISC1), or brain derived neurotrophic factor (BDNF), genes which prior to the advent of GWAS studies were among the most prominent candidates for either disorder3.
We limit additional comment to genes of particular interest based upon existing data. In the gene-wide analysis of schizophrenia, as in the single SNP analysis of the same dataset20, we identified NOS1, RPGRIP1L, OPCML, TMEM108, and SIL1 as genes of potential interest. In bipolar disorder, we identified DPP10, RNPEPL1, CMTM8, DFNB31, LAMP3, TDRD9, PALB2, CDC25B, and CAPN6, all of which had at least one polymorphism within the reference sequence that was associated at p<10−5 (3). Other than to note that the use of gene-wide tests flagged up many of the putative hits in the original studies, we do not discuss those genes any further since they have been considered in the original manuscripts.
From the perspective of schizophrenia, there are no published genes that meet criteria for genome-wide significant association. Indeed, other than the data previously reported from the SNP based analysis of the present dataset20, there is only one finding from the GWAS literature that meets the criterion for strong evidence for association (p<5×10−7) used by the WTCCC3. Although that criterion was suggested on the basis of large samples, it is nevertheless interesting that CSF2RA, encoding colony stimulating factor 2 receptor, alpha subunit, was reported as a potential susceptibility gene (p=3×10−7) based upon the first (small) GWAS of schizophrenia28. Among the genes we observed of interest in bipolar disorder was CSF2RB (table S2; gene-wide pmin =3.5×10−4). CSF2RB is one of only two CSF2R genes and encodes the beta-subunit with which the other, CSF2RA, forms a functional heterodimer. CSF2RB itself has been associated with schizophrenia in a case-control and a family based association sample29 and, in addition to its role in forming the CSF2 receptor, it is also a subunit for the interleukin receptors 3 and 5. The present finding therefore clearly warrants attention in additional samples and suggests the possibility that the neuroimmunological hypothesis that has been advanced for schizophrenia might also be relevant to bipolar disorder30.
In addition to the data previously reported from the SNP based analysis of the present bipolar dataset, three genes have been reported that meet criteria for genome-wide significance in bipolar disorder: ANK3 (ankyrin 3, node of Ranvier (ankyrin G)), CACNA1C (calcium channel, voltage-dependent, L type, alpha 1C subunit)19 and DGKH (diacylglycerol kinase, eta)18. In the present study, we found support for one of these as well as for family members of two of them. CACNA1C was not identified as a gene of particular interest when the WTCCC bipolar SNP data were examined3, although those data did subsequently contribute to the meta-analysis19. In the present study, at the level of the gene, the best corrected SNP p value in the bipolar analysis was unimpressive (p=0.037) whereas the product-method successfully identified this gene as a potential candidate (table S2; gene-wide pmin =7×10−4). Based upon a meta-analysis p value of 7×10−8 (19), CACNA1C can now be considered very likely to be a true positive. Interestingly, mutations in this gene are already known to cause Timothy Syndrome, a disorder whose features include autistic traits (see19). Thus, this gene, as well as the other genes whose products encode α1 subunits (which form the ion pore) of voltage-gated calcium channels, are candidate genes for neuropsychiatric disorders. It is of considerable interest that CACNA1B, encoding the subunit typical of the calcium channels which control neurotransmitter release from neurons and one of only 8 CACNA1 family members represented in our analysis, showed evidence for association (truncated at 0.001, p=0.002) in schizophrenia. This is at a level that just missed our threshold for inclusion in table S1, but is still in the range that is considerably enriched for observed signals compared with the null (p=0.0004). Thus, the calcium channelopathies postulated in bipolar disorder may also operate in schizophrenia.
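The contrast between the smallest-p and product-based results for CACNA1C can be illustrated with a minimal sketch. It is not the procedure used in this study: the p-values are hypothetical, the function names and the truncation point of 0.05 are assumptions made for illustration, and SNPs are treated as independent, whereas a real analysis must allow for LD between SNPs within a gene. The sketch compares a Šidák-corrected smallest p-value with a Monte Carlo significance for the truncated product statistic23.

```python
import numpy as np


def sidak_min_p(pvals):
    """Smallest p-value in the gene, Sidak-corrected for the number of SNPs."""
    p = np.asarray(pvals, dtype=float)
    return 1.0 - (1.0 - p.min()) ** p.size


def truncated_product_stat(pvals, tau):
    """Log of the product of all p-values <= tau (0.0 if none qualify)."""
    p = np.asarray(pvals, dtype=float)
    return float(np.log(p[p <= tau]).sum())


def truncated_product_p(pvals, tau=0.05, n_sim=20000, seed=0):
    """Monte Carlo p-value for the truncated product statistic.

    ASSUMPTION: SNPs are independent, so null p-values are drawn as
    independent uniforms. This ignores LD and is for illustration only.
    """
    p = np.asarray(pvals, dtype=float)
    observed = truncated_product_stat(p, tau)
    rng = np.random.default_rng(seed)
    # 1 - U lies in (0, 1], which avoids taking log(0).
    null_p = 1.0 - rng.uniform(size=(n_sim, p.size))
    null_stats = np.where(null_p <= tau, np.log(null_p), 0.0).sum(axis=1)
    # A more negative log-product is more extreme.
    n_extreme = np.count_nonzero(null_stats <= observed)
    return (n_extreme + 1) / (n_sim + 1)


# Hypothetical gene: 20 SNPs, five modest signals, no single striking one.
gene_pvals = [0.004, 0.006, 0.008, 0.010, 0.012] + [0.3, 0.5, 0.7, 0.9, 0.6] * 3

min_p = sidak_min_p(gene_pvals)          # about 0.077: unimpressive
tpm_p = truncated_product_p(gene_pvals)  # much smaller than min_p
print(f"corrected min p = {min_p:.3f}, truncated product p = {tpm_p:.5f}")
```

In this toy gene no single SNP survives correction, but the combination of several modest signals is unlikely under the null, which is the pattern that a product-based test is designed to detect.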
DGKH was previously implicated in bipolar disorder at a genome-wide significance level, p=1.5×10−8 (18). Diacylglycerol kinases are central to an enormous range of signal transduction pathways of potential relevance to neuropsychiatric disorders18. While we did not observe evidence for DGKH, DGKI, encoding diacylglycerol kinase iota and one of only 8 family members represented in this analysis, gave gene-wide evidence for association at a level surpassing the p=0.001 threshold in schizophrenia (table S1; gene-wide pmin=6.7×10−4). Our identification of multiple genes closely related to the handful of genes that have been reported to date by GWAS of psychosis points further to the utility of gene-wide analysis of GWAS datasets.
In summary, we present a gene-wide analysis of two GWAS datasets, one of schizophrenia, one of bipolar disorder. We show that SNPs within genes are enriched for association signals. We show that the datasets contain substantially more gene-wide signals that surpass nominal significance thresholds than expected by chance, and also, but less robustly, that there is an overlap in risk between schizophrenia and bipolar disorder. Genes surpassing thresholds at which there is a considerable enrichment for observed signals are likely to be highly enriched for true associations, and such genes and their family members may form the basis for gene-wide replication studies.
Supplementary Material
Acknowledgements
This study was supported by the Medical Research Council (UK) and by the Wellcome Trust. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for that project was provided by the Wellcome Trust under award 076113.
Footnotes
Web Resources
NCBI database (http://www.ncbi.nlm.nih.gov/)
NCBI’s dbSNP database (http://www.ncbi.nlm.nih.gov/SNP/)
NCBI’s Genome database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome)
References
- 1. Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the genetics of common disease. Journal of Clinical Investigation. 2008;118(5):1590–1605. doi: 10.1172/JCI34772.
- 2. Zeggini E, Scott LJ, Saxena R, Voight BF, Marchini JL, Hu T, et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nature Genetics. 2008;40(5):638–645. doi: 10.1038/ng.120.
- 3. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–678. doi: 10.1038/nature05911.
- 4. Mathew CG. New links to the pathogenesis of Crohn disease provided by genome-wide association scans. Nature Reviews Genetics. 2008;9:9–14. doi: 10.1038/nrg2203.
- 5. Neale BM, Sham PC. The future of association studies: gene-based analysis and replication. American Journal of Human Genetics. 2004;75(3):353–362. doi: 10.1086/423901.
- 6. Ioannidis JP. Non-replication and inconsistency in the genome-wide association setting. Human Heredity. 2007;64(4):203–213. doi: 10.1159/000103512.
- 7. Terwilliger JD, Hiekkalinna T. An utter refutation of the Fundamental Theorem of the HapMap. European Journal of Human Genetics. 2006;14(4):426–437. doi: 10.1038/sj.ejhg.5201583.
- 8. Moskvina V, O’Donovan MC. Detailed analysis of the relative power of direct and indirect association studies and the implications for their interpretation. Human Heredity. 2007;64(1):63–73. doi: 10.1159/000101424.
- 9. Risch N. Linkage strategies for genetically complex traits I: Multilocus models. American Journal of Human Genetics. 1990;46(2):222–228.
- 10. Craddock N, Khodel V, Van Eerdewegh P, Reich T. Mathematical limits of multilocus models: the genetic transmission of bipolar disorder. American Journal of Human Genetics. 1995;57(3):690–702.
- 11. Lewis CM, Levinson DF, Wise LH, DeLisi LE, Straub RE, Hovatta I, et al. Genome scan meta-analysis of schizophrenia and bipolar disorder, part II: Schizophrenia. American Journal of Human Genetics. 2003;73(1):34–48. doi: 10.1086/376549.
- 12. Segurado R, Detera-Wadleigh SD, Levinson DF, Lewis CM, Gill M, Nurnberger JI, Jr, et al. Genome scan meta-analysis of schizophrenia and bipolar disorder, part III: Bipolar disorder. American Journal of Human Genetics. 2003;73(1):49–62. doi: 10.1086/376547.
- 13. Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper GM, et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science. 2008;320(5875):539–543. doi: 10.1126/science.1155174.
- 14. The International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature. 2008;455(7210):237–241. doi: 10.1038/nature07239.
- 15. Stefansson H, Rujescu D, Cichon S, Pietiläinen OP, Ingason A, Steinberg S, et al. Large recurrent microdeletions associated with schizophrenia. Nature. 2008;455(7210):232–236. doi: 10.1038/nature07229.
- 16. Xu B, Roos JL, Levy S, van Rensburg EJ, Gogos JA, Karayiorgou M. Strong association of de novo copy number mutations with sporadic schizophrenia. Nature Genetics. 2008;40(7):880–885. doi: 10.1038/ng.162.
- 17. Dudbridge F, Gusnanto A. Estimation of significance thresholds for genomewide association scans. Genetic Epidemiology. 2008;32(3):227–234. doi: 10.1002/gepi.20297.
- 18. Baum AE, Akula N, Cabanero M, Cardona I, Corona W, Klemens B, et al. A genome-wide association study implicates diacylglycerol kinase eta (DGKH) and several other genes in the etiology of bipolar disorder. Molecular Psychiatry. 2008;13(2):197–207. doi: 10.1038/sj.mp.4002012.
- 19. Ferreira MAR, O’Donovan MC, Meng YA, Jones IR, Ruderfer DM, Jones L, et al. Collaborative genome-wide association analysis of 10,596 individuals supports a role for Ankyrin-G (ANK3) and the alpha-1C subunit of the L-type voltage gated calcium channel (CACNA1C) in bipolar disorder. Nature Genetics. 2008;40:1056–1058. doi: 10.1038/ng.209.
- 20. O’Donovan MC, Craddock N, Norton N, Williams H, Peirce T, Moskvina V, et al. Identification of loci associated with schizophrenia by genome-wide association and follow-up. Nature Genetics. 2008;40:1053–1055. doi: 10.1038/ng.201.
- 21. Craddock N, O’Donovan MC, Owen MJ. The genetics of schizophrenia and bipolar disorder: dissecting psychosis. Journal of Medical Genetics. 2005;42(3):193–204. doi: 10.1136/jmg.2005.030718.
- 22. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55(4):997–1004. doi: 10.1111/j.0006-341x.1999.00997.x.
- 23. Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS. Truncated product method for combining P-values. Genetic Epidemiology. 2002;22:170–185. doi: 10.1002/gepi.0042.
- 24. Gottesman II, Shields J. A polygenic theory of schizophrenia. Proceedings of the National Academy of Sciences of the United States of America. 1967;58:199–205. doi: 10.1073/pnas.58.1.199.
- 25. O’Rourke DH, Gottesman II, Suarez BK, Rice J, Reich T. Refutation of the general single-locus model for the etiology of schizophrenia. American Journal of Human Genetics. 1982;34:630–649.
- 26. Crow TJ. The emperors of the schizophrenia polygene have no clothes. Psychological Medicine. 2008;38:1681–1685. doi: 10.1017/S0033291708003395.
- 27. Craddock N, O’Donovan MC, Owen MJ. Genome-wide association studies in psychiatry: lessons from early studies of non-psychiatric and psychiatric phenotypes. Molecular Psychiatry. 2008;13(7):649–653. doi: 10.1038/mp.2008.45.
- 28. Lencz T, Morgan TV, Athanasiou M, Dain B, Reed CR, Kane JM, et al. Converging evidence for a pseudoautosomal cytokine receptor gene locus in schizophrenia. Molecular Psychiatry. 2007;12(6):572–580. doi: 10.1038/sj.mp.4001983.
- 29. Chen Q, Wang X, O’Neill FA, Walsh D, Fanous A, Kendler KS, et al. Association study of CSF2RB with schizophrenia in Irish family and case-control samples. Molecular Psychiatry. 2008;13:930–938. doi: 10.1038/sj.mp.4002051.
- 30. Hanson DR, Gottesman II. Theories of schizophrenia: a genetic-inflammatory-vascular synthesis. BMC Medical Genetics. 2005;6:7. doi: 10.1186/1471-2350-6-7.