Abstract
Schizophrenia is a psychiatric disorder with large personal and social costs, and understanding the genetic etiology is important. Such knowledge can be obtained by testing the association between a disease phenotype and individual genetic markers; however, such single-marker methods have limited power to detect genetic markers with small effects. Instead, aggregating genetic markers based on biological information might increase the power to identify sets of genetic markers of etiological significance. Several set test methods have been proposed: Here we propose a new set test derived from genomic best linear unbiased prediction (GBLUP), the covariance association test (CVAT). We compared the performance of CVAT to other commonly used set tests. The comparison was conducted using a simulated study population having the same genetic parameters as for schizophrenia. We found that CVAT was among the top performers. When extending CVAT to utilize a mixture of SNP effects, we found an increase in power to detect the causal sets. Applying the methods to a Danish schizophrenia case–control data set, we found genomic evidence for association of schizophrenia with vitamin A metabolism and immunological responses, which previously have been implicated with schizophrenia based on experimental and observational studies.
Keywords: GBLUP, schizophrenia, complex disease, set test, association test
SCHIZOPHRENIA, a mental disorder affecting ∼1% of the population worldwide, is characterized by hallucinations, delusions, and cognitive deficits. From twin studies, the heritability of schizophrenia has been estimated to (Sullivan et al. 2003). Schizophrenia is a complex disorder caused by both genetic and environmental factors. High personal and social costs, combined with the desire for better diagnosis and treatment, have led to numerous studies trying to identify the predisposing genetic loci.
Genotyping of several thousand affected and nonaffected individuals has formed the basis for numerous genome-wide association studies (GWAS), and several genome-wide significant genetic markers have been identified (International Schizophrenia Consortium 2009; Schizophrenia Psychiatric Genome-Wide Association Study Consortium 2011; Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014). Evidence collected across studies points toward the susceptibility to schizophrenia being highly polygenic, with thousands of independent common variants with small effects likely contributing to the risk. Recently, 108 loci were identified in the largest GWAS to date (Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014). Around 23–32% of the variation on the liability scale to schizophrenia is captured by common single-nucleotide polymorphisms (SNPs) (International Schizophrenia Consortium 2009; Lee et al. 2012; Ripke et al. 2013). Also, it has been suggested that the genetic risk for schizophrenia is due to a combined effect of common and rare markers (Gratten et al. 2014; Kato 2015; Loh et al. 2015). It is therefore a challenge to pinpoint the genetic etiology of polygenic diseases, when the applied method mainly detects the fraction of genetic markers with the relatively largest effect (Hirschhorn and Daly 2005; McCarthy et al. 2008; Fridley and Biernacka 2011; Wang et al. 2011).
Realizing the limitations of traditional single-marker analysis has promoted the development of alternative, or complementary approaches for GWAS. Adapting ideas from gene expression microarrays (Goeman et al. 2004; Subramanian et al. 2005; Goeman and Bühlmann 2007; Newton et al. 2007), several set test methods have been proposed (Wang et al. 2007, 2010; Holmans et al. 2009; Network and Pathway Analysis Subgroup of the Psychiatric Genomics Consortium 2015). These methods rely on the collective action of multiple SNPs aggregated into defined sets. The aggregation of genetic markers is based on the physical position of the SNPs and their proximity to known genes. Aggregation based on biological networks arises from the knowledge that genes interact in complex networks and cellular pathways (Schadt 2009), and associated markers often are enriched in genes or regulatory regions of genes that are connected in networks and pathways (Allen et al. 2010).
Set test methods examine whether a summary statistic for a set of markers deviates from a summary statistic computed for a random set of markers. Summary statistics can be obtained in several ways (Fridley and Biernacka 2011; Mooney et al. 2014). Commonly, single-marker test statistics are computed for all markers. These measure the association between a marker and phenotype. Subsequently, for a set of genetic markers a summary statistic is computed and to test for enrichment the observed summary statistic is compared to an empirical distribution of summary statistics. The single-marker test statistics can be obtained in several ways, such as from single-marker associations, e.g., logistic regressions (i.e., or P-values), or from multimarker associations, such as the score test approaches (e.g., sequence kernel association test (SKAT)) (Goeman et al. 2004, 2006; Wu et al. 2011; Wang et al. 2013). Alternatively, test statistics could be derived from a “black-box” approach, namely genomic best linear unbiased prediction (GBLUP). This involves solving linear mixed models for fixed and random effects simultaneously and incorporating genomic relationships among individuals, the predicted genetic effects for the individuals are obtained. Here, we attempted to open this black-box modeling approach and propose a new summary statistic for associating sets of SNPs to schizophrenia. We call this new summary statistic the covariance association test (CVAT).
The aim of this study was twofold. First, it aimed to compare the performance of our proposed CVAT to other commonly used set test methods. This was achieved by simulating a population with a disease prevalence corresponding to that of schizophrenia, from which we sampled our study population. Second, it aimed to apply the methods to a previously published Danish schizophrenia case–control study (Børglum et al. 2014).
Methods
To derive CVAT and other summary statistics derived from GBLUP we begin the Methods section by describing GBLUP, and then we derive CVAT. Then we describe other related summary statistics and provide a statistical comparison between some of the methods. The Methods section ends with describing the schizophrenia data and how we simulated the case–control data.
GBLUP
The purpose of this section is to obtain individual genetic effects (), which later will be partitioned into feature-specific genetic effects () when used in the CVAT. The genetic effects are obtained by fitting the linear mixed model,
(1) |
where is a vector of disease status; i.e., 0 = control, 1 = case. and are design matrices linking fixed and random effects to respectively. Here, (I is an identity matrix), thus, can be omitted in the following expressions. is a vector of fixed effects, is a vector of random genetic effects, and is a vector of the residual effects. The genetic effects are assumed to be normally distributed and defined as where is the genomic relationship matrix computed based on all genetic markers: Here, is the number of markers and is a centered and scaled genotype matrix. Each column vector of is where is the minor allele frequency (MAF) of the ith marker, and is the ith column vector of the allele count matrix, containing the genotypes encoded as 0, 1, or 2, counting the number of minor alleles (VanRaden 2008).
Estimates of the fixed () and random effects () are obtained as the best linear unbiased estimator (BLUE) and the best linear unbiased predictor (BLUP), Equations 2 and 3, respectively:
(2) |
(3) |
In Equations 2 and 3, is the estimated phenotypic covariance, given by Single-marker effects can be computed as
(4) |
and the (co)variance of the single-marker effects as
(5) |
To test the null hypothesis of a t-test statistic for was computed as
(6) |
where is the estimate of the variance of the ith element of obtained from the ith element of the diagonal of the covariance matrix of the single-marker effects. Under the null hypothesis, i.e., follows a t distribution with residual degrees of freedom (Cule et al. 2011). is computed as which is equivalent to such that is the total number of observations, and represents the degrees of freedom occupied by the penalized fit (i.e., the linear mixed-model fit). is the hat matrix that transforms into and thus (Nelder et al. 1965).
Predicting the genetic effects in Equation 3 requires the estimated variance component for the genetic effect Here, the variance components are estimated using the Newton–Raphson algorithm. The heritability is defined as the proportion of total phenotypic variation that is due to additive genetic components (Lynch and Walsh 1998), Estimates of from SNP data will give the proportion of phenotypic variation captured by common SNPs; however, for case–control studies such estimates are biased, because the scale at which the phenotype is measured differs from the scale at which the heritability is expressed (Lee et al. 2011). Also, the proportion of cases is larger than the disease prevalence in the population; thus, the heritability will be biased by ascertainment (Lee et al. 2011). To transform the observed SNP heritability to the liability scale and to adjust for ascertainment bias we used a transformation proposed by Lee et al. (2011),
(7) |
such that is the disease prevalence in the population, is the proportion of cases in the sample, and is the height of the standard normal curve at the truncation point; i.e.,
Tests for association between genomic feature and phenotype
In the previous section we obtained a measure of association between single genetic markers and the phenotype (Equations 4 and 6). However, for highly polygenic traits, i.e., traits under the influence of many causal variants with small effects across the genome, detecting causal variants is extremely challenging. Therefore, we argue that methods that exploit the inherent characteristics of complex traits, i.e., many genetic markers with small effect sizes scattered around the genome in a nonrandom fashion, will have increased power to detect sets of markers enriched for causal variants. These methods are commonly known as set tests, and a variety of methods have been developed (e.g., Fridley and Biernacka 2011; Mooney et al. 2014).
Here we describe a new method for associating sets of SNPs to a phenotype, namely CVAT. The CVAT summary statistic measures the covariance between the joint genetic values for markers in the feature set and the genetic values for the remaining markers outside the feature set. A feature set is a collection of markers grouped according to known genomic or biological structures, such as genes, gene ontologies, pathways, etc.
Following the derivation of CVAT we describe three commonly used summary statistics and compare the different methods mathematically. We end this section by defining the null hypothesis and how this is tested.
CVAT:
The CVAT summary statistic ( Equation 8) captures the covariance between the total genetic effect from all markers and the genetic effect from the markers within the feature set
(8) |
The subscripts f and r signify whether the genetic effects are computed based on markers within the feature set or based on the remaining markers, respectively. contains the contribution from the genetic values of the markers within feature and the covariance between genetic values for the markers in the feature and outside the feature set. is derived from the explained sum of squares
(9) |
The first term in Equation 9 constitutes the part of that is explained by the genetic markers linked to the genomic feature of interest, the second term is the part explained by the markers not included in the feature set, and the last term is the covariance between the genetic values for the two marker sets. The first two terms will always take positive values, whereas the covariance term can be either negative or positive. Large values of indicate a genomic feature contributing strongly to Because of LD may contribute to It may be helpful to rewrite the total genetic effects (Equation 3). When we get
(10) |
Using and the two terms in (Equation 8) can be written as
(11) |
and are the observed covariance between the adjusted phenotype and the markers within the feature set and the markers not included in the feature set, respectively. is a correlation matrix quantifying the observed linkage disequilibrium (LD) between all markers. The matrix can be partitioned into submatrices accordingly to the genetic variants within or outside the genomic feature:
(12) |
The block diagonals, and harbor the correlation among markers in the genomic feature and among the markers not included in the feature set, respectively. The off-diagonal element, is the correlation between the markers in and outside the feature set.
The predicted genetic effects are assumed to be drawn from the same distribution, namely It is likely, though, that the genetic effects come from a mixture of distributions, e.g., are caused by differences in MAFs (Loh et al. 2015). Therefore, we extend CVAT, allowing the genetic effects to have different distributions caused by different underlying partitioning. This is obtained using our genomic feature BLUP (GFBLUP) in which an additional variance component is added to the traditional GBLUP (Edwards et al. 2016; Sarup et al. 2016):
(13) |
The notation is similar to Equation 1 except and are vectors of genetic values captured by genetic markers linked to different subsets. The random genetic and residual effects were assumed to be independent and distributed as and The genomic relationship matrices and are computed based on the subset of markers, and Here we illustrate this by partitioning using different MAF boundaries, e.g., < MAF 0.05, and > MAF 0.05 (see Figure 4 and Table 1 for the boundaries used for the simulated and observed data). The two genetic effects for that is, and are then based on the genetic marker effects now arising from two distributions:
Figure 4.
Comparing performance of CVAT when using a mixture of SNP effects. Shown is the performance of the five methods [covariance association test (CVAT); Score test; the sum of squared t-test statistics on marker effects; the sum of the squared marker effects; and the number of individual markers with a P-value below the threshold] when applied to a data set enriched for rare, causal variants. The subscripts MAF indicate that the summary statistic was computed based on GFBLUP (Equations 21 and 22), utilizing MAFs as a feature (MAF 0.05, 0.10, 0.15, and 0.20). Performance was measured by the F1 score across the dilution effect. The color grading indicates the proportion of genetic variance explained by the causal markers in
Table 1. Summary of associated genomic features for the six GBLUP-derived set tests.
Gene ontologies | CVAT | Score | CVATMAF05 | Description | |||
---|---|---|---|---|---|---|---|
GO:0050908 | 0 | 0 | 0.028 | 0.04 | 0.044 | 0 | Detection of light stimulus involved in visual perception |
GO:0008138 | 0 | 0 | 0 | 0 | 0.017 | 0 | Protein tyrosine/serine/threonine phosphatase activity |
GO:0021983 | 0 | 0 | 0 | 0.0089 | 0 | Pituitary gland development | |
GO:0001568 | 0.043 | 0.043 | 0.0089 | 0 | Blood vessel development | ||
GO:0006776 | 0.028 | 0.043 | 0 | 0 | Vitamin A metabolic process | ||
GO:0042572 | 0.028 | 0 | 0 | 0 | Retinol metabolic process | ||
GO:0042573 | 0.028 | 0 | 0 | 0 | Retinoic acid metabolic process | ||
GO:0042574 | 0 | 0 | 0 | 0 | Retinal metabolic process | ||
GO:0060324 | 0.043 | 0.04 | 0 | 0 | Face development | ||
GO:0043403 | 0.028 | 0 | 0.038 | 0 | Skeletal muscle tissue regeneration | ||
GO:0010628 | 0 | 0 | 0 | 0 | Positive regulation of gene expression | ||
Immunological signatures | CVAT | Score | CVATMAF05 | Description | |||
GSE17721 | 0 | 0 | 0 | 0 | 0 | Genes upregulated in comparison of dendritic cells (DC) stimulated with Pam3Csk4 (TLR1/2 agonist) at all time points vs. those stimulated with Pam3Csk4 (TLR1/2 agonist) at 24 hr only |
(14) |
Comparable summary statistics:
In the following we describe three commonly used summary statistics. The first type is based on counting the number of markers in the feature set that showed association with the phenotype:
(15) |
In Equation 15, is the number of markers in the feature set, is a test statistic for the ith marker (e.g., a P-value of SNP association), is an arbitrary threshold for the marker test statistic, and is an indicator function that takes the value one if the argument is satisfied, and otherwise it is zero. When defining the value of arbitrariness is introduced. Markers whose test statistic was slightly higher than would not be captured in such a model and therefore not be considered. Such tests have high power to detect association if the feature set harbors genetic markers with large effects, but it will not detect sets with many markers with small to moderate effect sizes (Newton et al. 2007). Therefore, it might be more powerful to use a summary statistic that considers all markers within the feature set. The second summary statistic was based on the sum of the squared test statistics for all markers belonging to a genomic feature:
(16) |
In Equation 16, is the test statistic for the ith marker and is the number of markers in the feature set. We used the squared test statistic, because some types of test statistics can take both positive and negative values.
The third approach is a score test statistic, which is derived from the first derivative of the likelihood as is Rao’s score test (Bera and Bilias 2001). One of the major differences compared to Rao’s score test is that only the quadratic term (in the first derivative) forms the basis of these test statistics (Goeman et al. 2004; Wu et al. 2011; Wang et al. 2013). The argument is that this is the only part that involves the data (Huang and Lin 2013). The score test statistic can therefore be written as
(17) |
In Equation 17 where the subscript f indicates markers within the feature. The score test statistic requires and and these depend on how the null model is specified. The purpose of the null model is to adjust for environmental nongenetic effects and genetic factors not part of the genomic feature. If the null model is specified as the GBLUP (Equation 1), the score statistics can be expressed as
(18) |
where which corresponds to centering the observed values by the fixed effects and scaling by the inverse of the estimated phenotypic covariance. In the case of the simulated data, the null model is while for the observed disease data the null model also includes an adjustment for population structure and sex differences; thus, Therefore, From Equation 18 it can be seen that the score statistic is the covariance between the adjusted phenotypic values and the SNPs within the feature. Thus, close to zero indicates low covariance between SNPs within the feature set and the phenotypic values.
Comparison of summary statistics:
The purpose of this paragraph is to show the similarity and dissimilarity between the summary statistics described. First, we show that and are proportional when GBLUP marker effects ( Equation 4) are used:
(19) |
The result in Equation 19 can also be expressed for the markers within the feature set:
(20) |
Combining the results in Equation 20 with the expression for (Equation 18), we get
(21) |
which is proportional to when single-marker effects are used as test statistics:
(22) |
If the parameters are estimated under the GBLUP model, and can be compared directly. Compared to also considers the covariance among markers within the feature set and markers outside the feature set (Equation 8). Even if the contribution of individual markers to the summary statistic is weighted according to the degree of correlation among the markers (Equations 11 and 12).
Defining and testing :
The null hypothesis we consider is a competitive test; i.e., the degree of association between a particular set of genetic markers and a phenotype is equal to the degree of association to other complement sets, thus, the number of genetic markers and correlation structure among markers should be equal to the set of interest (Goeman and Bühlmann 2007).
To test whether a set of markers displays the same degree of association as a complement set, the observed summary statistic was compared to an empirical distribution of summary statistics. First, the observed test statistics were ordered according to the markers’ physical positions and next, one such element was randomly selected. All elements were then shifted to a new position, such that the one picked was the first element, and the remaining markers were shifted to new positions, but in the same order as originally, i.e.; A new summary statistic was computed based on the original position of the genomic feature. This uncoupled the association between the markers and the genomic feature, but retained the correlation structure among test statistics. The permutation was repeated 10,000 times for each set in the feature class and an empirical P-value was obtained as a one-tailed test of the proportion of the randomly sampled summary statistics being larger than the observed summary statistic.
Schizophrenia case–control data
The method described above was applied to a schizophrenia case–control study previously published (Børglum et al. 2014). The data consist of 1770 individuals born in Denmark between 1981 and 2006: 882 controls and 888 schizophrenia cases with a diagnosis according to the ICD-10 Classification of Mental and Behavioral Disorders Diagnostic Criteria for Research (ICD-10-DCR), F20. Cases and controls were matched by age and sex. The genotypes were subject to quality control, as described in Børglum et al. (2014). The Danish Data Protection Agency and the ethics committees in Denmark approved the human subjects protocol. Using the Variant Effect Predictor from Ensembl (McLaren et al. 2010), the marker positions were updated from human genome assembly hg18, NCBI Build 36 to UCSC hg38. In this process we restricted the genotypes to be autosomal, resulting in a total of 520,897 SNPs.
Genomic feature classes
Numerous genomic feature classes can be constructed. We created sets of genetic markers defined by gene ontologies (GOs) and immunological signatures. Annotated markers were grouped by GOs, using the Bioconductor package “org.Hs.eg.db” (Gentleman et al. 2004; Carlson 2014), and immunological signatures were obtained from the Molecular Signature Database (MSigDB) (Subramanian et al. 2005; Godec et al. 2016).
Simulating case–control populations
To evaluate the performance of CVAT and to compare it to other set test methods, several simulations were established. To create simulations directly comparable to the observed schizophrenia case–control data, we simulated ∼2000 individuals with the proportion of cases:controls of 0.5. Cases were drawn from a population with a disease prevalence (K) corresponding to that of schizophrenia; i.e., Thus, to obtain a total of = 100,000 individuals were simulated and cases were obtained such that which gave ∼1000 cases (varied between simulations). The 1000 controls were sampled from the remaining individuals. We simulated a total of = 80,000 SNPs distributed along 22 autosomes with the same correlation structure as in the human populations (Janss et al. 2012). The phenotypes were simulated based on 1000 causal markers. To obtain marker effects with different effect sizes, the causal variants were divided into two groups : and The variants in were sampled from 20 regions of 25 markers (to imitate genes), whereas the variants in were sampled randomly from the remaining markers. To obtain different effect sizes of the variants in several scenarios were established such that the proportion of genetic variance explained by ranged from 0.05, 0.1, 0.2, 0.3, 0.4, to 0.5. The total phenotypic variation explained by the genetic variation and thus the total heritability was set to the proportion of variance in liability attributable to genome-wide markers estimated from our schizophrenia data set ( see Results). To obtain the same as in the observed schizophrenia data, the phenotypes were simulated with the heritability on the liability scale Thus, after selecting the 2000 individuals we obtained For each scenario a total of 50 data sets were obtained. To investigate the effect of noncausal markers within the causal sets of variants, we added an increasing number of noncausal markers (100, 200, ... , 4900, 5000), to the causal sets, which we refer to as the dilution effect. To determine the false-positive rate, 50 noncausal marker sets of varying sizes (10 sets each containing 100, 500, 1000, 5000, or 10,000 SNPs) were sampled, none of which were contained in the causal variants.
The simulations consisted of five steps (see Algorithm below). First, a vector of MAFs is obtained. To simulate genotypes according to a particular correlation structure between the markers a correlation matrix was simulated (). The jth genotype of each ith individual was simulated as individual alleles [ and ], and the sum of the two alleles produced the genotype (). Each allele was simulated using a multivariate normal distribution, and to convert these to binary genotypes, truncation of the multivariate normal distribution was performed. However, truncation of the normal distribution would change the correlation structure within the binary data, and thus We applied an algorithm proposed by Leisch et al. (1998) such that after the multivariate normal data were simulated based on we obtained the discretized genotypes (0, 1, 2) with the desired Finally, the binary phenotypes () were simulated from a normal distribution as where is a vector of fixed effects, is a vector of genetic effects [ where is the marker effects of the causal variants within and ], and is the residual effect. The binary phenotypes were then obtained using truncation of the inverse of the standard normal distribution on using the disease prevalence.
To illustrate a potential extension of CVAT we performed a second set of simulations. Here, 50% of the causal genetic variants were set to have a MAF to mimic a scenario where the disease is influenced by both common and rare variants (Loh et al. 2015).
Algorithm
Given the vector of MAFs, calculate
For a matrix simulate and
-
-
such that
-
where
Assessing the power of summary statistics:
The performance of the different set tests was assessed using the F1 classification score:
(24) |
The F1 score is the harmonic mean of precision () and recall () (Powers 2011). is the proportion of true positives (TP) that are correctly identified, i.e., the ratio between the number of identified causal sets and the number of sets that should have been identified and thus the sum of TP and false negatives (FN). Contrarily, is the proportion of true positives that truly are positives, i.e., the proportion of true causal sets of all sets identified and thus the sum of TP and false positives (FP). The F1 score can take values between 0 and 1, with maximum performance at the value of 1.
Software implementation
The methods described above have been implemented in the R package qgg, which is available at http://psoerensen.github.io/qgg/.
Data availability statement
To protect the identity of the Danish individuals, the genotypic data generated for the Danish study are not publicly available. Data are available through the Aarhus University Data Access Committee for researchers who meet the criteria for access to confidential data. To request access to the data, contact John Westensee at jwe@au.dk.
Results
Quantitative genomic parameters
Quantitative genomic parameters for schizophrenia were obtained by fitting the linear mixed model as described in Equation 1, in which we accounted for sex and population structure by incorporating the first principal component. The proportion of phenotypic variance captured by common genetic markers was estimated as = 0.29, and accounting for ascertainment bias (Equation 7) we obtained = The heritability on the liability scale was used in our simulations, and across all simulation scenarios we obtained = 0.27.
Lessons from simulations
Single-marker association has been used to study complex traits, but for highly polygenic traits such methods may have limited power to detect causal variants. To illustrate this, we performed a logistic regression on the simulated data. The F1 classification score was used to describe the power to detect the causal variants (Figure 1). This clearly showed limited power to detect causal variants in C1 or C2. The power to detect the main causal effects () increased as the proportion of genetic variance explained by the markers in () increased (Figure 1); however, the power remained very limited, not exceeding an F1 score of 0.15. Thus, given the genetic model of our simulations, using a logistic regression with an arbitrary significance level (P < 0.001), a large proportion of noncausal genetic markers were identified, and a minor proportion of the causal variants were actually identified, altogether resulting in an F1 score <0.05 (Figure 1).
Figure 1.
F1 scores for single-marker association. For each scenario of genetic variance explained by the causal markers in we performed a standard GWAS, i.e., testing individual markers for association to the phenotype using a logistic regression. The F1 scores were computed for each simulated scenario for the causal markers within (blue), within (green), or among all causal markers, (red). Circles indicate the mean F1 score within scenario and error bars are the standard error of the mean.
The main objective of our study was to compare the performance of CVAT to that of other set test methods. Overall, as the proportion of genetic variance explained by increased, the power to detect the causal sets increased (Figure 2). Across the different genetic architectures, CVAT performed equally to the score method (Figure 2). The sum of marker effects and the sum of t-test of marker effects performed at an equal level (Figure 2). Counting the number of markers with a P-value below an arbitrary significance level showed the lowest performance, when the proportion of genetic variance explained by the main causal markers was low (i.e., = 0.05 and = 0.10); however, as increased this summary statistic performed better than both and (Figure 2).
Figure 2.
Comparing the performance of GBLUP-derived set tests. The five methods [covariance association test (CVAT); Score test; the sum of squared t-test statistics on marker effects; the sum of the squared marker effects; and the number of individual markers with a P-value below the threshold] were compared using F1 scores computed across the dilution effect, and thus the F1 score captures the general performance for each method across the dilution range. The color grading indicates the proportion of genetic variance explained by the causal markers in
Adding noncausal markers to the causal sets, i.e., diluting the causal effects, resulted in the same trend across all methods: F1 was reduced as the dilution effect got larger (Figure 3). Also, as increased, F1 increased. At low counting the number of markers below a significance threshold started at a lower F1 score than the other methods, and and tend to have the lowest performance at the largest dilution effect.
Figure 3.
The effect of dilution on F1 score. The panels shows the F1 score for the five methods [covariance association test (CVAT); Score test; the sum of the squared t-test statistics on marker effects; the sum of the squared marker effects; and the number of individual markers with a P-value below the threshold] as a function of adding noncausal markers to the set, i.e., diluting the signal caused by causal markers. Each panel is one of the six scenarios of different proportions of genetic variance explained by the causal markers in
When data were enriched for causal variants with low MAF, extending the CVAT to utilize marker effects with different distributions increased the power considerably (Figure 4). When the proportion of genetic variance explained by the main causal effects was large, the F1 scores were close to its maximum, and thus the largest gain in performance was retrieved at intermediate to low
Gene ontologies and immunological signatures associated with schizophrenia
We applied the six genomic BLUP-derived set tests to the Danish schizophrenia case–control data. To obtain genomic feature adjusted P-values we applied a false discovery rate on all the empirical P-values. As an additional constraint, feature sets were assigned as significant only if four of the six methods resulted in an adjusted P-value <0.05.
We utilized two types of prior biological knowledge to create a group of markers: GOs and immunological signatures. A total of 38 different GOs were associated by at least one method, and 11 GOs were associated by at least four of the methods (Table 1 and Supplemental Material, Table S1). A total of 16 different immunological signatures were associated by at least one of the methods, while only 1 was associated by more than four methods (Table 1 and Table S2). Two main categories of associated GOs could be identified. The first group contained four GOs related to developmental processes (GO:0021983, GO:0001568, GO:0060324, and GO:0043403), while the second group contained GOs related to vitamin A metabolism, including retinol metabolism (GO:0006776, GO:0042572, GO:0042573, and GO:0042574). The three remaining GOs were related to gene expression (GO:0010628), phosphorylation activity (GO:0008138), and light detection (GO:0050908). The immunological signature was GSE17721, which represents dendritic responses to PAM3CSK4 (an agonist to the membrane protein TLR2).
Extending CVAT to utilize marker effects based on the GFBLUP model to account for a mixture distribution entailed more feature sets having a false discovery rate (FDR) < 0.05 (Table 1). Examining the distribution of marker effects from GBLUP and GFBLUP clearly showed a mixture of two distributions (Figure 5). This pattern could be caused by a general enrichment of low MAFs within the associated genomic features; however, this was not observed (Figure 6).
Figure 5.
Distribution of marker effects from GBLUP (gray) and distributions of effect sizes from GFBLUP, allowing marker effects to have different variances. The grouping in GFBLUP was based on MAF (<0.01, purple curve; >0.01, orange curve).
Figure 6.
MAFs within associated genomic features. Gray dots are MAFs of SNPs within each associated feature, while the red horizontal lines indicate the mean MAF within each feature. Horizontal blue dashed line indicates the overall mean MAF.
Discussion
In the present study, we compared the performance of our newly proposed CVAT to a selection of commonly used set test methods. By simulating genotypes and binary phenotypes we investigated the power to detect causal sets of genetic markers. The simulations provided knowledge of how the different methods perform under different genetic scenarios. We applied the six methods to a schizophrenia case–control data set with the aim of identifying genomic features associated with schizophrenia.
From single-marker association to set tests
Despite the success of single-marker analyses, identification of genetic markers associated with highly polygenic traits is more restricted. Markers may be truly associated, but with effect sizes too small to be detected by traditional single-marker association analyses, or fail to pass an arbitrary, stringent genome-wide significance threshold (Hirschhorn and Daly 2005; Wang et al. 2010, 2011). Using simulations we investigated this particular problem and found indeed, that for highly polygenic traits the power to detect causal variants was very limited (Figure 1). Thus, for large-scale genomic data and polygenic traits, there is a need to get past single-marker analysis and move toward methods that take into account the collective action of multiple markers.
Set tests have been suggested as methods that can increase the power to detect groups of genetic markers containing causal variants. Comparing the performance of CVAT to existing methods showed that in all scenarios CVAT performed equally to the score-based approach and that both CVAT and the score approach outperformed the other methods (Figure 2). Summary statistics based on the sum of single-marker test statistics were expected to have more power to detect sets harboring markers with small to moderate effect size, in contrast to the count of the number of “significantly” associated markers within the set (Newton et al. 2007). This was observed at low and as the effect size of the causal markers increased, due to increased the count-based method outperformed the sum methods (Figure 2). This was expected because the count-based method tends to have higher power to detect feature sets harboring markers with large effect sizes (Newton et al. 2007). If and are obtained from the same model, they are proportional and therefore should have the same power (Equations 17–20). Here, and were not based on the same model: was based on GBLUP (Equation 1), while was based on a null model that did not include any genetic components, The disparity in performance may arise because the GBLUP only in part captures the true genetic signal into because it relies on separation of the genetic and residual components, while is based on a response variable containing most of the true genetic signal plus random noise, i.e., and hence no separation of the two components.
Recent findings indicate that low-frequency markers may have larger effect sizes (Loh et al. 2015), thus resulting in a mixture distribution of marker effects. When the simulated data were enriched for causal variants with low MAFs, a large improvement in the power to detect the causal sets was observed (Figure 4). We used a two-component partitioning (Equation 13), fitting two genetic effects jointly. Multiple genetic components can be fitted (Speed and Balding 2014); however, if the correlation between the multiple correlation matrices is high, uncertainties of the variance components increase (Gusev et al. 2014; Edwards et al. 2016), which affects the accuracy of the marker effects. Therefore, using marker effects from multigenetic components might decrease the power of CVAT because the marker effects are estimated with bias. However, this should be investigated further. Other extensions of CVAT can easily be implemented, including dominance effects, genotype-by-genotype interactions, and genotype-by-environmental effects. Since CVAT was derived from GBLUP, extending CVAT to a multitrait model is straightforward.
All analyses performed in our study were based on individual genotypes. Access to individual genotypes may be restricted, and thus methods that utilize published marker test statistics do have advantages. Both the score approach and CVAT require individual genotypes, whereas the count- and sum-based methods require only individual marker test statistics. There is a trade-off between the power of the set test and the availability of genotypes: If the premise is a data set containing both response variables and large-scale genotype variant data, then CVAT, with its many possible extensions, is an attractive method.
Biological insights from set tests
Statistical tests relying on prior information and aggregation of genetic markers based on some criteria other than the data currently being analyzed have the potential to recover more knowledge than single-marker analyses do. The gain in knowledge highly depends on the quality and complexity upon which the aggregation is based. Aggregation based on physical genomic regions, such as chromosomes or single genes, might not increase the information level, but as additional complexity layers are added, such as pathways, the biological interpretation might become easier and more informative. Especially, the bridging between different types of studies is a clear advantage, because the sharing of data could increase knowledge within different fields of science.
A large proportion of the GOs associated across methods were related to metabolism of retinoic acids (including vitamin A, Table 1). Of all associated GOs a total of 9/38 GOs were related to the metabolism of retinol (Table S1). These findings were supported by several independent lines of evidence, such as schizophrenia mouse models with impaired retinoic acid signaling (Lamantia 1999), microarray studies on human brains showing alterations in the whole retinoid cascade (Goodman 2005), and finally that low maternal vitamin A intake was associated with a threefold increase in the risk of schizophrenia in the offspring (Bao et al. 2012). Immunological changes have also been associated with the risk of developing schizophrenia (Maino et al. 2007), which supported our findings of a significant immunological signature (Table 1). In particular, the response to PAM3CSK4, an agonist to the membrane protein TLR2, has been shown to decrease the volume of gray matter in mice (Du et al. 2011): Reduction in gray matter is observed in schizophrenia patients (Pol et al. 2002).
Fewer genomic features were associated with the standard CVAT compared to the other methods; however, when extending CVAT to utilize marker effects arising from a mixture distribution, an increase in the number of associated feature sets was observed. Arguably, this could be due to either an increased power of CVAT or an increased false positive rate. Based on our simulations, we argue that the larger proportion of identified feature sets was due to increased power. In the case of the simulated data enriched for causal variants with low MAF, we saw a reduction in power of the standard CVAT (Figure 4). However, when exploiting the mixture distribution, the power increased considerably (Figure 4). Examining the distribution of SNP effects from the schizophrenia data, it is evident that the marker effects are a mixture of two distributions (Figure 5). This pattern could arise if the associated genomic features in general were enriched for genetic markers with low minor allele frequency. We did not observe this (Figure 6), but instead we observed that the associated genomic features were enriched for causal variants contributing to the disease risk, supporting the suggestions that the genetic risk of schizophrenia is a combination of common and rare variants (Gratten et al. 2014; Kato 2015; Loh et al. 2015).
Conclusion
The aim of the present study was to (1) compare the performance of our GBLUP-derived CVAT to commonly used set test methods and (2) apply the methods to identify genomic features associated with schizophrenia. Overall we found that CVAT performed equivalently to the score method and outperformed the other approaches. By extending CVAT to utilize a mixture distribution of marker effects, the power of CVAT increased significantly compared to the other set test methods. Several other extensions of CVAT are possible, making CVAT an attractive method for future research. Applying the set tests on a schizophrenia case–control data set, we identified several pathways, such as vitamin A metabolism and an immunological signature, which both previously have been functionally associated with schizophrenia. Despite the relatively small sample size, we were able to identify significant sets of markers having biological relevance.
Acknowledgments
This research was supported by the Center for Genomic Selection in Animals and Plants funded by the Danish Council for Strategic Research (contract no. 12-132452), the Faculty of Health Sciences and Centre for Integrative Sequencing at Aarhus University, and the Lundbeck Foundation.
Appendix
The Genomic Medicine for Schizophrenia (GEMS) Group is as follows and affiliations are listed below: Ole Mors (1, 2), David M. Hougaard (3), and Preben Bo Mortensen (2, 4):
Research Department P, Aarhus University Hospital, 8240 Risskov, Denmark.
iPSYCH, The Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8000 Aarhus, Denmark.
Danish Center for Neonatal Screening, Statens Serum Institut, 2300 Copenhagen, Denmark.
National Center for Register-Based Research, Aarhus University, 8000 Aarhus, Denmark.
Footnotes
Communicating editor: N. R. Wray
Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.189498/-/DC1.
The Genomic Medicine for Schizophrenia Group members are listed in the Appendix.
Literature Cited
- Allen H. L., K. Estrada, G. Lettre, S. I. Berndt, M. N. Weedon et al, 2010. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467: 832–838. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bao Y., Ibram G., Blaner W. S., Quesenberry C. P., Shen L., et al. , 2012. Low maternal retinol as a risk factor for schizophrenia in adult offspring. Schizophr. Res. 137: 159–165. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bera A. K., Bilias Y., 2001. Rao’s score, Neyman’s C and Silvey’s LM tests: an essay on historical developments and some new results. J. Stat. Plan. Inference 97: 9–44. [Google Scholar]
- Børglum A. D., Demontis D., Grove J., Pallesen J., Hollegaard M. V., et al. , 2014. Genome-wide study of association and interaction with maternal cytomegalovirus infection suggests new schizophrenia loci. Mol. Psychiatry 19: 325–333. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Carlson M., 2014 org.Hs.eg.db: Genome Wide Annotation for Human (R package version 3.0.0). Available at: https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html.
- Cule E., Vineis P., De Iorio M., 2011. Significance testing in ridge regression for genetic data. BMC Bioinformatics 12: 1–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Du X., Fleiss B., Li H., D’angelo B., Sun Y., et al. , 2011. Systemic stimulation of TLR2 impairs neonatal mouse brain development. PLoS One 6: e19583. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Edwards S. M., Sørensen I. F., Sarup P., Mackay T. F. C., Sørensen P., 2016. Genomic prediction for quantitative traits is improved by mapping variants to gene ontology categories in Drosophila melanogaster. Genetics 203: 1871–1883. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fridley B. L., Biernacka J. M., 2011. Gene set analysis of SNP data: benefits, challenges, and future directions. Eur. J. Hum. Genet. 19: 837–843. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gentleman R. C., Carey V. J., Bates D. M., Bolstad B., Dettling M., et al. , 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5: R80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Godec J., Tan Y., Liberzon A., Butte A. J., Mesirov J. P., et al. , 2016. Compendium of immune signatures identifies conserved and species-specific biology in response to inflammation. Immunity 44: 1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Goeman J. J., Bühlmann P., 2007. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23: 980–987. [DOI] [PubMed] [Google Scholar]
- Goeman J. J., van de Geer S., de Kort F., van Houwellingen H. C., 2004. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20: 93–99. [DOI] [PubMed] [Google Scholar]
- Goeman J. J., van de Geer S. A., van Houwelingen H. C., 2006. Testing against a high dimensional alternative. J.R. Stat. Soc. 68: 477–493. [Google Scholar]
- Goodman A. B., 2005. Microarray results suggest altered transport and lowered synthesis of retinoic acid in schizophrenia. Mol. Psychiatry 10: 620–621. [DOI] [PubMed] [Google Scholar]
- Gratten J., Wray N. R., Keller M. C., Visscher P. M., 2014. Large-scale genomics unveils the genetic architecture of psychiatric disorders. Nat. Neurosci. 17: 782–790. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gusev A., Lee S. H., Trynka G., Finucane H., Vilhja B. J., et al. , 2014. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 95: 535–552. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hirschhorn J. N., Daly M. J., 2005. Genome-wide association studies for common diseases and complex traits. Natl. Rev. 6: 95–108. [DOI] [PubMed] [Google Scholar]
- Holmans P., Green E. K., Pahwa J. S., Ferreira M. A. R., Purcell S. M., et al. , 2009. Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am. J. Hum. Genet. 85: 13–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huang Y.-T., Lin X., 2013. Gene set analysis using variance component tests. BMC Bioinformatics 14: 210. [DOI] [PMC free article] [PubMed] [Google Scholar]
- International Schizophrenia Consortium , 2009. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460: 748–752. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Janss L., G. delos Campos, N. Sheehan, and D. Sorensen, 2012. Inferences from genomic models in stratified populations. Genetics 192: 693–704. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kato T., 2015. Whole genome/exome sequencing in mood and psychotic disorders. Psychiatry Clin. Neurosci. 69: 65–76. [DOI] [PubMed] [Google Scholar]
- Lamantia A. S., 1999. Forebrain induction, retinoic acid, and vulnerability to schizophrenia: insights from molecular and genetic analysis in developing mice. Biol. Psychiatry 46: 19–30. [DOI] [PubMed] [Google Scholar]
- Lee S. H., Wray N. R., Goddard M. E., Visscher P. M., 2011. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 88: 294–305. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee S. H., DeCandia T. R., Ripke S., JYang, The Schizophrenia Psychiatric Genome-Wide Association Study Consortium et al, 2012. Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nat. Genet. 44: 247–250. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Leisch, F., A. Weingessel, and K. Hornik, 1998 On the generation of correlated artificial binary data. Working Papers SFB “Adaptive Information Systems and Modelling in Economics and Management Science.” Vienna University of Economics and Business, Vienna.
- Loh P.-R., G. Bhatia, A. Gusev, H. K. Finucane, B. K. Bulik-Sullivan et al, 2015. Contrasting regional architectures of schizophrenia and other complex diseases using fast variance components analysis. Nat. Genet. 47: 1385–1392. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lynch, M., and B. Walsh, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA. [Google Scholar]
- Maino K., Gruber R., Riedel M., Seitz N., Schwarz M., et al. , 2007. T- and B-lymphocytes in patients with schizophrenia in acute psychotic episode and the course of the treatment. Psychiatry Res. 152: 173–180. [DOI] [PubMed] [Google Scholar]
- McCarthy M. I., Abecasis G. R., Cardon L. R., Goldstein D. B., Little J. et al, 2008. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 9: 356–369. [DOI] [PubMed] [Google Scholar]
- McLaren W., Pritchard B., Rios D., Chen Y., Flicek P., et al. , 2010. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26: 2069–2070. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mooney M. A., Nigg J. T., McWeeney S. K., Wilmot B., 2014. Functional and genomic context in pathway analysis of GWAS data. Trends Genet. 30: 390–400. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nelder J. A., Mead R., Nelder B. J. A., Mead R., 1965. A simplex method for function minimization. Comput. J. 7: 308–313. [Google Scholar]
- Network and Pathway Analysis Subgroup of the Psychiatric Genomics Consortium , 2015. Psychiatric genome-wide association study analyses implicate neuronal, immune and histone pathways. Nat. Neurosci. 18: 199–209. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Newton M. A., Quintana F. A., den Boon J. A., Sengupta S., Ahlquist P., 2007. Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. Ann. Appl. Stat. 1: 85–106. [Google Scholar]
- Pol H. E. H., Schnack H. G., Bertens M. G. B. C., van Haren N. E. M., van der Tweel I., et al. , 2002. Volume changes in gray matter in patients with schizophrenia. Am. J. Psychiatry 159: 244–250. [DOI] [PubMed] [Google Scholar]
- Powers D., 2011. Evaluation: from precision, recall and f-measure to roc., informedness, markedness & correlation. J. Mach. Learn. Technol. 2: 37–63. [Google Scholar]
- Ripke S., CO’Dushlaine, KChambert, J. LMoran, A. KKähler et al, 2013. Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nat. Genet. 45: 1150–1159. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sarup P., Jensen J., Ostersen T., Henryon M., Sørensen P., 2016. Increased prediction accuracy using a genomic feature model including prior information on quantitative trait locus regions in purebred Danish Duroc pigs. BMC Genet. 17: 11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schadt E. E., 2009. Molecular networks as sensors and drivers of common human diseases. Nature 461: 218–223. [DOI] [PubMed] [Google Scholar]
- Schizophrenia Psychiatric Genome-Wide Association Study Consortium , 2011. Genome-wide association study identifies five new schizophrenia loci. Nat. Genet. 43: 969–978. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schizophrenia Working Group of the Psychiatric Genomics Consortium , 2014. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511: 421–427. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Speed D., Balding D. J., 2014. MultiBLUP: improved SNP-based prediction for complex traits. Genome Res. 24: 1550–1557. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Subramanian A., Tamayo P., Mootha V. K., Mukherjee S., Ebert B. L., et al. , 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102: 15545–15550. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sullivan P. F., Kendler K. S., Neale M. C., 2003. Schizophrenia as a complex trait. Arch. Gen. Psychiatry 60: 1187–1192. [DOI] [PubMed] [Google Scholar]
- VanRaden P. M., 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423. [DOI] [PubMed] [Google Scholar]
- Wang K., Li M., Bucan M., 2007. Pathway-based approaches for analysis of genomewide association studies. Am. J. Hum. Genet. 81: 1278. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang K., Li M., Hakonarson H., 2010. Analysing biological pathways in genome-wide association studies. Natl. Rev. 11: 843–854. [DOI] [PubMed] [Google Scholar]
- Wang L., Jia P., Wolfinger R. D., Chen X., Zhao Z., 2011. Gene set analysis of genome-wide association studies: methodological issues and perspectives. Genomics 98: 1–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang X., Morris N. J., Zhu X., Elston R. C., 2013. A variance component based multi-marker association test using family and unrelated data. BMC Genet. 14: 1–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wu M. C., Lee S., Cai T., Li Y., Boehnke M., et al. , 2011. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89: 82–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
To protect the identity of the Danish individuals, the genotypic data generated for the Danish study are not publicly available. Data are available through the Aarhus University Data Access Committee for researchers who meet the criteria for access to confidential data. To request access to the data, contact John Westensee at jwe@au.dk.