Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2017 May 11;33(18):2882–2889. doi: 10.1093/bioinformatics/btx314

IGESS: a statistical approach to integrating individual-level genotype data and summary statistics in genome-wide association studies

Mingwei Dai 1,2, Jingsi Ming 2, Mingxuan Cai 2, Jin Liu 3, Can Yang 2,*, Xiang Wan 4,*, Zongben Xu 1,*
PMCID: PMC5860575  PMID: 28498950

Abstract

Motivation: Results from genome-wide association studies (GWAS) suggest that a complex phenotype is often affected by many variants with small effects, known as ‘polygenicity’. Tens of thousands of samples are often required to ensure statistical power of identifying these variants with small effects. However, it is often the case that a research group can only get approval for the access to individual-level genotype data with a limited sample size (e.g. a few hundreds or thousands). Meanwhile, summary statistics generated using single-variant-based analysis are becoming publicly available. The sample sizes associated with the summary statistics datasets are usually quite large. How to make the most efficient use of existing abundant data resources largely remains an open question.

Results: In this study, we propose a statistical approach, IGESS, to increasing statistical power of identifying risk variants and improving accuracy of risk prediction by integrating individual level genotype data and summary statistics. An efficient algorithm based on variational inference is developed to handle the genome-wide analysis. Through comprehensive simulation studies, we demonstrated the advantages of IGESS over the methods which take either individual-level data or summary statistics data as input. We applied IGESS to perform integrative analysis of Crohns Disease from WTCCC and summary statistics from other studies. IGESS was able to significantly increase the statistical power of identifying risk variants and improve the risk prediction accuracy from 63.2% (±0.4%) to 69.4% (±0.1%) using about 240 000 variants.

Availability and implementation: The IGESS software is available at https://github.com/daviddaigithub/IGESS.

Contact: zbxu@xjtu.edu.cn or xwan@comp.hkbu.edu.hk or eeyang@hkbu.edu.hk

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

As of October 2016, more than 29 000 single-nucleotide polymorphisms (SNPs) have been reported to be significantly (P-value 5.0×108) associated with complex human phenotypes (including quantitative traits and complex diseases) in about 2600 GWAS (see the GWAS Catalog database https://www.ebi.ac.uk/gwas/) (Welter et al., 2014). However, these genome-wide significant hits can only explain a small proportion of genetic contribution to complex human phenotypes. For example, 70–80% of variance of human height can be attributed to genetic variations (i.e. heritability of human height is about 70–80%) (Visscher et al., 2008) while the GWAS hits of human height can only explain about 10% of the variance (Allen et al., 2010, Wood et al., 2014). This is the so-called ‘missing heritability’ problem (Manolio et al., 2009). Recent progresses suggested that most of the heritability is not missing but can be explained by many common SNPs with individually weak effects (Visscher et al., 2012; Yang et al., 2010). However, due to limited sample sizes, genetic variants with small effects do not achieve the genome-wide statistical significance and thus the majority of them remain undiscovered yet. Recently, world-wide researchers are forming large genomic consortia, such as the Genetic Investigation of ANthropometric Traits (GIANT) Consortium and the Wellcome Trust Case Control Consortium (WTCCC), to collect samples and analyze data.

Numerous GWAS data analysis methods (e.g. PLINK (Purcell et al., 2007), BOOST (Wan et al., 2010) and GCTA (Yang et al., 2011)) have recently been proposed. A detailed survey of these tools can be found in Stephens and Balding (2009), Cantor et al. (2010) and Pasaniuc and Price (2016). These methods could be roughly divided into two categories: (i) individual-level data analysis and (ii) summary data (mostly P-values or z-scores) analysis. Some popular methods in the first category include penalized regression methods (Liu et al., 2013; Wu et al., 2009) and Bayesian regression methods (Carbonetto et al., 2012; Zhou et al., 2013). It is noteworthy that linear mixed models (LMM) are gaining increasing interest in genomic data analysis. Some popular LMM-based approaches include GEMMA (Zhou and Stephens, 2012) and SKAT (Ionita-Laza et al., 2013). In the absence of individual-level data, methods in the second category play the major role. GSEA (Zhang et al., 2010) and VEGAS (Liu et al., 2010) are the two well-known methods in this category. Some recent approaches including CPASSOC (Zhu et al., 2015), GPA (Chung et al., 2014), EPS (Liu et al., 2016) and LD-score regression (Bulik-Sullivan et al., 2015), are able to incorporate functional data, such as DNase I hypersensitive site (DHS) data and expression quantitative trait loci (eQTL) data. Compared with the methods in the first category, the methods in the second category often have computational advantages but may be statistically less efficient because only summary statistics are used.

It may be more preferable for researchers to work with the individual-level data as it retains more information than summary statistics data. However, it is often hard for a research group to get fully access to the individual-level data of large sample sizes. For instance, a core research group from the GIANT consortium reported that they could only access genotype data from about 44 000 individuals (Yang et al., 2015) while the total sample size is more than 250 000 for the consortium (Wood et al., 2014). In reality, access to individual-level data is often quite restricted due to practical issues, such as privacy protection issue and logistics in data transportation and storage. On the contrary, summary statistics from GWAS are widely available through many public gateways. Therefore, it is very typical for a research group to get access to summary statistics from hundreds of GWAS with a large sample size freely, but only have limited sample size of individual-level data (usually a few hundreds or thousands of samples at hand). A simple strategy in this situation is to first obtain summary statistics from individual-level data and then conduct some meta-analyses with the collected summary data. But this strategy does not fully make use of the individual-level data, leading to inefficient use of data at hand. How to make the most efficient use of existing data resources to pinpoint disease-associated genetic variants with small effects is of great interest.

In this study, we propose a statistical approach, IGESS, to integrating individual level genotype data and summary statistics for exploration of genetic architechture of complex phenotypes. An efficient algorithm based on variational inference has been developed such that it is scale up to genome-wide analysis. Not only does IGESS provide the posterior probability of association status between each SNP and the given phenotype, but also offer the effect size of each SNP for risk prediction. We conducted comprehensive simulation studies to evaluate the performance of IGESS and then applied it to Crohn’s disease (CD). The results demonstrate that IGESS is able to integrate different types of data, gaining increased power in identification of risk variants and improved accuracy of risk prediction.

2 IGESS

2.1 Model

Given a phenotype, suppose we have an individual-level GWAS dataset {y,X} of N samples, where yN is the vector of phenotypic values and XN×M is the genotype matrix of M SNPs from these samples. Without loss of generality, we assume both X and y have been centered. In addition, we collect summary statistics, i.e. P-values for this phenotype from K independent GWAS in matrix P=[pjk]M×K, where pjk corresponds to the P-value of the jth SNP in the kth GWAS. First, we consider the following linear model that links y to X,

y=Xβ+e, (1)

where β=[β1,,βM]T is a vector of effect sizes, and eN(0,σe2I) is an independent noise term due to non-genetic factors. Under this model, identification of risk variants is equivalent to identifying nonzero entries in β. For this purpose, we introduce a binary variable γj to indicate whether βj is zero or not. Assuming the spike and slab prior (Mitchell and Beauchamp, 1988) for βj, we have

βj|γj;σβ2{N(βj|0,σβ2)if γj=1,δ0(βj)if γj=0, (2)

where N(βj|0,σβ2) denotes the Gaussian distribution with mean 0 and variance σβ2 and δ0(βj) is the Dirac function centered at zero, and γj is assumed to be drawn from the Bernoulli distribution Bern(γj|π),

γj|ππγj(1π)1γj. (3)

To incorporate information from summary statistics, we also consider a two-groups model, i.e. the P-values of each study are assumed to come from the mixture of null and non-null groups. Since we are considering the same phenotype across multiple studies, it is reasonable to assume the vector of association status γ is the same in these studies. This gives us the following model: The obtained P-values from the kth GWAS are assumed to come from a mixture distribution

pjk|γj;αk{U(0,1)if γj=0,Beta(pjk|αk,1)if γj=1, (4)

where P-values from the null group follow the uniform distribution U(0,1), and P-values of the kth study from the non-null group follow a beta distribution with parameter (αk,1), which captures the pattern that P-values from the non-null group is closer to zero. Comprehensive experiment results provided in the Supplementary Document demonstrate that the beta distribution is a good approximation of the P-value distribution under the alternative. Given γj, we assume pj1,,pjK are independent. Thus, we have

Pr(P|γ)=j=1M(k=1Kαkpjkαk1)γj. (5)

Let θ={π,σβ2,σe2,αk,k=1K} be the collection of model parameters. The probabilistic model can be written as

Pr(y,P,β,γ|X;θ)=Pr(y|X,β;θ)Pr(β|γ;θ)Pr(γ|θ)Pr(P|γ;θ). (6)

Our goal is to first obtain θ^ (the estimate of θ) by maximizing the marginal likelihood (known as type II maximum likelihood estimate or Empirical Bayes estimate)

Pr(y,P|X;θ)=γβPr(y,P,β,γ|X;θ)dβ, (7)

and then compute the posterior

Pr(β,γ|y,X,P;θ^)=Pr(y,P,β,γ|X;θ^)Pr(y,P|X;θ^). (8)

The above IGESS model was developed with quantitative traits following a Gaussian distribution. Although it can be extended to handle case-control studies using the logit or probit link function, the performance has very minor difference from the Gaussian assumption as long as the sample size is moderate (e.g. a few hundreds) but the computational cost is much more expensive. In fact, direct application of linear models to the analysis of case-control GWAS datasets has been justified in Kang et al. (2010). Therefore, we focus on the Gaussian assumption in this paper. Results from both simulation study and real data analysis suggest that linear models (Gaussian assumption) can often provide satisfactory performance on risk variant identification and risk prediction in case-control studies when the population structure is not very complex (see comprehensive results in the Supplementary Document).

2.2 Algorithm

The challenge is that the exact evaluation of (7) is intractable. To overcome this difficulty, we derive an efficient algorithm based on variational inference (Bishop, 2006). Before we describe the algorithm, we first reparameterize our model to get rid of the Dirac function such that we can have an easier derivation. Let β˜j be a Gaussian variable with N(β˜j|0,σβ2), and γj be a Bernoulli variable with Bern(γj|π), respectively. Their product β˜jγj has the exact the same distribution as βj in (2). With this reparameterization, the joint model (6) becomes

Pr(y,P,β˜,γ|X;θ)=Pr(y|X,β˜,γ;θ)Pr(β˜;θ)Pr(γ|θ)Pr(P|γ;θ), (9)

where

Pr(y|X,β˜,γ;θ)=N(y|jxjβ˜jγj,σe2I),Pr(β˜;θ)=jMN(β˜j|0,σβ2),Pr(γ|θ)=jMπγj(1π)1γj,Pr(P|γ;θ)=jM(kKαkpjkαk1)γj. (10)

Clearly, the Dirac function is not involved and the prior of β˜j does not depend on γj anymore. Next, we apply variational approximation to Pr(y,P,β˜,γ|X;θ). Let q(β˜,γ) be an approximated distribution of the posterior Pr(β˜,γ|y,X,P;θ). Then we have a lower bound of the logarithm of the marginal likelihood

logPr(y,P|X;θ)=logγβ˜Pr(y,P,β˜,γ|θ)dβ˜γβ˜q(β˜,γ)logPr(y,P,β˜,γ|X;θ)q(β˜,γ)dβ˜=Eq[logPr(y,P,β˜,γ|X;θ)logq(β˜,γ)]:=(q), (11)

where the inequality follows Jensen’s inequality and the equality holds if and only if q(β˜,γ) is the true posterior Pr(β˜,γ|y,P,X;θ). Instead of working with the marginal likelihood, we can iteratively maximize (q). To make it feasible to evaluate this lower bound, we use mean-field method (Bishop, 2006), assuming that q(β˜,γ) can be factorized as

q(β˜,γ)=j=1Mqj(β˜j,γj). (12)

This is the only assumption we made in variational approximation. According to the nice property of factorized distributions in variational inference (Bishop, 2006), we can obtain the best approximation as

logqj(β˜j,γj)=Eij[logPr(y,P,β˜,γ|X;θ)]+const, (13)

where the expectation is taken with respect to all of the other factors {qi(β˜i,γi)} for ij. After some derivations (refer to Section 1.1 of the Supplementary Document), we have

qj(β˜j,γj)=[πjN(β˜j|μj,sj2)]γj[(1πj)N(β˜j|0,σβ2)]1γj, (14)

where

sj2=σe2xjTxj+σe2σβ2,
μj=xjTyijE[γiβ˜i]xjTxixjTxj+σe2σβ2, (15)

and

πj=11+exp(wj),
wj=logπ1π+12logsj2σβ2+μj22sj2+k=1Klog(αkpjkαk1). (16)

Since q(β˜,γ) is an approximation to the true posterior, the above result (14) can be interpreted as follows. Here πj can be viewed as an approximation of Pr(γj=1|y,X,P;θ). As we can see, the posterior of γj=1 is affected by its prior and the evidence from both individual-level data and P-values. When SNP j is irrelevant to the phenotype (γj=0), the approximated posterior of β˜j remains the same as its prior, i.e. β˜jN(β˜j|0,σβ2). When SNP j is relevant, its posterior changes accordingly as β˜jN(β˜j|μj,sj2).

With qj(β˜j,γj) given in (14), we can evaluate the lower bound analytically

L(q)=Eq[logPr(y,β˜,γ,P|X,θ)]Eq[logq(β˜,γ)]=N2logσe2||yj=1Mπjμjxj||22σe212σe2j=1M [πj(sj2+μj2)(πjμj)2]xjTxj12σβ2j=1M [πj(μj2+sj2)+(1πj)σβ2]+j=1Mπjlog(ππj)+j=1M(1πj)log(1π1πj)+j=1Mπjk=1Klog(αkpjkαk1)+j=1M12πjlogsj2σβ2+Const.

Therefore, model parameters in θ can be updated by solving θ=0 as

π=jπjM,σβ2=jπj(μj2+sj2)jπj,αk=j=1Mπjj=1Mπj(logpjk),
σe2=||yj=1Mπjμjxj||2+j=1M(πj(sj2+μj2)(πjμj)2)xjTxjN.

In summary, our algorithm can be viewed as a variational expectation-maximization (EM) algorithm. In the expectation step, we evaluate the expectation w.r.t. the distribution q to obtain the lower bound (q) given in (11). In the maximization step, we maximize current (q) w.r.t. model parameters in θ. Hence, the lower bound increases in each EM iteration and the convergence of the proposed algorithm is guaranteed. Alternatively, the variational EM algorithm can be considered as a deterministic approximation to a Monte-Carlo EM algorithm which is a golden standard for statistical inference. We have provided detailed discussions about their connection in the Supplementary Document.

2.3 Identification of risk variants and risk prediction

After the convergence of the algorithm, we extract information from our model for identification of risk variants and risk prediction. Since πj given in (16) is an approximation to the true posterior Pr(γj=1|y,P,X;θ) of SNP j, we use fdrj=1πj as an approximation of the local false discovery rate (FDR) of SNP j (Efron, 2010) and thus SNP j will be considered as a risk variant if fdrj is close to 0, e.g. fdrj0.05.

Besides identification of risk variants, we are also interested in risk prediction. Based on the reparameterized model, the effect size of SNP j is given as E(γjβ˜j)=πjμj. Given a genotype vector x˜=[x˜1,,x˜M]T of an individual, the predicted phenotypic value will be y^=c0+j(x˜jcj)πjμj, where c0 and c1,,cp are the mean of the phenotype and each SNP before centering, respectively. For case-control study, y^ should only be interpreted as the relative risk score rather than the absolute risk score (Chatterjee et al., 2016).

2.4 Selection of informative studies from summary statistics data

As given in Equations (2–4), we assume variant j has the same association status (denoted as γj) across all studies. This assumption is reasonable when all the studies are conducted for the same phenotype in the same population. However, this assumption may be violated if GWAS are conducted in different populations. Here we propose a forward stepwise procedure to select informative studies from summary statistics data. We use prediction accuracy by cross-validation as the criterion to select summary statistics from relevant study. Precisely, the prediction accuracy is measured by correlation between predicted values and observed values in quantitative trait studies or the area under the receiver operating characteristic (ROC) curve (AUC) in case-control studies. In forward stepwise selection, IGESS tries to add one summary-statistic data at a time and picks the summary-statistic dataset which maximizes prediction accuracy. If prediction accuracy gets worse when a summary-statistic data is incorporated, IGESS will automatically exclude this study in the following steps. More specifically, we start with individual level data {X,y} to have a baseline model. Then we try to add each column of P to the baseline model to check the model performance and select the best one into the model. We keep adding the columns of P in such a way until all relevant columns have been included in the model. The cross-validation is used to determine the right model.

2.5 Computational cost of the variational EM algorithm

Regarding the computational time of IGESS, here we report the CPU time for 100 iterations with different sample size N and different number of SNPs M in Figure 1. All the timings were carried out on a notebook with a configuration of 2.6 GHz Intel Core i7 and 16G memory. The computational time is nearly linear with respect to N and M. In real data analysis, the total number of iterations depends on the strength of the GWAS signal. The stronger the GWAS signal, the faster the algorithm converges. IGESS often takes less than twenty minutes to handle a GWAS dataset with thousands of samples and hundreds of thousands of SNPs. The R package of IGESS is available at https://github.com/daviddaigithub/IGESS.

Fig. 1.

Fig. 1.

Timing (CPU seconds) with respect to different sample size N and different number of SNPs M (Color version of this figure is available at Bioinformatics online.)

3 Results

3.1 Simulation

In this section, we describe how to evaluate the performance of IGESS in simulation studies. We first evaluated its performance on identification of risk variants, in comparison with Lasso (Tibshirani, 1996) and BVSR (Carbonetto et al., 2012) which only take individual-level data as input, as well as P-value-based ranking and CPASSOC (Zhu et al., 2015) which only use summary statistics (specifically, z-scores). After that, we evaluated its accuracy on risk prediction, where the performance of BVSR (Carbonetto et al., 2012) and Lasso (Friedman et al., 2010; Tibshirani, 1996) served as a reference because they only use individual-level data.

3.1.1 Simulation settings

The simulation datasets were generated as follows. For the individual-level dataset, the genotype matrix X was first simulated from normal distribution, where autoregressive correlation ρ|jj| was set to mimic the linkage disequilibrium between variants j and j. Next, the entries in X were discretized to genotype codes {0, 1, 2} according to the Hardy-Weinberg principle based on the minor allele frequencies drawn from U[0.05,0.5]. The number of samples and the number of variants were set to be N = 2000 and M = 10 000, respectively. To simulate the vector of effect sizes β, we randomly chose 500 nonzero entries and simulated their values from N(0, 1). Accordingly, the association status was recorded in γ. Since heritability is defined as h=Var(Xβ)Var(Xβ)+σe2, the noise level was specified according to the pre-specified heritability at {0.3,0.5,0.8}.

To simulate summary statistics, we did not adopt the generative model (4) but obtain them from the individual-level data. This helps us to evaluate whether the Beta distribution can model the P-values from the non-null group. We considered two scenarios to simulate the positions of nonzero entries. In the first scenario, the vector of association status γ(k) of the kth study was set to be the same as γ, for k=1,,K. Clearly, this setting is favorable to our model as all studies share the same sets of associated risk variants. In the second scenario, γ(k) was allowed to be different from γ. To do so, we varied the conditional probability uk=Pr(γj(k)=1|γj=1) at {0.05,0.1,0.3,0.5,0.7,0.9,1}. In such a way, γ(k) was simulated and the corresponding nonzero entries of β(k) were simulated from N(0, 1). The assumption of IGESS is violated in this scenario. For each study k, we first used the above steps to generate genotype matrix X(k) and phenotype data y(k), then we conducted univariate linear regression to get summary statistics (z-score and P-value) for each SNP across K studies.

For IGESS, both the individual-level data {X,y} and the P-value matrix P were used, where {X(k),y(k)} were pretended to be unavailable. For BVSR and the Lasso, only the individual level dataset {X,y} was used and their performance could serve as a reference. For CPASSOC, an M×(1+K) matrix Z containing the z-scores from {X,y} and the z-scores from K studies was used as its input.

3.1.2 Results

To evaluate the performance of risk variant identification, we adopted the area under the receiver operating characteristic (ROC) curve (AUC). We started with the results evaluated in the first scenario where γ(k)=γ for k=1,2,,K. Figure 2 shows the comparison results of IGESS, BVSR, CPASSOC, Lasso and P-value-based ranking for autoregressive correlation ρ=0.6 and K = 1, 2, 6. The results for other parameter settings, e.g. ρ=0,0.3, are given in the Supplementary Document. Here the results of BVSR, Lasso and P-value-based ranking only serve as a reference since they only used {X,y} as input. In the left panel, the AUC of both IGESS and CPASSOC steadily increases as the number of summary statistics datasets K increases. IGESS has similar performance with CPASSOC when heritability is moderate (h = 0.5) and outperforms CPASSOC when the heritability is high (h = 0.8). We have observed that CPASSOC is slightly better than IGESS in terms of AUC when heritability is very small (e.g. h = 0.3). A closer examination reveals that both methods have nearly zero power with the nominal FDR controlled at 0.1. This implies that the slightly better AUC of CPASSOC is due to the ranking results of risk variants with a larger false positive rate which is not of interest in practice (see more results in the Supplementary Document). Because the top findings with small false positive rate (FPR) are of more interest, the ROC curves of the three methods with FPR < 0.1 are provided in the Supplementary Document. To see the benefits of integrating individual-level data with summary statistics in risk predication, we evaluated prediction accuracy of IGESS, where the performance of BVSR and Lasso can serve as a reference. The right panel of Figure 2 shows the performance of the three methods, indicating that prediction accuracy can be steadily improved by incorporating summary statistics.

Fig. 2.

Fig. 2.

Comparison of IGESS, BVSR, CPASSOC, Lasso and P-value-based ranking with different heritability h, and different number of GWAS studies K. Left panel: Performance of risk variant identification measured by AUC. Right panel: Performance of risk prediction measured by correlation between the observed phenotype values and the predicted values. The autoregressive correlation ρ=0.6 and the number of summary statistics datasets k = 1, 2 and 6. All the results are summarized based on 50 replications (Color version of this figure is available at Bioinformatics online.)

Next, we investigated the performance of IGESS in the second scenario. We simulated P-values from K = 7 studies, where uk=Pr(γj(k)=1|γj=1){0.05,0.1,0.3,0.5,0.7,0.9,1}. Note that uk also indicates the proportion of nonzero effect sizes shared between the study with individual-level data and each study with summary statistics. The stepwise strategy (Section 2.4) was used to select relevant studies. Figure 3 shows performance of risk variant identification (measured by AUC in the left panel) and prediction accuracy (measured by the correlation between the observed values and predicted values in the right panel) during the stepwise process. At the first step, the stepwise strategy attempted to combine individual-level data {X,y} with P-values from each study. The first subplot of the left panel shows that the performance of risk variant identification degrades when the P-values from studies with uk{0.05,0.1,0.3} are incorporated and the performance starts to be improved from uk=0.5. Since IGESS adopts the forward stepwise strategy, it will select the study which offers the best improvement of the prediction performance at each step and keep adding P-values from remaining studies, as indicated by s=1,,7. As we can see clearly, the performance of IGESS improves steadily until irrelevant or nearly irrelevant studies included (s = 5). The similar pattern has been observed for the performance of prediction accuracy, as shown in the right panel of Figure 3. In summary, it is clear that the performance improvement of IGESS does not require uk = 1, suggesting that IGESS is robust to the noise in the P-value dataset.

Fig. 3.

Fig. 3.

Performance evaluation in presence of irrelevant studies. Left panels: Performance of risk variant identification measured by AUC. Right panels: Performance of risk prediction measured by correlation between the observed phenotype values and the predicted values. In all panels, {X,y} in the x-axis indicates the performance of individual-level data only, q corresponds to the performance of integrating the individual-level data with P-values from a study simulated using parameter uk=Pr(γj(k)=1|γj=1){0.05,0.1,0.3,0.5,0.7,0.9,1}, and s indicates the stepwise performance achieved at the s-step

3.2 Real data analysis

We applied IGESS to analyze GWAS data of Crohn’s disease (CD). The individual-level data is from the Welcome Trust Case Control Consortium (WTCCC) (Burton et al., 2007). There are 5009 samples, of which 2005 are cases and 3004 are controls. The summary data of Crohn’s Disease is from Franke et al. (2010), comprised of the P-values of six GWAS in total.

3.2.1 Quality control on the individual-level data

We performed strict quality control on the individual-level data from WTCCC using PLINK(Purcell et al., 2007) and GCTA (Yang et al., 2011). First, we removed individuals with >2% missing genotypes. For the case of CD and the two control datasets, we removed SNPs with minor allele frequencies <0.05 and SNPs with missing rate >1%. Then we combined the case sets of CD and the two control sets into case/control studies. SNPs with P-value <0.001 in Hardy-Weinberg equilibrium test were removed. Pairs of subjects with estimated relatedness >0.025% were identified and one member of these pairs was removed at random by GCTA (Yang et al., 2011). After quality control, we had a dataset including 1656 cases and 2880 controls, each with 308 950 SNPs.

3.2.2 Summary-statistic data for CD

Besides the WTCCC data, Franke et al. (2010) provided summary statistics of CD. The samples were from Germany, Cedars-Sinai Medical Center (Los Angeles, California, USA), the Children’s Hospital of Philadelphia, Scotland, Toronto and Italy Crohn’s and Colitis in Childhood Study consortium (of early onset cases), and the National Institute of Diabetes and Digestive and Kidney Disease (NIDDK) and Belgian-French Studies. According to Franke et al. (2010), the samples from the Cedars-Sinai Medical Center were divided into two studies (Cedar1 and Cedar2) and the samples from NIDDK were divided into the Jewish study (NiddkJ) and the non-Jewish study (NiddkNJ). After extracting overlapped SNPs of individual-level data (after quality control) and summary statistics, we had the individual-level data {X}N×M,yN×1 and PM×K for the analysis, where N = 4536, M = 248 409 and K = 7.

3.2.3 Analysis result of Crohn’s disease

We first applied IGESS to analyze the WTCCC data only and then incorporated P-values from each of the seven studies. The results are summarized in Table 1, where the number of GWAS hits is reported based on fdr0.05. The locations of those GWAS hits in the genome are shown in the Manhattan plots in Figure 4. As a reference, the analysis result of Lasso is also reported in Table 1. Clearly, the P-values from the Early Onset dataset offers most relevant information, improving prediction accuracy (measured by AUC) from 63.28% (± 0.37%) to 68.62% (± 0.36%). Meanwhile, we also observe that the incorporation of P-values from the Belge and Cedar2 studies degraded the performance. This is because the P-values from these two studies are inflated, as pointed out in Franke et al. (2010). Therefore, IGESS would not consider the P-values from the Belge and Cedar2 studies in the selection process. As the P-values from the remaining studies been selected in the stepwise manner, the prediction accuracy of IGESS keeps increasing until 69.4% (±0.1%), as shown in Figure 5. The complete list of GWAS hits (fdr0.05) and their corresponding genes are provided in the Supplementary Document.

Table 1.

Summary of IGESS analysis result on Crohn’s disease

Data No. of hitsa Prediction accuracy (AUC)
WTCCC only (BVSR) 7 63.2% ± 0.4%
WTCCC only (Lasso) 227b 63.5% ± 0.5%
WTCCC+Belge 10 62.1% ± 0.6%
WTCCC+Cedar1 2 64.3% ± 0.5%
WTCCC+Cedar2 17 61.9% ± 0.7%
WTCCC+Early Onset 120 68.6% ± 0.3%
WTCCC+German 10 65.4% ± 0.3%
WTCCC+NiddkJ 7 64.1% ± 0.4%
WTCCC+NiddkNJ 3 64.3% ± 0.5%
a

The number of GWAS hits is reported based on estimated fdr0.05.

b

For Lasso, we reported the number of nonzero coefficients with regularization parameter selected by cross-validation using R package ‘glmnet’ (Friedman et al., 2010).

Fig. 4.

Fig. 4.

Manhattan plots of the analysis results of Crohn’s Disease using IGESS. The red lines and blue lines are corresponding to fdr = 0.05, and fdr = 0.1, respectively (Color version of this figure is available at Bioinformatics online.)

Fig. 5.

Fig. 5.

Prediction accuracy measured by AUC during forward stepwise selection

At last, we demonstrated the benefit of using IGESS with different sample sizes of the individual-level data. To do so, we randomly partitioned the WTCCC data (N = 4536) into five folds of nearly equal sample size. We used IGESS to integrate the P-values from the Early Onset study with individual-level data of different sample sizes. Specifically, we used the first k=1,2,3,4 folds of the WTCCC data as the training data (with sample size N = 908, N = 1815, N = 2722 and N = 3629, respectively) and the fifth fold as the testing data. The prediction accuracies of different sample sizes are shown in Figure 6. As we can see here, a more noticeable improvement in prediction can be achieved when the sample size of the individual-level data is smaller. For example, when N = 908, the prediction accuracy is improved by 8.7% from 55.6%(±2.8%) to 64.3%(±1.9%). This result suggests that more benefits can be gained by integrating summary statistics when the sample size of individual level data is small.

Fig. 6.

Fig. 6.

Improvement of prediction accuracy with different sample sizes after incorporating P-values from the Early Onset study (Color version of this figure is available at Bioinformatics online.)

4 Conclusion

It is often difficult to get approval of accessing individual-level data of a large sample size. In contrast, the summary statistics of many GWAS have been widely available via public gateway. The integration of individual-level data and summary statistics is a promising direction to address the challenges raised by the polygenicity of complex phenotypes. There is a great demanding of the methods and tools that can help researchers to mine scientific evidence hidden in the rich data resources. In this paper, we propose a statistical approach, IGESS, to integrating data from both the individual-level and the summary level. The variational EM algorithm is shown to be scalable to handle genome-wide data in minutes. The results from comprehensive simulations suggest the advantages of IGESS and the real data analysis of Crohn’s disease indicates the applicability of IGESS in practice. We believe that IGESS can serve as an effective tool in integrating individual-level data and summary statistics for more powerful genetic analysis.

Supplementary Material

Supplementary Data

Acknowledgments

Funding

This work was supported in part by grant NO. 61501389 from National Science Funding of China, grants NO. 22302815, NO. 12316116 and NO. 12202114 from the Hong Kong Research Grant Council, and grants FRG2/14-15/069, FRG2/14-15/077 and FRG2/15-16/011 from Hong Kong Baptist University, and Duke-NUS Medical School WBS: R-913-200-098-263, and MOE2016-T2-2-029 from Ministry of Eduction, Singapore. The Crohn’s disease GWAS data used in this study is from the Wellcome Trust Case Control Consortium (WTCCC). A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk.

Conflict of Interest: none declared.

References

  1. Allen H.L. et al. (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature, 467, 832–838. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Bishop C.M. (2006) Pattern Recognition and Machine Learning. Springer, New York. [Google Scholar]
  3. Bulik-Sullivan B.K. et al. (2015) Ld score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet., 47, 291–295. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Burton P.R. et al. (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447, 661–678. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Cantor R.M. et al. (2010) Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am. J. Hum. Genet., 86, 6–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Carbonetto P. et al. (2012) Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal., 7, 73–108. [Google Scholar]
  7. Chatterjee N. et al. (2016) Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet., 17, 392–406. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Chung D. et al. (2014) GPA: a statistical approach to prioritizing GWAS results by integrating pleiotropy and annotation. PLoS Genet., 10, e1004787.. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Efron B. (2010) Large-Scale Inference: empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press, Cambridge, vol. 1. [Google Scholar]
  10. Franke A. et al. (2010) Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat. Genet., 42, 1118–1125. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Friedman J. et al. (2010) Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33, 1–22. [PMC free article] [PubMed] [Google Scholar]
  12. Ionita-Laza I. et al. (2013) Sequence kernel association tests for the combined effect of rare and common variants. Am. J. Hum. Genet., 92, 841–853. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Kang H.M. et al. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat. Genet., 42, 348–354. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Liu J. et al. (2013) Incorporating group correlations in genome-wide association studies using smoothed group lasso. Biostatistics, 14, 205–219. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Liu J. et al. (2016) EPS: an empirical Bayes approach to integrating pleiotropy and tissue-specific information for prioritizing risk genes. Bioinformatics, btw081. [DOI] [PubMed] [Google Scholar]
  16. Liu J.Z. et al. (2010) A versatile gene-based test for genome-wide association studies. Am. J. Hum. Genet., 87, 139–145. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Manolio T.A. et al. (2009) Finding the missing heritability of complex diseases. Nature, 461, 747–753. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Mitchell T.J., Beauchamp J.J. (1988) Bayesian variable selection in linear regression. J. Am. Stat. Assoc., 83, 1023–1032. [Google Scholar]
  19. Pasaniuc B., Price A. (2016) Dissecting the genetics of complex traits using summary association statistics. Nature Rev. Genet., 18, 117–127. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Purcell S. et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81, 559–575. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Stephens M., Balding D.J. (2009) Bayesian statistical methods for genetic association studies. Nat. Rev. Genet., 10, 681–690. [DOI] [PubMed] [Google Scholar]
  22. Tibshirani R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological), 58, 267–288. [Google Scholar]
  23. Visscher P.M. et al. (2008) Heritability in the genomics eraconcepts and misconceptions. Nat. Rev. Genet., 9, 255–266. [DOI] [PubMed] [Google Scholar]
  24. Visscher P.M. et al. (2012) Five years of GWAS discovery. Am. J. Hum. Genet., 90, 7–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Wan X. et al. (2010) BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am. J. Hum. Genet., 87, 325–340. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Welter D. et al. (2014) The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res., 42, D1001–D1006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Wood A.R. et al. (2014) Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet., 46, 1173–1186. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Wu T.T. et al. (2009) Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25, 714–721. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Yang J. et al. (2010) Common SNPs explain a large proportion of the heritability for human height. Nat. Genet., 42, 565–569. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Yang J. et al. (2011) GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet., 88, 76–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Yang J. et al. (2015) Genome-wide genetic homogeneity between sexes and populations for human height and body mass index. Hum. Mol. Genet., 24, 7445–7449. [DOI] [PubMed] [Google Scholar]
  32. Zhang K. et al. (2010) i-GSEA4GWAS: a web server for identification of pathways/gene sets associated with traits by applying an improved gene set enrichment analysis to genome-wide association study. Nucleic Acids Res., 38, W90–W95. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Zhou X., Stephens M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nat. Genet., 44, 821–824. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Zhou X. et al. (2013) Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genet., 9, e1003264.. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Zhu X. et al. (2015) Meta-analysis of correlated traits via summary statistics from GWASs with an application in hypertension. Am. J. Hum. Genet., 96, 21–36. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES