Application of Bayesian regression with singular value decomposition method in association studies for sequence data

Soonil Kwon; Xiaofei Yan; Jinrui Cui; Jie Yao; Kai Yang; Donald Tsiang; Xiaohui Li; Jerome I Rotter; Xiuqing Guo

2011 Nov 29;5(Suppl 9):S57. doi: 10.1186/1753-6561-5-S9-S57

PMCID: PMC3287895 PMID: 22373181

Abstract

Genetic association studies usually involve a large number of single-nucleotide polymorphisms (SNPs) (k) and a relatively small sample size (n), so that k is much greater than n. Because conventional statistical approaches cannot handle multiple SNPs simultaneously when k is much greater than n, single-SNP association studies have been used to identify genes involved in a disease’s pathophysiology, which creates a multiple testing problem. To evaluate the contribution of multiple SNPs simultaneously to disease traits when k is much greater than n, we developed the Bayesian regression with singular value decomposition (BRSVD) method. The method reduces the dimension of the design matrix from k to n by applying singular value decomposition to the design matrix. We evaluated the model using a Markov chain Monte Carlo simulation with a Gibbs sampler constructed from the posterior densities derived from conjugate prior densities. Permutation was incorporated to generate empirical p-values. We applied the BRSVD method to the sequence data provided by Genetic Analysis Workshop 17 and, by comparing it with the single-SNP association test and the penalized regression method, found it to be a practical method for analyzing sequence data.

Background

Association studies usually involve a large number of single-nucleotide polymorphisms (SNPs) (k) and a relatively small number of samples (n). To avoid multiple testing problems and to consider the effect of multiple SNPs simultaneously, investigators need statistical models that will test multiple SNPs simultaneously. Because standard statistical methods are unable to analyze multiple SNPs simultaneously when k is much greater than n, Tibshirani [1] introduced the penalized regression (PR) method as an alternative. The method reduces the size of SNP coefficients by treating the coefficients with little effect as zero. In other words, only those SNPs that significantly improve prediction are kept in the model. A potential drawback of this method is that a SNP with a strong marginal effect might be removed from the model if some other SNPs can explain the effect. A second drawback is that the number of SNPs evaluated in the model is controlled by the chosen penalization parameter. Even though the PR method does evaluate multiple SNPs simultaneously when k is much greater than n, the maximum number of SNPs that can be evaluated in the model is limited by sample size; that is, the method usually cannot test all SNPs simultaneously in large-scale genetic association studies, such as genome-wide association studies.

To evaluate all SNPs simultaneously in one statistical model, we introduced the Bayesian classification with singular value decomposition (BCSVD) method [2]. The BCSVD method can be applied to a dichotomous response variable when k is much greater than n. The method achieves a massive dimension reduction by applying singular value decomposition to the design matrix in a binary probit model; it estimates the effect of SNPs through the reduced model. Selection of significant SNPs can be achieved by using the empirical p-values obtained from permutation. The BCSVD method handles small sample sizes quite well.

To analyze quantitative traits when k is much greater than n, we further developed the Bayesian regression with singular value decomposition (BRSVD) method. We applied the BRSVD method to the sequence data provided by Genetic Analysis Workshop 17 (GAW17). By comparing it with the single-SNP association test and the PR method, we show that the BRSVD method is a practical method for analyzing sequence data.

Methods

BRSVD method

Let us consider the standard regression model in matrix form:

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n) \tag{1}$$

where y is an n × 1 vector of quantitative dependent variables, X is the n × k design matrix, β is a k × 1 vector of parameters to be estimated, I_n is the n × n identity matrix, and σ² is an unknown variance; as before, k and n are the number of SNPs and the number of samples, respectively. Applying singular value decomposition (SVD) to the transposed design matrix gives X′ = ADF′, so the model in Eq. (1) can be written:

$$y = FDA'\beta + \varepsilon = L\gamma + \varepsilon \tag{2}$$

where L = FD and:

$$\gamma = A'\beta \tag{3}$$
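The reduction from k to n parameters can be checked numerically. The following is a minimal sketch, assuming a thin SVD (A is k × n with orthonormal columns, D is n × n diagonal, F is n × n orthogonal) and a toy genotype matrix with hypothetical 0/1/2 codes; the sizes, seed, and variable names are illustrative and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 200                                        # toy sizes with k >> n
X = rng.integers(0, 3, size=(n, k)).astype(float)     # hypothetical genotype codes
beta = rng.normal(size=k)                             # arbitrary coefficient vector

# Thin SVD of X' (k x n): X' = A D F'
A, d, Ft = np.linalg.svd(X.T, full_matrices=False)
D = np.diag(d)
F = Ft.T

L = F @ D             # n x n matrix that replaces the n x k design matrix
gamma = A.T @ beta    # superfactor vector of length n

# The reduced model has the same linear predictor as the full model.
assert np.allclose(X, F @ D @ A.T)
assert np.allclose(X @ beta, L @ gamma)
print(L.shape, gamma.shape)
```

Because X = FDA′, the products Xβ and Lγ coincide for any β, so inference on the n-dimensional γ loses nothing about the fitted values.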

As in Kwon et al. [2], we call γ a superfactor vector because it is expressed as a linear combination of the original parameters β. Statistical inference is carried out on the superfactor vector rather than on β. From Eq. (2), the likelihood function of y given (γ, σ²) can be obtained as:

(4)

where:

(5)

and γ̂ is the maximum-likelihood (or least-squares) estimator of γ. Let us choose prior densities for (β|σ²) and σ² as:

(6)

and

(7)

where IG is the inverted gamma distribution and (β*, m, a, b) are known hyperparameters. Because γ = A′β, the conjugate prior density on β implies the conjugate prior density on γ so that:

(8)

Thus the prior density on (γ, σ²) can be expressed as:

(9)

The joint posterior distribution for (γ, σ²) can be obtained by multiplying the likelihood function in Eq. (4) by the prior density in Eq. (9):

(10)

where

(11)

(12)

(13)

(14)

(15)

The marginal densities for γ and σ² can be obtained by integrating Eq. (10) with respect to σ² and γ, respectively. Given the observed data, the marginal posterior density for γ is a multivariate Student’s t distribution in which each element is a Student’s t distribution with (n + a) degrees of freedom and the marginal density for σ² is:

(16)
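With conjugate priors, each full conditional has a standard form, which is what a Gibbs sampler draws from in turn. The sketch below is illustrative only. It assumes a prior γ | σ² ~ N(γ*, (σ²/m) I_n) and σ² ~ IG(a, b) with a as shape and b as scale; this parameterization, the function name `gibbs_superfactor`, and all default values are assumptions and may differ from the hyperparameter convention used in the paper.

```python
import numpy as np

def gibbs_superfactor(y, L, gamma_star, m, a, b, n_iter=2000, burn_in=500, seed=0):
    """Gibbs sampler for y = L gamma + eps under an assumed normal/inverse-gamma prior."""
    rng = np.random.default_rng(seed)
    n_obs, n = L.shape
    # Quantities that stay fixed across iterations.
    cov_unit = np.linalg.inv(L.T @ L + m * np.eye(n))   # Cov(gamma | sigma2, y) / sigma2
    cov_unit = (cov_unit + cov_unit.T) / 2              # enforce exact symmetry
    gamma_bar = cov_unit @ (L.T @ y + m * gamma_star)   # conditional posterior mean
    chol = np.linalg.cholesky(cov_unit)

    # Start the chain at the least-squares estimate of gamma.
    gamma = np.linalg.lstsq(L, y, rcond=None)[0]
    draws_gamma, draws_sigma2 = [], []
    for it in range(n_iter):
        # sigma2 | gamma, y ~ IG(shape, scale)
        resid = y - L @ gamma
        dev = gamma - gamma_star
        shape = a + (n_obs + n) / 2
        scale = b + 0.5 * (resid @ resid + m * (dev @ dev))
        sigma2 = scale / rng.gamma(shape)               # inverse-gamma draw via 1 / Gamma
        # gamma | sigma2, y ~ N(gamma_bar, sigma2 * cov_unit)
        gamma = gamma_bar + np.sqrt(sigma2) * (chol @ rng.standard_normal(n))
        if it >= burn_in:
            draws_gamma.append(gamma)
            draws_sigma2.append(sigma2)
    return np.array(draws_gamma), np.array(draws_sigma2)

# Toy usage with simulated data (not GAW17 data).
rng = np.random.default_rng(1)
n = 15
L_toy = rng.normal(size=(n, n))
y_toy = L_toy @ rng.normal(size=n) + 0.1 * rng.normal(size=n)
g_draws, s2_draws = gibbs_superfactor(y_toy, L_toy, np.zeros(n), m=1e-3, a=2.0, b=1.0)
print(g_draws.shape, s2_draws.mean())
```

The posterior mean of the retained γ draws would then serve as the superfactor estimate.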

With these posterior distributions, γ can be estimated through a Markov chain Monte Carlo simulation with a Gibbs sampler, which starts from the maximum-likelihood estimate. To transform the superfactor vector γ in Eq. (2) back to β, our original parameter vector of interest, we use the most general form of the solution to the linear equation γ = A′β and obtain a unique solution for β by choosing A as the generalized inverse of A′ [3]. We use a permutation test to estimate the significance of the SNP effects on the phenotype. Let $\hat{\beta}_i$ be the estimate of the ith SNP effect from the raw data, and let $\hat{\beta}_{ij}$ be the estimate of the ith SNP effect from the jth shuffled data set (j = 1, …, J), obtained by permuting the quantitative trait y. Define $d_{ij}$ as the difference between $\hat{\beta}_i$ and $\hat{\beta}_{ij}$. Then the test statistic can be defined as:

$$\Lambda_i = \frac{\bar{d}_i}{\mathrm{SE}(\bar{d}_i)} \tag{17}$$

where $\bar{d}_i$ is the sample mean of $d_{ij}$ and $\mathrm{SE}(\bar{d}_i)$ is the standard error of $\bar{d}_i$. Under the null hypothesis (H₀: β_i = 0), the statistic Λ_i follows the standard normal distribution when J is large:

$$\Lambda_i \sim N(0, 1) \tag{18}$$
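The back-transformation and the permutation statistic can be sketched as follows. To keep the example short, the Gibbs estimate of γ is replaced by a closed-form stand-in, the conditional posterior mean (L′L + mI)⁻¹L′y under an assumed zero-centered normal prior; this substitution, the function name `permutation_statistic`, and the reading of the standard error as the sample standard deviation of the differences divided by √J are all assumptions, not details confirmed by the paper. Covariates are omitted.

```python
import numpy as np

def permutation_statistic(X, y, n_perm=200, m=1.0, seed=0):
    """Return (beta_raw, Lambda) for every SNP using permuted copies of y."""
    rng = np.random.default_rng(seed)
    A, d, Ft = np.linalg.svd(X.T, full_matrices=False)   # X' = A D F'
    L = Ft.T * d                                         # L = F D (columns of F scaled by d)
    n = L.shape[1]
    # Linear map y -> beta_hat = A gamma_hat, where A acts as the generalized inverse of A'.
    # gamma_hat here is a closed-form stand-in for the Gibbs estimate (assumption).
    H = A @ np.linalg.solve(L.T @ L + m * np.eye(n), L.T)
    beta_raw = H @ y
    # Estimates from J shuffled traits; X is unchanged, so H is reused.
    beta_perm = np.stack([H @ rng.permutation(y) for _ in range(n_perm)])   # J x k
    d_ij = beta_raw - beta_perm                          # differences, one row per shuffle
    d_bar = d_ij.mean(axis=0)
    se = d_ij.std(axis=0, ddof=1) / np.sqrt(n_perm)      # assumed form of the standard error
    return beta_raw, d_bar / se

# Toy usage: the first SNP has a real effect, the others do not (simulated data).
rng = np.random.default_rng(2)
n, k = 30, 100
X_toy = rng.integers(0, 3, size=(n, k)).astype(float)
y_toy = 1.0 * X_toy[:, 0] + rng.normal(size=n)
beta_raw, lam = permutation_statistic(X_toy, y_toy)
print(lam[:5])
```

Since permuting y leaves X unchanged, the SVD is computed once and reused for every shuffle, which keeps the permutation loop cheap even when k is large.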

Study sample and association analysis

We used the GAW17 data on unrelated individuals, which include 697 individuals, 24,487 SNPs, and 3 covariates (sex, age, and smoking status). We analyzed the first 10 replicates of the phenotype for quantitative risk factor Q1. First, we performed the single-SNP association test using the simple linear regression model option in PLINK [4]. Second, we applied the PR method with the L1 penalty introduced by Tibshirani [1] using the R package monomvn [5]. At each step we evaluated SNP association with Q1 for up to the maximum number of SNPs allowed by the package, which is min(k, n − intercept). Because the package does not provide p-values, we used the same permutation technique as in the BRSVD method to obtain empirical p-values. Third, we implemented the BRSVD method. To define significant SNPs, we considered the following statistical model for each method: Q1 versus a single SNP and the three covariates for the single-SNP association test; Q1 versus the maximum number of SNPs allowed by the package plus the three covariates for the PR method; and Q1 versus all 24,487 SNPs and the three covariates for the BRSVD method. The SNPs identified as significant by each model were compared with the 39 SNPs listed in the answer sheet distributed by GAW17. The analyses were run for each of the first 10 replicates, and the results were averaged over the 10 replicates (see Results section).

Results and discussion

Single-SNP association

Using p < 10⁻⁵ as the cut point, which approximates a genome-wide significance level of 0.05 after Bonferroni correction, the single-SNP test identified age and 50 SNPs as risk factors for Q1 (Figure 1). Compared with the answer sheet distributed by GAW17, which lists 39 SNPs associated with Q1, only 2 of the 50 SNPs (C13S522 and C13S523) were correctly identified.
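As a quick arithmetic check on the cut point, the exact Bonferroni threshold for 24,487 tests at a family-wise level of 0.05 can be computed directly. The snippet below is only an illustration of that calculation; the variable names are arbitrary.

```python
import math

n_tests = 24_487                 # number of SNPs tested one at a time
alpha = 0.05                     # family-wise significance level
bonferroni = alpha / n_tests     # exact Bonferroni per-test threshold

print(f"{bonferroni:.2e}")       # about 2.04e-06
print(-math.log10(bonferroni))   # the same threshold on the -log10 scale
print(-math.log10(1e-5))         # the rounded cut point on the -log10 scale
```

The exact threshold is roughly 2 × 10⁻⁶, so the cut point of 10⁻⁵ is a rounded and somewhat more lenient approximation.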

PR method

The results from the PR method are shown in Figure 2. With a cut point of p = 0.05 (i.e., −log₁₀(p) = 1.3), age was again identified as a risk factor for Q1, and 15 SNPs were found to be significant. However, only 3 of the 15 SNPs (C13S523, C13S522, and C4S1884) were on the list of 39 risk SNPs.

**Association results from the penalized regression method.** x-axis: all SNPs on chromosomes 1–22, numbered from 1 to 24,487. y-axis: −log₁₀(p-value). The three correctly identified SNPs are indicated.