Abstract
A genome-wide association study involves examining a large number of single-nucleotide polymorphisms (SNPs) to identify SNPs that are significantly associated with the given phenotype, while trying to reduce the false positive rate. Although haplotype-based association methods have been proposed to accommodate correlation information across nearby SNPs that are in linkage disequilibrium, none of these methods directly incorporates structural information such as recombination events along the chromosome. In this paper, we propose a new approach called stochastic block lasso for association mapping that exploits prior knowledge on linkage disequilibrium structure in the genome, such as recombination rates and distances between adjacent SNPs, in order to increase the power of detecting true associations while reducing false positives. Following a typical linear regression framework with the genotypes as inputs and the phenotype as output, our proposed method employs a sparsity-enforcing Laplacian prior for the regression coefficients, augmented by a first-order Markov process along the sequence of SNPs that incorporates the prior information on the linkage disequilibrium structure. The Markov-chain prior models the structural dependencies between a pair of adjacent SNPs, and allows us to look for association SNPs in a coupled manner, combining strength from multiple nearby SNPs. Our results on HapMap-simulated datasets and mouse datasets show that there is a significant advantage in incorporating the prior knowledge on linkage disequilibrium structure for marker identification under whole-genome association.
Key words: association mapping, lasso, linkage disequilibrium
1. Introduction
Recent advances in high-throughput genotyping technology have allowed researchers to generate a high volume of genotype data at a relatively low cost. An association study involves examining genotypes and phenotypes such as disease status, gene expression, and clinical traits over individuals in a population in order to discover genetic markers whose variations give rise to the variation in the given phenotype. Genetic variations in a single nucleotide, called single nucleotide polymorphisms (SNPs), provide a useful set of genetic markers for such association studies, since they are relatively common across the genome. In addition, the relatively low cost involved in typing a large number of such SNPs makes a whole-genome association study feasible. The challenge in this type of study is to identify a small subset of SNPs associated with the phenotype among the full set of SNPs that can be as large as 8 million (The International HapMap Consortium, 2005).
In a simple single-marker test, which has been widely used for detecting an association (Stranger et al., 2005; Cheung et al., 2005), one examines the correlation between the given phenotype and the frequencies of each polymorphic allele for one SNP marker at a time to compute a p-value for the SNP, and declares the SNPs with low p-values significant. Instead of examining a single SNP at a time as in a single-marker association test, a different approach based on a multivariate linear regression method that considers all of the SNPs jointly in a single statistical model has been used to discover the causes of complex diseases that are believed to be controlled by multiple genetic loci (Servin and Stephens, 2007; Shi et al., 2007; Malo et al., 2008; Yi and Xu, 2008; Wu et al., 2009). In this approach, regression coefficients for each of the SNPs are estimated, using the SNPs as input variables and the phenotype as output variable, and the estimated regression coefficients are used to determine the significance of the association for each SNP. In order to address the challenge of reducing false positives and selecting a small number of truly relevant inputs (SNPs) affecting the phenotype, regularized regression methods such as ridge regression and lasso have been used, which shrink the regression coefficients for irrelevant SNPs to zero or to values close to zero (Shi et al., 2007; Malo et al., 2008).
These approaches based on a single-marker test or multivariate regression methods assume that the SNPs are independent of each other, and do not take into account an important correlation structure, called linkage disequilibrium (LD), present in the sequence of SNPs. Although lasso has been known to be one of the most effective methods among the sparse regression methods (Tibshirani, 1996), it has been shown that lasso fails to recover the true relevant inputs (SNPs) if there are high correlations between relevant inputs or between relevant and irrelevant inputs, even when the sample size is large (Zhao and Yu, 2006). In reality, the states of SNPs that are adjacent in the genome can be tightly coupled (i.e., in LD). When an individual inherits chromosomal material from each of the parents, a recombination event can break the parental chromosomes into non-random inheritable segments, causing SNPs within each segment to be inherited together with high probability, and preventing random combinations of all possible SNP states within the segment. Since the recombination sites are non-uniformly distributed across the genome, recombination events in chromosomes over generations lead to a block structure among SNPs on the chromosome. As illustrated in Figure 1, each chromosome is a mosaic of ancestor chromosomes, where segments of SNPs of the same color have been inherited from the same ancestor chromosome. The true association SNPs are indicated by × in Figure 1. Since a chromosome segment carrying causal alleles can be inherited as a block, we can take advantage of this block structure to increase the power for detecting associations by considering a block of linked SNPs jointly.
FIG. 1.
An illustration of linkage disequilibrium structure in genotypes in an association study.
A multi-marker approach has been proposed to take into account this LD pattern by testing a short segment of SNP markers, or a haplotype, for an association (Zaitlen et al., 2006; Zhang et al., 2002). The main problem with this approach is that it is not obvious how many SNPs should be included in a haplotype for an association test. Scanning the genome with a sliding window of fixed-length haplotypes does not offer flexibility, since the length of the block can vary depending on where the recombination sites have occurred. On the other hand, an exhaustive approach that considers haplotypes of all lengths up to a maximum can result in a serious multiple testing problem. Several methods have been developed to address this problem of determining the size of haplotypes to be considered for an association test. One such approach used a variable-length Markov chain to represent haplotypes, and tested its edges for associations, avoiding the problem of explicitly determining haplotype boundaries (Browning, 2006). A different approach based on a regularized regression method used haplotypes of all possible lengths within a window of a fixed size as inputs, and fit an L1-regularized regression to select haplotypes relevant to the phenotype under consideration (Li et al., 2007). The selected haplotypes with non-zero regression coefficients were considered informative, and were tested for association in a further analysis.
Although these approaches incorporated a block structure with possibly varying haplotype lengths in the sequence of SNPs, their strategy for determining the block boundaries did not consider the effects of the recombination events that formed such block structure in the first place. Ideally, we would like to place block boundaries between two SNPs that are decoupled by a recombination event. While we do not directly observe where recombinations occurred in the genome, much research effort has been devoted to understanding the LD structure caused by non-random recombination events, and many statistical methods have been developed to estimate recombination rates (Li and Stephens, 2003; Sohn and Xing, 2007). Distances between a pair of adjacent SNPs also provide useful information, since SNPs that are far apart on the chromosome are more likely to be decoupled by a recombination event. Thus, our goal is to take advantage of the information available on the LD structure, such as recombination rates estimated by previously developed statistical methods and distances between adjacent SNPs, in order to decide the block boundaries for relevant SNPs.
Sparse regression methods such as fused lasso (Tibshirani et al., 2005) and group lasso (Yuan and Lin, 2006), which assume a structure in the inputs when performing variable selection, have been proposed previously. However, these methods have limitations when applied to the problem of association mapping, since they do not allow one to take into account the LD structure in a flexible manner. For example, while fused lasso (Tibshirani et al., 2005) encourages the regression coefficients for two adjacent inputs (or SNPs) to have the same value by fusing them, the amount of such fusion does not depend on structural information such as recombination rates and distances between two SNPs. Even if two SNPs are located far apart on the chromosome with a low correlation, and are thus unlikely to influence the phenotype jointly with the same regression coefficients, fused lasso does not treat these SNPs differently from two tightly linked SNPs. Group lasso (Yuan and Lin, 2006) assumes that the group structure in the inputs (SNPs) is known a priori and uses the groups of inputs instead of individual inputs as a unit for variable selection. Since the haplotype block boundaries are not known, it is not obvious how to define groups of SNPs to be used in association mapping with group lasso.
In this article, we propose a statistical model, called stochastic block lasso, for association mapping that explicitly takes advantage of such prior information on the LD structure. The model combines information among correlated SNPs within each LD block, and considers the set of SNPs in each block jointly to decide whether the block consists of irrelevant SNPs or forms a candidate region with one or more causal SNPs. The model determines boundaries of such LD blocks probabilistically based on the prior information on the LD structure, so that the block boundaries are more likely to occur between a pair of SNPs with a high recombination rate and a large distance.
We build on a linear regression model that encourages sparsity in estimated regression coefficients to select a small number of true relevant SNPs, and extend it by augmenting the model with a prior probability distribution that incorporates the LD structure. In order to enforce sparsity in the regression model, we place a Laplacian prior distribution on regression coefficients, similar to the L1 penalty in lasso (Tibshirani, 1996). In addition, we propose to represent the dependencies between adjacent SNPs as a first-order Markov chain, where the probability of each SNP belonging to a candidate region containing causal SNPs depends on whether the adjacent SNPs also belong to the same candidate region or haplotype. The amount of this dependency between two adjacent SNPs is controlled by the recombination rate and distance between the two neighboring SNPs, and is modeled by the Markov-chain transition probabilities, which are functions of this prior information on the LD structure. The stochastic nature of the Markov-chain prior distribution allows us to determine the block boundaries adaptively given data, without having to rely on fixed-sized blocks of SNPs. We focus on continuous-valued phenotypes, although the method can be extended in a straightforward manner to a logistic regression to handle discrete phenotypes. In our simulation study based on HapMap data, we demonstrate that when a block of SNPs in LD contains multiple causal SNPs, our method significantly improves the power of detecting associations by combining information across correlated SNPs. Even under the scenario of a single causal SNP in an LD block that a single-SNP analysis assumes, we show that stochastic block lasso outperforms other traditional methods. In addition, we demonstrate our method on a mouse dataset.
2. Methods
We formulate the problem of finding SNPs significantly associated with the given phenotype as a multivariate regression, where the genotypes at SNP loci are considered as inputs and the phenotype as output. In this setting, the regression coefficients indicate the association strengths for each SNP, and sparsity-biasing regression methods such as lasso which set many of the regression coefficients to zero can be used to reduce false positives.
In stochastic block lasso, we consider a Bayesian formulation of lasso, called stochastic lasso, and introduce a Markov-chain prior to this method to account for the LD structure when determining the block boundary of relevant SNPs. The Markov chain allows us to determine the block boundary probabilistically, while considering prior information on linkage disequilibrium structure such as recombination rates and distances between adjacent SNPs. Thus, unlike the previously developed sparse regression methods such as group lasso (Yuan and Lin, 2006), which used fixed groups of inputs (SNPs), our method allows the block boundaries to be unknown a priori.
Below, we give a brief overview of regularized regression methods including lasso, introduce stochastic lasso as lasso in a Bayesian setting with a Laplacian prior, and describe how to enhance this model to stochastic block lasso with a Markov-chain prior.
2.1. Genetic association mapping and regularized regression methods
In this section, we set up a multivariate regression framework for genetic association mapping, and describe lasso as an association analysis method for identifying a few causal SNPs while controlling false positives.
Let X be an N × J input matrix, where each element xij takes values from {0, 1, 2} according to the number of minor alleles in the genotype of the jth SNP of the ith individual. Note that in our analysis, the J SNP markers are assumed to be ordered by their positions on the chromosome. In addition, we let y denote an N × 1 vector of measurements for a given phenotype for the same N individuals. Assuming a strictly additive genetic effect of each allele, we apply a standard linear regression model as follows:
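$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{1}$$
where β = (β1, …, βJ) is a vector of J regression coefficients, and the noise ε is modeled as having a normal distribution with mean 0 and variance σ². The regression coefficients β can be estimated by the standard method of minimizing the sum of squared residuals as follows:
$$\hat{\boldsymbol{\beta}} = \operatorname*{argmin}_{\boldsymbol{\beta}} \; (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{T} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}). \tag{2}$$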
Given this estimate of β, the SNPs with high regression coefficients are generally considered as relevant to the given phenotype.
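As a small illustration of this setup, the sketch below encodes genotypes as minor-allele counts and evaluates the sum of squared residuals for a candidate β under the additive model. The function names and the toy genotypes are hypothetical and chosen only for illustration; they are not part of the original analysis.

```python
def minor_allele_counts(genotypes, minor_allele):
    """Encode genotypes such as "AG" as the number of minor alleles (0, 1, or 2)."""
    return [g.count(minor_allele) for g in genotypes]


def predict(X, beta):
    """Additive model: the fitted phenotype of individual i is sum_j x_ij * beta_j."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


def residual_sum_of_squares(X, y, beta):
    """The least-squares criterion (y - X beta)^T (y - X beta)."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, predict(X, beta)))


# Toy data (hypothetical): 4 individuals, 2 SNPs.
snp1 = minor_allele_counts(["AA", "AG", "GG", "AG"], minor_allele="G")
snp2 = minor_allele_counts(["CC", "CC", "CT", "TT"], minor_allele="T")
X = [list(row) for row in zip(snp1, snp2)]
y = [0.0, 0.5, 1.0, 0.5]  # noise-free phenotype generated with beta = (0.5, 0.0)

rss_true = residual_sum_of_squares(X, y, [0.5, 0.0])
rss_null = residual_sum_of_squares(X, y, [0.0, 0.0])
```

Here the β that generated the toy phenotype attains a smaller residual sum of squares than the all-zero β, which is the quantity the least-squares estimate minimizes.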
In a typical genome-wide association mapping, a large number of marker loci are examined with the goal of identifying (usually a small number of) genetic markers truly associated with the phenotype. A naive application of the standard regression method in Equation (2) cannot handle such a high-dimensional problem with large J and many irrelevant SNPs. In order to obtain a stable and interpretable estimate of the regression coefficients β under large J, regularized regression methods such as ridge regression (Malo et al., 2008) and lasso (Tibshirani, 1996; Wu et al., 2009) have been introduced, which set the regression coefficients for irrelevant SNPs to zero or values near zero in the estimate of β. Generally, estimating regression coefficients in a regularized regression can be formulated as the following optimization problem:
$$\hat{\boldsymbol{\beta}} = \operatorname*{argmin}_{\boldsymbol{\beta}} \; (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{T} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\, T(\boldsymbol{\beta}), \tag{3}$$
where T(β) is a regularization function that influences the sparsity of the estimated β by penalizing the sum of squared residual criterion, the first term in Equation (3). The λ in Equation (3) is the regularization parameter that controls the amount of penalization and thus, sparsity. A large value of λ leads to more penalization by T(β), causing the estimated β to be more sparse with more SNPs having zero or near-zero regression coefficients. Ridge regression with T(β) = Σj βj² shrinks the regression coefficients toward zero, but it does not set the regression coefficients of irrelevant markers to exactly zero. Lasso that penalizes the sum of squared residuals with T(β) = Σj ∣βj∣ has the property of setting the regression coefficients for non-causal SNPs exactly to zero. Lasso has been shown to be very effective in selecting relevant inputs (SNPs) (Tibshirani, 1996) and has been previously applied to the genetic association mapping problem (Shi et al., 2007). However, under the standard lasso framework, we are not guaranteed to recover the true relevant SNPs even with a large number of samples N, when the SNPs are strongly correlated as is the case in genomic data.
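The contrast between the two penalties can be seen in a minimal coordinate-descent sketch for the objective in Equation (3). Coordinate descent is used here only as a convenient illustration and is not claimed to be the solver used in the cited work; the function names and toy data are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def coordinate_descent(X, y, lam, penalty, n_sweeps=200):
    """Minimize RSS + lam * T(beta), updating one coefficient at a time.

    penalty="lasso" uses T(beta) = sum_j |beta_j|;
    penalty="ridge" uses T(beta) = sum_j beta_j^2.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: remove the contribution of every SNP except j.
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            if z == 0.0:
                beta[j] = 0.0  # a monomorphic SNP carries no information
            elif penalty == "lasso":
                beta[j] = soft_threshold(rho, lam / 2.0) / z
            else:
                beta[j] = rho / (z + lam)
    return beta


# Toy orthogonal design (hypothetical): SNP 2 has only a weak marginal signal.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2]
beta_lasso = coordinate_descent(X, y, lam=1.0, penalty="lasso")
beta_ridge = coordinate_descent(X, y, lam=1.0, penalty="ridge")
```

In this toy case the lasso update sets the second coefficient exactly to zero, whereas the ridge update only shrinks it, matching the behavior described above.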
2.2. Stochastic block lasso for association mapping
Instead of formulating the association-mapping problem as an optimization with a penalty term as in lasso in Equation (3) with T(β) = Σj ∣βj∣, we set up an equivalent full probability model, stochastic lasso, that describes how the observations for the phenotype are generated given genotypes, and place prior distributions on the unknown parameters in a Bayesian setting. Working within this probabilistic framework provides us with a natural way of introducing a Markov chain as a prior probability distribution to model the dependencies between SNPs as we detail in this section. Below, we describe stochastic lasso for genetic association mapping, and then, introduce a new method, stochastic block lasso.
2.2.1. Stochastic lasso as an association model
In stochastic lasso, instead of optimizing Equation (3) to obtain a point estimate of the regression coefficients β, we use a Bayesian formulation, and obtain a full posterior distribution of β, given observed data {X, y} and a prior distribution on β. Using Bayes' rule, this posterior distribution can be written as follows:
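$$p(\boldsymbol{\beta}, \mathbf{c} \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) \, p(\boldsymbol{\beta} \mid \mathbf{c}) \, P(\mathbf{c}), \tag{4}$$
where we introduced a set of indicator variables c = (c1, …, cJ), with each cj ∈ {0, 1} representing whether the jth SNP is a candidate causal SNP or not. The first term on the right-hand side of Equation (4) is the likelihood of the phenotype y given the genotypes X and regression coefficients β, and is modeled as N(Xβ, σ²). The next two terms p(β∣c) and P(c) in Equation (4) are called prior distributions, and are defined as we detail below.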
We design the prior distribution on β, p(β∣c), such that it encourages sparsity, and model each βj as being generated from a conditional probability distribution given cj as follows:
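$$p(\beta_j \mid c_j) = \begin{cases} \mathbb{I}(\beta_j = 0) & \text{if } c_j = 0, \\ \dfrac{1}{2\lambda} \exp\!\left(-\dfrac{|\beta_j|}{\lambda}\right) & \text{if } c_j = 1, \end{cases} \tag{5}$$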
where λ is the parameter that controls the amount of sparsity. In Equation (5), if cj = 0, the jth SNP is irrelevant with respect to the given phenotype, and the corresponding regression coefficient βj is set to 0. If cj = 1, the jth SNP is a candidate causal SNP, and we model βj as coming from a Laplacian distribution. As the term ∣βj∣ in the Laplacian distribution in Equation (5) implies, using the Laplacian distribution for sparsity in stochastic lasso is equivalent to using the lasso penalty in a Bayesian setting (Tibshirani, 1996). In particular, when the posterior distribution is summarized using the mode of the distribution, the problem of finding this mode under the given likelihood model for p(y∣X, β) and the Laplacian prior on the βj's corresponds to optimizing Equation (3) under the lasso penalty T(β) = Σj ∣βj∣. With a small value of λ, the Laplacian function in Equation (5) is highly peaked around zero, and the βj tends to be shrunk more strongly toward zero, resulting in a sparser model with only a few SNPs having large regression coefficients (the λ in Equation (5) thus acts in the opposite direction to the regularization parameter λ in Equation (3), where larger values give sparser estimates). Instead of using a fixed value for λ or determining the value of λ through cross-validation, we model λ as coming from a prior distribution Inv-gamma(α, γ), and learn the appropriate value of λ during the estimation process (Park and Casella, 2008). We also place an inverse-gamma prior on σ².
In a Bayesian setting, models similar to the one described above have been used in the literature of Bayesian variable selection with various distributions in Equation (5) (George and McCulloch, 1993; Ishwaran and Rao, 2005; Yuan and Lin, 2005; Yi and Xu, 2008). In most of these works, the prior distribution on c, P(c), was defined such that the cj's come from a Bernoulli distribution with parameter p, assuming that the cj's are independent of each other. This independence assumption is not appropriate for the problem of association mapping, since nearby SNP markers are known to be highly correlated because of linkage disequilibrium. In the next section, we propose to use a Markov chain to model the dependencies among cj's in order to take advantage of the correlation information in the genome.
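To make the two-component prior on βj concrete, the following sketch draws coefficients given the indicators. It assumes the Laplacian is parameterized by a scale λ, so that a smaller λ gives a density more peaked at zero as described above; this parameterization and the function names are assumptions made for illustration.

```python
import math
import random


def laplace_density(beta_j, lam):
    """Laplace density with location 0 and scale lam (assumed parameterization)."""
    return math.exp(-abs(beta_j) / lam) / (2.0 * lam)


def sample_beta_given_c(c_j, lam, rng):
    """Draw beta_j from the prior: exactly 0 if c_j = 0, Laplace(0, lam) if c_j = 1."""
    if c_j == 0:
        return 0.0
    # A Laplace(0, lam) draw is the difference of two independent
    # exponential draws, each with mean lam.
    return rng.expovariate(1.0 / lam) - rng.expovariate(1.0 / lam)


rng = random.Random(0)
c = [0, 1, 0, 1, 1]  # hypothetical indicator assignments
beta = [sample_beta_given_c(c_j, lam=0.5, rng=rng) for c_j in c]

peak_small_lam = laplace_density(0.0, lam=0.1)
peak_large_lam = laplace_density(0.0, lam=1.0)
```

Coefficients with cj = 0 are exactly zero in every draw, and the density at zero is higher for the smaller λ.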
2.2.2. Markov chain prior for block structure
Because of the non-random nature of the recombination process, blocks of genetic material are inherited from one generation to the next generation instead of an individual nucleotide. Thus, when a disease-causing mutation occurs on a chromosome, the SNPs surrounding the causal SNP that have not been broken apart by a recombination event are tightly linked to the causal SNP with a high linkage disequilibrium. Because of this linkage disequilibrium structure, we can consider a block of highly correlated SNPs instead of a single SNP for an association, and determine whether these SNPs jointly form a candidate region containing one or more causal SNPs, or are jointly irrelevant. Combining information across multiple correlated SNPs in this manner can potentially increase the power for detecting causal SNPs while reducing false positives.
Although the information on where recombination events have occurred on the genome is not directly available to allow us to determine the exact locations of block boundaries of linked SNPs, we can exploit the available information on the linkage disequilibrium structure, such as recombination rates and distances between adjacent SNPs, to gain insights on how tightly nearby SNPs are correlated as well as where haplotype block boundaries occur. In a region with a high recombination rate, the SNPs previously linked in ancestor chromosomes are likely to be decoupled in descendant chromosomes, whereas a segment of tightly linked SNPs is preserved in the absence of recombination during inheritance. There is a large body of literature on estimating recombination rates given genotype data of unrelated individuals from a population (Fearnhead and Donnelly, 2001; Li and Stephens, 2003; Sohn and Xing, 2007), and any of these methods can be used to estimate recombination rates as part of a preprocessing step prior to the association analysis with our method. In addition, two SNPs that are far apart in distance on the chromosome are likely to be decoupled over time by a recombination event, and these distances between adjacent SNPs can be easily computed from the marker positions.
Assuming that SNPs that are tightly linked with each other are likely to jointly form a candidate region for causal SNPs, our goal is to incorporate the prior information on LD structure, such as recombination rates and distances, in a statistical model in order to probabilistically determine the boundaries of the haplotype blocks that contain SNPs significantly associated with the phenotype. In stochastic block lasso, we accomplish this by modeling dependencies across SNPs as a first-order Markov chain whose transition probabilities are defined in terms of recombination rates and distances between pairs of adjacent SNPs as we describe below.
Instead of modeling cj's as independent of each other as in stochastic lasso, stochastic block lasso assumes that the indicator variables cj's are ordered by their positions on the chromosome, and introduces first-order dependencies among them as a Markov chain as follows:
$$P(\mathbf{c}) = P(c_1) \prod_{j=2}^{J} P(c_j \mid c_{j-1}). \tag{6}$$
In this Markov chain, the probability that the jth SNP is relevant depends on whether the previous (j − 1)th SNP is relevant or not, and is independent of all of the SNPs before the (j − 1)th locus, given the assignment for cj−1 for the (j − 1)th SNP. As we move along the sequence of the J SNPs, the assignments for neighboring cj's are coupled because of the Markov dependency, and overall, the values of cj's form blocks of 0's or 1's. Although higher-order Markov chains could be used to model long-term dependencies beyond two adjacent SNPs more accurately, they would significantly increase the computation time for estimating the regression coefficients. In our simulation study, we found that the first-order Markov chain provided a reasonably good approximation, and significantly increased the power for detecting true causal SNPs.
We assume that the recombination rates ρ = (ρ2, …, ρJ) and distances d = (d2, …, dJ) are available as prior knowledge, where the index j for ρj and dj represents the interval between the (j − 1)th and jth SNPs. Given this prior information, we define P(cj∣cj−1) in Equation (6) such that the cj and cj−1 are encouraged to take the same values if the ρj and dj are small. Towards this goal, we introduce a Poisson process model, a commonly used model for a recombination process (Li and Stephens, 2003), as below:
$$P(c_j \mid c_{j-1}) = \exp(-d_j \rho_j)\, \delta(c_j, c_{j-1}) + \bigl(1 - \exp(-d_j \rho_j)\bigr)\, \Pi_{c_{j-1}, c_j}, \tag{7}$$
where δ(cj, cj−1) is 1 if cj = cj−1 and 0 otherwise, and Π is the transition probability matrix between the two states, parameterized by π0 and π1. The first term on the right-hand side of Equation (7) corresponds to the case of no recombination events between the (j − 1)th and jth SNPs, and thus, cj and cj−1 stay in the same block with cj = cj−1. The second term in Equation (7) models the presence of recombination events between the two SNPs, and, thus, a haplotype block boundary occurs between the (j − 1)th and jth SNPs. When the model transitions from one haplotype block to the next haplotype block, we can either stay in the same state of being in a candidate region (or irrelevant), or change the state according to the transition probabilities Π. As we can see in Equation (7), if the distance dj between the (j − 1)th and jth SNPs is small or the recombination rate ρj is low, the first term on the right-hand side of Equation (7) is likely to be activated with the high probability given as exp(−dj ρj), which will lead to both SNPs receiving the same assignments for cj−1 and cj. Thus, the cj−1 and cj are set to 1 (or 0) in a coupled manner. This idea in stochastic block lasso is illustrated in Figure 2 as a diagram with nodes representing variables and arrows indicating probabilistic dependencies. The observed quantities such as observed data (X and y) and prior knowledge (ρj's and dj's) are shown as shaded nodes. We place a prior Beta(a00, b00) on π0, and Beta(a10, b10) on π1.
FIG. 2.
An illustration of stochastic block lasso.
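The behavior of the Markov-chain prior can be sketched as follows: with probability exp(−dj ρj) the indicator is copied from the previous SNP, and otherwise a new block starts and the indicator is drawn from the corresponding row of Π. The row-stochastic layout Pi[a][b] = P(new state b given previous state a), the numerical values of Pi, and the units of d and ρ are all assumptions made for illustration.

```python
import math
import random


def transition_matrix(d_j, rho_j, Pi):
    """P(c_j = b | c_{j-1} = a) for a, b in {0, 1}, in the spirit of Equation (7).

    With probability exp(-d_j * rho_j) no recombination occurs and c_j copies
    c_{j-1}; otherwise a new block starts and c_j is drawn from row a of Pi.
    """
    stay = math.exp(-d_j * rho_j)
    return [[stay * (1.0 if a == b else 0.0) + (1.0 - stay) * Pi[a][b]
             for b in (0, 1)] for a in (0, 1)]


def sample_chain(c1, d, rho, Pi, rng):
    """Sample c_2, ..., c_J given c_1; d[k] and rho[k] describe the k-th interval."""
    c = [c1]
    for d_j, rho_j in zip(d, rho):
        probs = transition_matrix(d_j, rho_j, Pi)[c[-1]]
        c.append(0 if rng.random() < probs[0] else 1)
    return c


Pi = [[0.9, 0.1], [0.5, 0.5]]  # hypothetical block-to-block transition matrix
tight = transition_matrix(d_j=0.0, rho_j=0.0, Pi=Pi)   # no chance of recombination
loose = transition_matrix(d_j=10.0, rho_j=1.0, Pi=Pi)  # recombination almost certain
chain = sample_chain(1, d=[1.0] * 9, rho=[0.0] * 9, Pi=Pi, rng=random.Random(0))
```

When dj ρj is zero the transition reduces to the identity, so the sampled indicators form a single block; as dj ρj grows, the transition approaches Π and block boundaries become likely.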
2.3. Parameter estimation
Because of the non-differentiability of the function used in Equation (5) for βj when cj = 0, it is not possible to learn the parameters of stochastic block lasso using an EM-style algorithm commonly used for hidden Markov models. Instead, we use Gibbs sampling to sample from the posterior probability distribution p(Θ∣y, X) of the parameters Θ = {β, σ², Π} given the data, and summarize the results by computing the mean of the samples or selecting the sample with the highest posterior probability. In practice, we augment the set of unknown parameters with the hidden variables c, and sample from their joint posterior distribution p(Θ, c∣y, X), since the augmentation allows a convenient form for the conditional posterior of each parameter. In each iteration of the Gibbs sampling, we draw a sample for each parameter in turn from its conditional posterior distribution. It has been shown that after many such iterations, the Gibbs sampler converges and produces samples from the full joint posterior distribution (Gelman et al., 2004). We include the conditional posterior probabilities for the parameters and hidden variables for the Gibbs sampling and their derivations in the Appendix.
3. Results
3.1. Simulation study
We performed a simulation study to compare the performance of stochastic block lasso with those of different association methods including a single-SNP regression analysis and multivariate regression methods such as ridge regression, lasso, and stochastic lasso.
We simulated genotypes of 150 SNPs for 500 individuals based on chromosome 7 of the HapMap CEU population. In addition to the genotypes of the 60 parents in the original HapMap data, we generated data for additional individuals by randomly mating the haplotypes of the 60 parents, and then, removed SNPs with minor allele frequency less than 0.10. Given the genotypes, we set the true association strengths β for each of the 150 SNPs as follows. We randomly selected 5 SNPs within a region of 15 SNPs as causal SNPs, given that these 15 SNPs formed an LD block, and then set the βj's to 0.4 for the chosen causal SNPs and 0 for the other SNPs. Using the simulated genotypes and true association strengths, we simulated values for the phenotype using the linear relationship in Equation (1) with noise distributed as N(0, 1). Finally, we used the widely used software Phase 2.1.1 (Li and Stephens, 2003) to estimate recombination rates ρj's between pairs of adjacent SNPs.
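The phenotype-generation step can be sketched as below. The genotype matrix here is filled with random placeholder values purely so that the code runs; it does not reproduce the HapMap-based mating procedure or any LD structure, and the position of the 15-SNP block (block_start) is a hypothetical choice.

```python
import random


def simulate_phenotype(X, block_start, block_len=15, n_causal=5,
                       effect=0.4, noise_sd=1.0, seed=0):
    """Pick causal SNPs inside one block and simulate y = X beta + noise."""
    rng = random.Random(seed)
    n_snps = len(X[0])
    block = range(block_start, block_start + block_len)
    causal = sorted(rng.sample(block, n_causal))
    beta = [0.0] * n_snps
    for j in causal:
        beta[j] = effect
    y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, noise_sd)
         for row in X]
    return beta, causal, y


# Placeholder genotypes (assumption): independent random minor-allele counts.
geno_rng = random.Random(1)
X = [[geno_rng.choice((0, 1, 2)) for _ in range(150)] for _ in range(500)]
beta, causal, y = simulate_phenotype(X, block_start=60)
```

Only the five chosen SNPs inside the block receive a non-zero effect of 0.4, and the noise has unit variance, following the description above.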
For stochastic lasso and stochastic block lasso, which use the Gibbs sampling algorithms to estimate parameters, we ran the sampling algorithm for 4000 iterations after 2000 burn-in iterations, and used the samples from every 10 iterations to estimate the association strengths. In stochastic block lasso, we set the priors for π0 and π1 to Beta(5,5), which reflects the fact that we are a priori neutral as to where causal SNPs occur in the genome. Similarly, in stochastic lasso, we set the prior for the parameter p of the Benoulli distribution to Beta(5,5). For lasso and ridge regression that require an estimation of the regularization parameter, we fit each model on 450 individuals after witholding 50 individuals as a validation set, computed a squared error using the validation set given the estimated β, and then selected the regularization parameter that gives the lowest validation-set error. Once the regularization parameters were chosen, we used the entire dataset of 500 individuals to obtain the final estimate of association strengths.
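One reading of this sampling schedule is sketched below: discard the burn-in iterations, keep every 10th iteration of the remainder, and average the retained draws. The helper names are hypothetical, and the sampler that would produce the draws is not shown.

```python
def retained_iterations(n_burn_in=2000, n_sampling=4000, thin=10):
    """0-based indices of the Gibbs iterations kept after burn-in and thinning."""
    return list(range(n_burn_in + thin - 1, n_burn_in + n_sampling, thin))


def posterior_mean(samples):
    """Average a list of equal-length sample vectors coordinate by coordinate."""
    n = len(samples)
    return [sum(s[j] for s in samples) / n for j in range(len(samples[0]))]


kept = retained_iterations()

# Hypothetical retained draws of the indicators c for two SNPs.
c_draws = [[1, 0], [1, 1], [0, 0], [1, 0]]
inclusion_prob = posterior_mean(c_draws)
```

Averaging retained draws of cj in this way gives an estimate of the posterior probability that SNP j lies in a candidate region, and the same averaging applied to draws of βj gives a posterior-mean association strength.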
To illustrate how different association analysis methods behave, we fit stochastic block lasso and other methods on a single dataset simulated with βj = 0.6 as association strength, and show the results in Figure 3. In Figure 3A, the LD structure in the genotypes is shown as an image of r2's between pairs of SNPs. The true association strengths used in the simulation are shown in Figure 3B, where the true causal SNPs lie within an LD block. In the results from single-SNP regression analyses in Figure 3C, almost all of the non-causal SNPs that are only in LD with the true causal SNPs were found as significantly associated with the phenotype as true causal SNPs. Ridge regression in Figure 3D estimated non-zero associations for all of the SNPs, since it does not have the property of encouraging sparsity in the estimated regression coefficients and setting βj's for irrelevant SNPs to zero. On the other hand, lasso in Figure 3E set many of the regression coefficients for the irrelevant SNPs to zero, although we generally found that the lasso estimate tends to be overly sparse as we demonstrate later in this section. Bayesian methods such as stochastic lasso in Figure 3F and stochastic block lasso in Figure 3G are concerned with probability distributions over whether each SNP is relevant or not (cj's shown as the red line) as well as the corresponding association strengths (βj's shown as blue ×'s). Thus, although the results were generally sparse for these Bayesian methods, there was more flexibility in the level of sparsity than in lasso which sets the regression coefficients strictly to either zero or non-zero. Since the stochastic lasso treats each SNP as independent of the other SNPs, there was no information sharing among the correlated SNPs, and as a result, there were many false positives across the whole genome. 
In contrast, stochastic block lasso suppressed false positives among the correlated non-causal SNPs, and identified blocks of causal SNPs, where the block structure reflected the LD structure. Throughout our simulation experiments, we found that the groups of relevant SNPs as represented in cj's estimated by stochastic block lasso tended to extend to a slightly larger interval than the region of true causal SNPs. Thus, we can view cj's as suggesting candidate regions of relevant SNPs, and βj's as determining how significantly associated the SNPs with cj = 1 are.
FIG. 3.
Results of association analysis by different methods based on a single simulated dataset. (A) The LD structure in the genotypes. (B) True association strengths of the SNPs used in simulation. (C) -log(p-values), where p-values were obtained from a single-SNP regression analysis. Values of the estimated regression coefficients are shown as blue ×'s for (D) ridge-regression, (E) lasso, (F) stochastic lasso, and (G) stochastic block lasso. In (F) and (G), the red lines show estimated cj's that represent the SNPs estimated as candidate region for associations. For stochastic lasso and stochastic block lasso, the results are shown for a single sample with the lowest train error from the MCMC sampling algorithm.
The effect of the Markov-chain model on cj's in stochastic block lasso could be seen even more clearly in the posterior probability distribution of cj's, P(cj∣X, y), in Figure 4A and Figure 4B for stochastic lasso and stochastic block lasso, respectively. In Figure 4, the posterior distributions of cj's are summarized as the mean of the 4000 samples for cj's from the Gibbs sampling algorithm, and are shown as red lines. The locations of the true causal SNPs are marked as blue o's. Stochastic block lasso encouraged block structures among relevant SNPs as well as among irrelevant SNPs, leading to a smoother variation in P(cj∣X, y)'s along the genome, compared to the model with independent Bernoulli prior. In addition, for stochastic block lasso in Figure 4B, the probabilities were more highly peaked around the true causal SNPs, and were closer to zero for non-causal SNPs, compared to stochastic lasso in Figure 4A.
FIG. 4.
Comparison of stochastic lasso and stochastic block lasso based on a single simulated dataset. Probabilities of each SNP being relevant, P(cj)'s, estimated using the same simulated dataset as in Figure 3 are shown as red lines for (A) stochastic lasso and (B) stochastic block lasso. The o's indicate the locations of the true relevant SNPs.
In order to assess the performances of different association analysis methods quantitatively and systematically, we simulated 50 such datasets, and compared different association methods in terms of sensitivity and specificity averaged over the 50 datasets. The sensitivity and specificity measure whether the given method can successfully detect the true association SNPs with few false positives. The 1-specificity and sensitivity are equivalent to type I error rate and 1-type II error rate, and their plot is widely known as a receiver operating characteristic (ROC) curve. To obtain an ROC curve for a given association analysis method, we estimated association strengths from each simulated dataset, and ranked the SNPs according to the estimated association strengths. Then, we varied the threshold for determining which SNPs have significant associations from the top 1 SNP to top 150 SNPs in the ranked list of the SNPs, and for each threshold, recorded sensitivity and specificity among those SNPs above the threshold to obtain an ROC curve across all of the possible thresholds. We repeated this process for each of the 50 simulated datasets, and computed the average of the 50 ROC curves to evaluate the given association analysis method. In general, we are mainly concerned with the thresholds that include few SNPs at the top of the ranked list, since only very few SNPs are thought to influence the phenotype in a typical genetic association study. We used -log(p-values) as a measure of association strengths for a single-SNP analysis, the absolute values of βj's for ridge regression and lasso, and the absolute values of βj's averaged over 4000 samples from MCMC sampling iterations for the two Bayesian methods, stochastic lasso and stochastic block lasso. Below, we consider various scenarios that arise in a typical genetic association study, and compare the performances of different association methods under each of these scenarios.
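The ranking-and-thresholding procedure above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names `roc_points` and `average_roc` are hypothetical, ties are broken by SNP index, and the scores are assumed to be non-negative association strengths (for example |βj| or −log(p-value)).

```python
def roc_points(scores, causal, max_k=150):
    """Return (1 - specificity, sensitivity) for the top-k thresholds k = 1..max_k."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    n_pos = len(causal)                 # number of true causal SNPs
    n_neg = len(scores) - n_pos         # number of non-causal SNPs
    points, true_pos = [], 0
    for k, j in enumerate(order[:max_k], start=1):
        if j in causal:
            true_pos += 1
        false_pos = k - true_pos
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points

def average_roc(curves):
    """Average several ROC curves threshold by threshold (all of equal length)."""
    n = len(curves)
    return [(sum(c[k][0] for c in curves) / n, sum(c[k][1] for c in curves) / n)
            for k in range(len(curves[0]))]

# Toy example with 5 SNPs, of which SNPs 1 and 4 are causal.
points = roc_points([0.1, 0.9, 0.0, 0.5, 0.3], causal={1, 4})
```

Averaging at a fixed rank, rather than at a fixed false-positive rate, mirrors the description of computing one curve per dataset over the same list of thresholds.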
3.1.1. Varying signal-to-noise ratios
First, we varied the association strengths of the true relevant SNPs from 0.4 to 0.6, 0.8, and 1.0 to evaluate how signal-to-noise ratios affect the performance of different association analysis methods. For each value of the association strength, we set the association strengths for all of the causal SNPs to the same value, simulated 50 datasets of sample size N = 500, and performed an association analysis using the different methods. The ROC curves averaged over the 50 datasets are shown in Figure 5. We found that across the range of signal-to-noise ratios in Figure 5A–D, stochastic block lasso consistently showed greater power with lower false-positive rates than any other method. In general, lasso, which encourages sparsity through an L1 penalty, gave overly sparse estimates of βj's, and Bayesian methods such as stochastic lasso and stochastic block lasso, which assign a probability of being relevant to each SNP, outperformed lasso as well as the other methods.
FIG. 5.
ROC curves comparing the performance of association analysis methods when the association strength varies. Panels show results for association strengths (A) 0.4, (B) 0.6, (C) 0.8, and (D) 1.0. The number of SNPs with true associations was set to 5, and the sample size N = 500 with the LD level of the original HapMap genotypes was used. Results were averaged over 50 simulated datasets.
3.1.2. Varying sample sizes
We examined how the sample size affects the performance of association analysis methods by varying the sample size from 300 to 500 and 800. The association strengths for causal SNPs were fixed at 0.4. The ROC curves averaged over 50 simulated datasets are shown in Figure 6. For all of the sample sizes, we found that stochastic block lasso outperformed all of the other methods. This is because many of the SNPs were highly correlated in the genome, and stochastic block lasso was able to take advantage of the correlation structure in the genome to set the association strengths for SNPs in LD jointly to zero or non-zero. In particular, across all sample sizes, single-SNP regression analysis suffered from the effect of the LD structure, and found most of the SNPs in LD with the causal SNPs to be significantly associated with the phenotype, as we have seen in Figure 3C. Thus, as we increased the sample size, the performance of the single-SNP analysis improved very little compared to the other methods.
FIG. 6.

ROC curves comparing the performance of association analysis methods when the sample size N varies. Panels show results for sample sizes (A) 300, (B) 500, and (C) 800. The number of true causal SNPs was set to 5 with association strength 0.4, and the LD level of the original HapMap genotypes was used. Results were averaged over 50 simulated datasets.
3.1.3. Varying degrees of LD structure in genome
In order to investigate how the strength of LD in the genome influences the performance of different association analysis methods, we created datasets with different degrees of LD structure by taking every Kth SNP from the original HapMap data, where K = 1, 10, 50. As K increases, the correlations among the SNPs decrease, leading to a situation with high recombination rates between adjacent SNPs. The r2 matrix of the 150 SNPs in one of the 50 simulated datasets is shown in the top row of Figures 7A–C for each of the three K values. The block structure along the diagonal was much weaker for K = 10 in Figure 7B than for K = 1 in Figure 7A. When K = 50 in Figure 7C, there was very little LD among the SNPs, and the SNPs were almost entirely independent of each other. The ROC curves averaged over 50 simulated datasets with sample size 500 and association strength 0.8 are shown in the bottom row of Figure 7 for each K. When the correlations among the SNPs were relatively strong in Figure 7A, there was a clear advantage in taking into account the prior information on the LD structure through the Markov-chain prior in stochastic block lasso. As the correlation structure became weak with large K, this advantage disappeared, and in Figure 7C the performance of stochastic lasso and lasso, which do not assume any dependencies among SNPs, approached that of stochastic block lasso. However, in Figure 7C, we found that even when there was almost no LD, having the Markov-chain prior in stochastic block lasso did not negatively affect the performance, and stochastic block lasso remained competitive with the other association analysis methods.
FIG. 7.
ROC curves comparing the performance of association analysis methods when the strength of LD varies. Results were obtained for simulated data based on (A) the original HapMap data for a relatively high LD, (B) every 10th SNP taken from the HapMap data to simulate moderate LD, and (C) every 50th SNP taken from the HapMap data to simulate low LD. Results were averaged over 50 simulated datasets. The top row shows the LD structure in a single dataset of genotypes in each scenario for different LD levels.
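The thinning used to weaken LD, and the pairwise r2 values displayed in the top row of Figure 7, can be illustrated with a short sketch. It assumes each SNP is stored as a list of genotype values across individuals and that r2 is the squared Pearson correlation between two such lists; the function names are hypothetical.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0          # monomorphic SNP: correlation is undefined, reported as 0
    return cov * cov / (var_a * var_b)

def every_kth(snps, k):
    """Keep every k-th SNP; k = 1 leaves the data unchanged."""
    return snps[::k]

def ld_matrix(snps):
    """Matrix of pairwise r2 values between SNPs."""
    return [[r_squared(a, b) for b in snps] for a in snps]

# Toy data: 4 SNPs genotyped in 4 individuals.
snps = [[0, 1, 2, 1], [0, 1, 2, 1], [2, 1, 0, 1], [0, 0, 1, 1]]
thinned = every_kth(snps, 2)      # keeps SNPs 0 and 2
ld = ld_matrix(thinned)
```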
3.1.4. Varying the number of true association SNPs
We varied the number of true causal SNPs within an LD region of the genome, and examined how the performances of the association analysis methods were affected. Using the sample size 500 and association strength 0.6, we simulated datasets with 1, 5, or 8 causal SNPs. To generate datasets with 5 causal SNPs, we randomly selected a region of 15 SNPs, and then randomly selected 5 SNPs within the region. Similarly, for datasets with 8 causal SNPs, we randomly selected a region of 25 SNPs, and then 8 SNPs from that region. The ROC curves averaged over 50 simulated datasets are shown in Figure 8. As shown in Figure 8A, when the number of causal SNPs was 1, the two Bayesian methods performed similarly, and showed a higher power for detecting true causal SNPs than any of the other methods. We found that as the number of true causal SNPs with an additive effect increased to 5 (Figure 8B) and 8 (Figure 8C), the advantage of making use of the Markov-chain prior in stochastic block lasso increased, and stochastic block lasso significantly outperformed stochastic lasso. We also found that when multiple causal SNPs had an additive effect on the given phenotype, the gap in performance between single-SNP regression methods and the two Bayesian methods increased significantly in Figure 8.
FIG. 8.

ROC curves comparing the performance of association analysis methods when the number of relevant SNPs varies. Results are shown for the number of relevant SNPs (A) 1, (B) 5, and (C) 8. The association strength was set to 0.6, and the LD level of the original HapMap genotypes was used. Results were averaged over 50 simulated datasets.
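The way causal SNPs are placed within a region, and the additive phenotype they generate, can be sketched as below. Several details are assumptions made for illustration only: genotypes are coded as 0/1/2 and drawn at random here (the study used HapMap-based genotypes), the noise is Gaussian with standard deviation 1, and the function names are hypothetical.

```python
import random

def pick_causal_snps(n_snps, region_len, n_causal, rng):
    """Choose a random contiguous region, then n_causal distinct SNPs inside it."""
    start = rng.randrange(n_snps - region_len + 1)
    causal = sorted(rng.sample(range(start, start + region_len), n_causal))
    return start, causal

def simulate_phenotype(genotypes, causal, strength, rng, noise_sd=1.0):
    """Additive model: strength times the sum of causal genotypes, plus Gaussian noise."""
    return [strength * sum(row[j] for j in causal) + rng.gauss(0.0, noise_sd)
            for row in genotypes]

rng = random.Random(0)
# Placeholder genotypes for 500 individuals and 150 SNPs.
genotypes = [[rng.choice([0, 1, 2]) for _ in range(150)] for _ in range(500)]
start, causal = pick_causal_snps(n_snps=150, region_len=15, n_causal=5, rng=rng)
y = simulate_phenotype(genotypes, causal, strength=0.6, rng=rng)
```

Calling `pick_causal_snps` with `region_len=25` and `n_causal=8` gives the 8-SNP scenario.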
3.1.5. Computation time and scalability
Scalability of stochastic block lasso and other association analysis methods can be assessed from Figure 9, where we measure the computation time of the single-SNP analysis, lasso, stochastic lasso, and stochastic block lasso, using simulated datasets of 10^2, 10^3, 10^4, and 10^5 SNPs with sample size 500. The result for lasso represents the computation time for a single run of solving the optimization problem for a fixed regularization parameter, whereas the results for the two Bayesian models include the time taken for the 4000 Gibbs sampling iterations for obtaining samples for association strengths in addition to the 2000 burn-in iterations. Thus, although lasso seems to have a shorter computation time in Figure 9 than the two Bayesian methods, lasso needs to be run multiple times for different values of the regularization parameter to select the optimal λ through cross-validation, and in practice, may take significantly more time to obtain the final estimate of association strengths than shown in Figure 9. In addition, we found that there was very little increase in computation time when we introduced a Markov-chain prior as in stochastic block lasso, compared to stochastic lasso with a Bernoulli prior. Overall, stochastic block lasso did not significantly increase the computation time compared to other multivariate sparse regression methods, and was able to handle a dataset as large as 10^5 SNPs within a day. Since the computation time increases linearly with the number of SNPs, for a larger dataset, one could analyze all of the SNPs in a single model using a single computer, or parallelize the analysis by performing a sliding-window analysis for fewer SNPs at a time using multiple processors.
FIG. 9.
Comparison of the computation time for association analysis methods. Computation time is measured for the single-SNP analysis without a permutation test for correcting multiple hypotheses, lasso for a fixed regularization parameter, stochastic lasso, and stochastic block lasso. The x-axis and y-axis are in log scale.
3.2. Mouse data
We applied the stochastic block lasso to the inbred laboratory mouse haplotype map publicly available from the Broad Institute website (http://www.broad.mit.edu/mouse/hapmap). We used the measurements for the sodium intake of 25% NaCl concentration for female mice (Tordoff et al., 2007) as the phenotype. We used the 33 strains that were commonly available in both the genotype and phenotype datasets. We considered the 8217 SNPs in chromosome 4 of length 154 Mb as covariates. Our goal is to scan the chromosome and discover the SNPs with high genetic effects on the phenotype.
The most commonly used method for discovering SNPs highly associated with the phenotype is to perform a statistical test for the phenotype and one SNP at a time and report the SNPs with low p-values as significant. The −log(p-value)'s for the mouse data using a single-marker statistical test, the Wald test, are shown in Figure 10A. In Figures 10B–D, we show the estimated β using the stochastic block lasso, the model with independent Bernoulli prior, and the lasso, respectively. For these three regression models, we divided the whole sequence into segments of 200 SNPs, and fit the model to one segment at a time. The SNPs with high values of −log(p-value) in Figure 10A roughly corresponded to the SNPs with high βj values in Figure 10B.
FIG. 10.
Results for the mouse haplotype data (chromosome 4) and the measurements of drinking preference. (A) −log(p value). (B) Stochastic block lasso. (C) The model with independent Bernoulli prior. (D) Lasso.
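The segment-wise fitting used for the regression models can be illustrated by the index arithmetic alone. The sketch below assumes consecutive, non-overlapping segments with a shorter final segment; how the last, incomplete segment was actually handled is not stated, and the function name is hypothetical.

```python
def snp_segments(n_snps, size=200):
    """Half-open index ranges [start, end) covering SNPs 0..n_snps-1 in order."""
    return [(start, min(start + size, n_snps)) for start in range(0, n_snps, size)]

segments = snp_segments(8217)   # the 8217 SNPs on chromosome 4
# A model would then be fit separately to the genotype columns of each segment.
```

With 8217 SNPs this yields 41 full segments and one final segment of 17 SNPs.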
4. Discussion
In this article, we proposed stochastic block lasso for genetic association mapping, which uses a Markov-chain prior to encode prior information on the linkage disequilibrium structure within genotypes, such as distances and recombination rates between adjacent SNP markers. We demonstrated on the HapMap-simulated datasets that stochastic block lasso can increase the power for detecting true associations and reduce false positives by combining information across multiple correlated SNPs.
Although in stochastic block lasso we only considered local dependencies between two adjacent SNPs, this idea of using a Markov chain for modeling such dependencies can be extended to model long-range dependencies in the genome. Instead of a linear structure of the Markov chain, we can introduce links between any pair of SNPs when the two SNPs are located in different regions of the chromosome, such as two different genes, and interact with each other to influence the given phenotype. The general class of probabilistic graphical models that has a Markov chain as a special case can be used with an arbitrary graph structure on the indicator variables c.
5. Appendix
1. Gibbs sampling for the stochastic block lasso
We derive the conditional posterior probabilities to be used in the Gibbs sampling for the stochastic block lasso.
For each of the J SNPs, we sample βj and cj from their joint posterior distribution
p(βj, cj ∣ β−j, c−j, y, X, σ2) = p(βj ∣ β−j, c, y, X, σ2) p(cj ∣ β−j, c−j, y, X, σ2).   (8)
We first sample cj from the marginal distribution, the second term on the right-hand side of Equation (8), after integrating out βj. Conditional on the sampled cj, we sample βj from its conditional posterior, the first term on the right-hand side of Equation (8).
In order to sample cj, we re-write the second term on the right-hand side of Equation (8) as
![]() |
Sampling from the above equation requires computing the marginal likelihood p(y∣β−j, cj, X, σ2) after integrating out βj when cj = 0 and cj = 1. When cj = 0, the jth covariate is irrelevant, and we set βj = 0. Thus, the marginal likelihood is simply given as
![]() |
When cj = 1, we compute the integral as below.
![]() |
where zi = yi − Σk≠j xikβk, and
.
Let A(−) denote the first integral and A(+) the second integral in Equation (9). Then, using straightforward algebra, it can be shown that A(−) and A(+) are given as
![]() |
where
![]() |
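The closed forms can be checked by completing the square in βj. The following is a sketch under two assumptions that are not restated in this appendix: Gaussian noise with variance σ2, and a Laplacian prior parameterized as p(βj ∣ cj = 1) = (λ/2) exp(−λ|βj|). The shorthand s_j, r_j, μ_± and L_0 is introduced here for illustration and may differ from the notation of the original equations.

```latex
\begin{align*}
z_i &= y_i - \sum_{k \neq j} x_{ik}\beta_k, \qquad
s_j = \sum_{i=1}^{N} x_{ij}^2, \qquad
r_j = \sum_{i=1}^{N} x_{ij} z_i, \\
L_0 &= p(y \mid \beta_{-j}, c_j = 0, X, \sigma^2)
     = (2\pi\sigma^2)^{-N/2}
       \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^{N} z_i^2\Big), \\
\mu_{\pm} &= \frac{r_j \mp \lambda\sigma^2}{s_j}, \\
A^{(\pm)} &= L_0\,\frac{\lambda}{2}\,\sqrt{\frac{2\pi\sigma^2}{s_j}}\,
   \exp\Big(\frac{s_j\,\mu_{\pm}^2}{2\sigma^2}\Big)\,
   \Phi\Big(\pm\frac{\mu_{\pm}\sqrt{s_j}}{\sigma}\Big).
\end{align*}
```

Here Φ is the standard normal distribution function, A(−) is the integral over βj < 0, and A(+) is the integral over βj ≥ 0. Under these assumptions the marginal likelihood for cj = 1 is A(−) + A(+), and each half is an unnormalized normal density with mean μ_± and variance σ2/s_j truncated at zero.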
Once we sample cj as described above, we sample βj from p(βj∣β−j, c, y, X, σ2) in Equation (8). When cj = 0, we set βj to 0. If cj = 1, we re-write the conditional probability distribution as
![]() |
We find that the denominator of Equation (10) is the same as what we computed in Equation (9). In fact, sampling from Equation (10) is equivalent to sampling from a mixture distribution of two components given as
![]() |
Using Equation (11), we augment βj with mj, and sample (βj, mj) by first drawing the mixture component label mj from the Bernoulli distribution and then drawing βj conditional on mj.
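A single update of (cj, βj) along these lines can be sketched in a few functions. This is an illustration under stated assumptions rather than the implementation used in the experiments: Gaussian noise with variance σ2, a Laplacian prior (λ/2) exp(−λ|βj|), and the prior odds P(cj = 1)/P(cj = 0) implied by the Markov chain passed in as a number. All function names are hypothetical, and `x` is assumed not to be all zeros.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def log_norm_cdf(t):
    """Log of the standard normal CDF, with a tail approximation for very negative t."""
    if t > -30.0:
        return math.log(0.5 * math.erfc(-t / math.sqrt(2.0)))
    # Leading term of the asymptotic expansion of the lower tail.
    return -0.5 * t * t - math.log(-t) - 0.5 * math.log(2.0 * math.pi)

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def sample_positive_part(mu, tau, rng):
    """Draw from N(mu, tau^2) restricted to x >= 0 by inverse-CDF sampling.
    Adequate while mu / tau stays above roughly -35; beyond that the draw is clamped to 0."""
    tail = STD_NORMAL.cdf(mu / tau)              # P(mu + tau * Z >= 0)
    p = max(rng.random() * tail, 1e-300)         # keep the argument of inv_cdf inside (0, 1)
    return max(mu - tau * STD_NORMAL.inv_cdf(p), 0.0)

def gibbs_update_snp(x, z, sigma2, lam, prior_odds, rng):
    """One draw of (c_j, beta_j) given the partial residual z = y - sum_{k != j} x_k beta_k."""
    s = sum(v * v for v in x)
    r = sum(a * b for a, b in zip(x, z))
    tau = math.sqrt(sigma2 / s)                  # sd of each truncated-normal component
    mu_neg = (r + lam * sigma2) / s              # component mean on beta_j < 0
    mu_pos = (r - lam * sigma2) / s              # component mean on beta_j >= 0
    # Logs of A(-) and A(+) up to a factor shared with the c_j = 1 marginal likelihood.
    log_a_neg = 0.5 * (mu_neg / tau) ** 2 + log_norm_cdf(-mu_neg / tau)
    log_a_pos = 0.5 * (mu_pos / tau) ** 2 + log_norm_cdf(mu_pos / tau)
    top = max(log_a_neg, log_a_pos)
    log_sum = top + math.log(math.exp(log_a_neg - top) + math.exp(log_a_pos - top))
    # Log ratio of the marginal likelihoods for c_j = 1 versus c_j = 0.
    log_ratio = (math.log(lam / 2.0) + 0.5 * math.log(2.0 * math.pi)
                 + math.log(tau) + log_sum)
    if rng.random() >= sigmoid(math.log(prior_odds) + log_ratio):
        return 0, 0.0                            # c_j = 0 forces beta_j = 0
    # Mixture label m_j: positive component with probability A(+) / (A(-) + A(+)).
    if rng.random() < sigmoid(log_a_pos - log_a_neg):
        return 1, sample_positive_part(mu_pos, tau, rng)
    return 1, -sample_positive_part(-mu_neg, tau, rng)

rng = random.Random(0)
x = [rng.choice([0, 1, 2]) for _ in range(200)]          # placeholder genotypes
z = [0.8 * xi + rng.gauss(0.0, 1.0) for xi in x]         # residual with a true effect of 0.8
c_j, beta_j = gibbs_update_snp(x, z, sigma2=1.0, lam=1.0, prior_odds=1.0, rng=rng)
```

Working with logarithms avoids overflow in the exponential factors when a SNP is strongly associated with the residual.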
The conditional posterior for σ2 is given as an inverse gamma distribution
. Next, we sample the parameters π0 and π1 of the transition matrix Π. The conditional posterior for π0 is
![]() |
where Sml = {k∣ck−1 = m, ck = l} and nml is the number of transitions from cj−1 = m to cj = l in c. We approximate n00 as
![]() |
where
is the value from the previous sampling iteration, assuming that the number of events (cj−1 = 0, cj = 0) due to
and
are proportional to their probabilities. In Equation (12), the conditional posterior for π0 is Beta(n00 + a00, n01 + b00). Similarly, we obtain the conditional posterior for π1 as Beta(n11 + a10, n10 + b10).
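The Beta updates for π0 and π1 can be sketched as follows. Note the simplification: this illustration uses raw transition counts from the current c, not the approximation of n00 described above, and it takes Beta(5,5) for both priors as in the simulation study. The function names are hypothetical.

```python
import random

def transition_counts(c):
    """Count transitions (c[j-1], c[j]) along the sequence of indicators."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, cur in zip(c, c[1:]):
        counts[(prev, cur)] += 1
    return counts

def sample_transition_params(c, rng, a=5.0, b=5.0):
    """Draw pi0 ~ Beta(n00 + a, n01 + b) and pi1 ~ Beta(n11 + a, n10 + b)."""
    n = transition_counts(c)
    pi0 = rng.betavariate(n[(0, 0)] + a, n[(0, 1)] + b)
    pi1 = rng.betavariate(n[(1, 1)] + a, n[(1, 0)] + b)
    return pi0, pi1

rng = random.Random(0)
c = [0, 0, 1, 1, 0]
pi0, pi1 = sample_transition_params(c, rng)
```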
Finally, we sample λ of the Laplacian prior from its conditional posterior
![]() |
where J′ is the number of SNPs with cj = 1 in the current sampling iteration, and
is the set of such SNPs.
Acknowledgment
E.P.X. was supported by grants ONR N000140910758, NSF DBI-0640543, NSF CCF-0523757, NIH IR01GM087694, and an Alfred P. Sloan Research Fellowship.
Disclosure Statement
No competing financial interests exist.
References
- Browning S. 2006. Multilocus association mapping using variable-length Markov chains. Am. J. Hum. Genet. 78, 903–913
- Cheung V., Spielman R., Ewens K., et al. 2005. Mapping determinants of human gene expression by regional and genome-wide association. Nature 437, 1365–1369
- Fearnhead P., and Donnelly P. 2001. Estimating recombination rates from population genetic data. Genetics 159, 1299–1318
- Gelman A., Carlin J., Stern H., et al. 2004. Bayesian Data Analysis. Chapman & Hall/CRC, New York
- George I., and McCulloch R. 1993. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 88, 881–889
- Ishwaran H., and Rao J. 2005. Spike and slab variable selection: frequentist and Bayesian strategies. Ann. Stat. 33, 730–773
- Li N., and Stephens M. 2003. Modelling linkage disequilibrium, and identifying recombination hotspots using SNP data. Genetics 165, 2213–2233
- Li Y., Sung W., and Liu J. 2007. Association mapping via regularized regression analysis of single-nucleotide-polymorphism haplotypes in variable-sized sliding windows. Am. J. Hum. Genet. 80, 705–715
- Malo N., Libiger O., and Schork N. 2008. Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am. J. Hum. Genet. 82, 375–385
- Park T., and Casella G. 2008. The Bayesian lasso. J. Am. Stat. Assoc. 103, 681–686
- Servin B., and Stephens M. 2007. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 3, 1296–1308
- Shi W., Lee K., and Wahba G. 2007. Detecting disease causing genes by lasso-patternsearch algorithm [Technical report 1140]. Department of Statistics, University of Wisconsin, Madison, WI
- Sohn K., and Xing E. 2007. Spectrum: joint bayesian inference of population structure and recombination event. Proc. 15th Int. Conf. Intell. Syst. Mol. Biol
- Stranger B., Forrest M., Clark A., et al. 2005. Genome-wide associations of gene expression variation in humans. PLoS Genet. 1, 695–704
- The International HapMap Consortium 2005. A haplotype map of the human genome. Nature 437, 1299–1320
- Tibshirani R. 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Stat. Methodol. 58, 267–288
- Tibshirani R., Saunders M., Rosset S., et al. 2005. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Series B Stat. Methodol. 67, 91–108
- Tordoff M., Bachmanov A., and Reed D. 2007. Forty mouse strain survey of water and sodium intake. Physiol. Behav. 91, 620–631
- Wu T., Chen Y., Hastie T., et al. 2009. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 25, 714–721
- Yi N., and Xu S. 2008. Bayesian lasso for quantitative trait loci mapping. Genetics 179, 1045–1055
- Yuan M., and Lin Y. 2005. Efficient empirical Bayes variable selection and estimation in linear models. J. Am. Stat. Assoc. 100, 1215–1225
- Yuan M., and Lin Y. 2006. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Series B Stat. Methodol. 68, 49–67
- Zaitlen N., Kang K., Eskin E., et al. 2006. Leveraging the hapmap correlation structure in association studies. Am. J. Hum. Genet. 80, 683–691
- Zhang K., Calabrese P., Nordborg M., et al. 2002. Haplotype block structure and its applications to association studies: power and study design. Am. J. Hum. Genet. 71, 1386–1394
- Zhao P., and Yu B. 2006. On model selection consistency of lasso. J. Mach. Learn. Res. 7, 2541–2563


























