Abstract
The joint action of multiple genes is an important source of variation for complex traits and human diseases. However, mapping genes with epistatic effects and gene–environment interactions is a difficult problem because of relatively small sample sizes and very large parameter spaces for quantitative trait locus models that include such interactions. Here we present a nonparametric Bayesian method to map multiple quantitative trait loci (QTL) by considering epistatic and gene–environment interactions. The proposed method is not restricted to pairwise interactions among genes, as is typically done in parametric QTL analysis. Rather than modeling each main and interaction term explicitly, our nonparametric Bayesian method measures the importance of each QTL, irrespective of whether it is mostly due to a main effect or due to some interaction effect(s), via an unspecified function of the genotypes at all candidate QTL. A Gaussian process prior is assigned to this unknown function. In addition to the candidate QTL, nongenetic factors and covariates, such as age, gender, and environmental conditions, can also be included in the unspecified function. The importance of each genetic factor (QTL) and each nongenetic factor/covariate included in the function is estimated by a single hyperparameter, which enters the covariance function and captures any main or interaction effect associated with a given factor/covariate. An initial evaluation of the performance of the proposed method is obtained via analysis of simulated and real data.
TRAITS showing continuous variation are called quantitative traits and are typically controlled by multiple genetic and nongenetic factors, which tend to have relatively small effects individually. Crosses between inbred lines produce suitable populations for quantitative trait locus (QTL) mapping and are available for agricultural plants and for animal (e.g., mouse) models of human diseases. Such crosses are often used to detect QTL. For these inbred line crosses, uniform genetic backgrounds, controlled breeding schemes, and controlled environment ensure that there is little or no confounding of uncontrolled sources of variability with genetic effects. The potential for such confounding complicates and limits the analysis and interpretation of human data. Because of the homology between humans and rodents, rodent models can be extremely useful in advancing our understanding of certain human diseases. In the past 2 decades, various statistical approaches have been developed to identify QTL in inbred line crosses (see, for example, Doerge et al. 1997 for review). To perform QTL mapping (identification), a large number of candidate positions (candidate QTL) along the genome are selected. These candidate QTL may all be located at genetic markers (positions of sequence variants in the genome where the genotypes of all individuals in a mapping population can be measured) or also in between markers if the marker density is not high. QTL mapping may then be performed by considering one candidate QTL at a time or multiple candidate QTL simultaneously. For inbred line crosses with low marker density and considering a single candidate QTL at a time, the interval-mapping method was proposed by Lander and Botstein (1989). However, these authors showed that interval mapping tends to identify a “ghost” QTL located in between two actual linked QTL if two or more closely linked QTL exist. 
This problem can be reduced or eliminated in two ways: (1) by using composite-interval mapping (Jansen and Stam 1994; Zeng 1994) which still performs a one-dimensional QTL search but conditional on the genotypes at a pair of markers flanking the marker interval containing the current QTL, to absorb the effects of background (nontarget QTL) outside of the target interval; or (2) by performing multiple QTL mapping, where two or more QTL are mapped simultaneously. Furthermore, if several QTL affect a quantitative trait mostly through their interactions (epistasis) while having nonexistent or weak main effects, then interval mapping or single-marker analysis will fail to detect such QTL. QTL interactions may not be limited to pairwise interactions. Marchini et al. (2005) have shown by simulation that searching for three loci jointly in the presence of a three-way interaction is more powerful than searching for a single or a pair of QTL. There are various different implementations of multiple QTL mapping. Most methods still perform only pairwise searches, with and without epistasis. The most recent methods are based on Bayesian variable selection and consider a group of candidate QTL or all candidate QTL in the genome simultaneously (e.g., Yi et al. 2007). These methods are typically still limited to pairwise interactions among QTL and do not consider gene–environment interactions.
The identification of QTL can be viewed as a very large variable selection problem: for p candidate QTL, with p typically in the hundreds or thousands and sample size in the low hundreds, there are $2^p$ possible main-effect models, $\binom{p}{2}$ possible two-way interactions, and $\binom{p}{k}$ possible higher-order (k > 2) interactions. For inbred line crosses, where multiple-QTL mapping models can be represented as multiple linear regression models, classical variable selection methods such as forward and stepwise selection (Broman and Speed 2002) have been used in searching for main and two-way interaction effects. Bayesian analysis implemented by Markov chain Monte Carlo (MCMC) and based on the composite model space framework (Godsill 2001, 2003) has been introduced to genetic mapping (Yi 2004). Well-known Bayesian variable selection methods such as reversible jump MCMC (Green 1995) and stochastic search variable selection (SSVS) (George and McCulloch 1993) are special cases. SSVS and similar methods employ mixture priors for the regression coefficients, which specify different distributions for the coefficients under the null (effect negligible) and alternative (effect nonnegligible) hypotheses. The marginal posterior probabilities of the alternative hypotheses can be used to identify a subset of important parameters on the basis of Bayesian multiple comparison rules, including the median probability model (with a threshold of 0.5) and Bayesian false discovery rate control (e.g., Müller et al. 2006).
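To give a sense of scale, the counts above can be evaluated directly. The following is a small illustrative sketch; the function name `model_space_sizes` and the choice p = 151 (the number of markers in the simulations reported later) are ours, not part of the method.

```python
from math import comb

def model_space_sizes(p, k):
    """Return (main-effect models, two-way interactions, k-way interactions)
    for p candidate QTL."""
    return 2 ** p, comb(p, 2), comb(p, k)

# p = 151 markers, three-way interactions as the higher-order example.
n_models, n_pairs, n_triples = model_space_sizes(p=151, k=3)
print(n_pairs, n_triples)  # 11325 562475
```

Even at this modest marker number, the number of three-way interaction terms exceeds a sample size in the low hundreds by several orders of magnitude.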
An alternative to variable selection with mixture priors is classical and Bayesian shrinkage- or penalty-based inference. For the classical approach of penalized regression, while an L2-based shrinkage method (ridge regression) cannot perform variable selection, other methods, in particular the L1-based lasso of Tibshirani (1996) and later lasso extensions, are capable of performing variable selection by reducing the effects of unimportant variables effectively to zero. The lasso has been applied to parametric, regression-based QTL mapping (Yi and Xu 2008). The penalized regression methods can be interpreted as Bayesian regression models with particular sparsity priors imposed on the regression coefficients (Park and Casella 2008).
Regression methods are also used for association mapping in human populations. Recently, Kwee et al. (2008) proposed a semiparametric regression-based approach for candidate regions in human association mapping, where a quantitative trait is regressed on a nonparametric function of the tagSNP genotypes within a region. They analyzed a (small) subset of the genome and tested for the joint significance of the subset. Their method potentially can be used to model interactions among SNPs and covariates. However, Kwee et al. (2008) fit their model using least-squares kernel machines, a dimension-reducing technique that is identical to an analysis based on a specific linear mixed model. Model selection for different types of kernels and different sets of variables is performed using criteria such as Akaike's information criteria (Akaike 1974) and Bayesian information criteria (Schwarz 1978), which may not be appropriate or feasible in large-scale, sparse variable selection situations.
We (Huang et al. 2010) recently developed a Bayesian semiparametric QTL mapping method, where nongenetic covariate effects are modeled nonparametrically. This method was implemented via MCMC, and a Gaussian process prior (O'Hagan 1978; Neal 1996, 1997) was placed on the unknown covariate function. The Gaussian process is particularly well suited for curve estimation due to its flexible sample path shapes. This method allows one or more nongenetic covariates to have an arbitrary (nonlinear) relationship with the phenotype. Another strong advantage of the Gaussian process is its ability to deal with high-dimensional data compared to other nonparametric techniques such as spline regression (Wahba 1984; Heckman 1986; Chen 1988; Speckman 1988; Cuzick 1992; Hastie and Loader 1993). There has been a growing interest in using Gaussian processes as a unifying framework for studying multivariate regression (Rasmussen 1996), pattern classification (Williams and Barber 1998), and hierarchical modeling (Menzefricke 2000). In this article, we build on this work and propose a nonparametric Bayesian method for multiple QTL mapping by including not only nongenetic covariates but also all candidate QTL in the unknown function. A Gaussian process prior (GPP) is again placed on the unknown function, and a variable selection approach is implemented for the hyperparameters of the GPP (one for each QTL and nongenetic covariate). Here, we rely on mixture priors and MCMC implementation, and we focus on linkage mapping in inbred line crosses, while in ongoing and future work we are considering shrinkage priors, deterministic algorithms, and association mapping. Our application of the GPP differs from “standard” applications in that the QTL covariates included in the unknown function are discrete, not continuous, with a small number (two or three) of possible values (the genotype codes). 
The goal of using a GPP here is not curve or response surface modeling but rather high-dimensional variable selection (QTL and nongenetic covariates) with a method requiring only a single parameter for each variable while accounting for any multiway interactions among the candidate variables.
To improve current methods for linkage mapping in inbred line crosses and for association analysis of human populations, we need to be able to detect QTL irrespective of whether they act mostly through main effects, interactions with other QTL, or interactions with environment. Fitting a parametric model including all these potential effects for a genome-wide search would substantially increase the multiple-testing problem, in addition to being computationally extremely demanding. Here we offer an alternative. We show that our nonparametric Bayesian method can identify QTL irrespective of whether they act through main effects, through interactions with other QTL, or with environmental factors. This method cannot identify the source(s) of a QTL's importance (main or interaction effects involving this QTL). Therefore, once a small number of important QTL have been identified in a genome-wide scan, then these QTL can be further analyzed with detailed parametric models to determine the source(s) of their importance.
The remainder of the article is organized as follows. The next section presents the nonparametric multiple-QTL model and outlines the MCMC sampler. The section after that presents simulation results and the analysis of a real data set. We end the article with a discussion and conclusions.
NONPARAMETRIC REGRESSION WITH GAUSSIAN PROCESS
Model and prior:
For the ith individual, we observe (i) the genotype codes $x_i = (x_{i1}, \ldots, x_{ip})$ at p markers, where $x_{ik}$ is the genotype code at the kth marker; (ii) q nongenetic covariates or factors $t_i = (t_{i1}, \ldots, t_{iq})$, where $t_{ij}$ is the value of nongenetic covariate j; and (iii) the phenotype or trait value $y_i$. The primary goal of the analysis is to map QTL (also loosely referred to as “genes”) associated with the phenotype. Assuming for simplicity of presentation that the set of candidate QTL is the set of markers, the problem reduces to identifying which markers influence the phenotype through their genotypes. We considered the following semiparametric QTL mapping model in Huang et al. (2010),
$$y_i = \sum_{k=1}^{p} x_{ik}\beta_k + \eta(t_{i1}, \ldots, t_{iq}) + e_i, \tag{1}$$
where η is an unknown function of nongenetic covariates that we modeled via a Gaussian process prior; $\beta_k$ is the partial regression coefficient associated with the kth marker; and $e_i$ is a random error with distribution $N(0, \sigma_e^2)$. Model (1) considers only main QTL effects; it can be extended to pairwise interactions by including the terms $x_{ik}x_{ik'}$, $k < k'$, in (1), and similarly to higher-order interactions. The explicit modeling of interactions among genes causes an increase in the number of parameters in (1) which is exponential in the order of the interactions considered. The consequences are computational difficulties and poor inferences due to small sample sizes. To overcome this problem, we can alternatively consider the following model:
$$y_i = \eta(x_{i1}, \ldots, x_{ip}, t_{i1}, \ldots, t_{iq}) + e_i. \tag{2}$$
This model is flexible and considers all interactions among genes and gene–environment interactions nonexplicitly. For example, if the unknown function is $\eta(x_{i1}, \ldots, x_{ip}, t_{i1}, \ldots, t_{iq}) = x_{i1}x_{i2} + x_{i3}t_{i1}$, then Equation 2 nonexplicitly models the two-way interaction between genes 1 and 2 and the gene–environment interaction between gene 3 and environmental covariate 1. Let $\eta_i = \eta(x_{i1}, \ldots, x_{ip}, t_{i1}, \ldots, t_{iq})$ and define $\eta = (\eta_1, \ldots, \eta_n)'$.
To estimate η, we again assume that η has a Gaussian process prior (as in Huang et al. 2010) with mean 0 and with a covariance matrix Σ whose element ii′ (i ≠ i′) associated with individuals i and i′ is
$$\Sigma_{ii'} = \xi^2 \exp\left\{-\sum_{k=1}^{p} \rho_{xk}^2 (x_{ik} - x_{i'k})^2 - \sum_{j=1}^{q} \rho_{tj}^2 (t_{ij} - t_{i'j})^2\right\}, \tag{3}$$
where ξ, the $\rho_{xk}$'s, and the $\rho_{tj}$'s are hyperparameters. This is the most commonly employed stationary covariance function for a Gaussian process (a detailed presentation of Gaussian processes with many valid covariance functions is in Abrahamsen 1997; see also MacKay 1998). Hyperparameter ξ defines the vertical scale of variations, i.e., controls the magnitude of the exponential part. Hyperparameters $\rho_{xk}$ and $\rho_{tj}$ are related to length scales that characterize the distance in that particular direction over which y is expected to vary significantly. When, for example, $\rho_{xk} = 0$, then η is expected to be essentially a constant function of variable (gene) $x_k$, which is therefore deemed irrelevant (MacKay 1998). When $\rho_{xk}$ is large, then the resulting function has a short characteristic length and will vary rapidly along the corresponding axis of $x_k$, indicating that variable $x_k$ is of high importance. Similarly, $\rho_{tj}$ indicates the importance of nongenetic covariate j in combination with the genetic factors and other nongenetic covariates.
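The role of the length-scale hyperparameters can be illustrated with a minimal sketch of the covariance computation. It assumes the squared-exponential form $\xi^2 \exp\{-\sum_k \rho_{xk}^2 (x_{ik} - x_{i'k})^2 - \sum_j \rho_{tj}^2 (t_{ij} - t_{i'j})^2\}$ and, as a further assumption, uses the same expression (that is, $\xi^2$) on the diagonal. The function name `gp_covariance` and the toy data are hypothetical.

```python
import math

def gp_covariance(X, T, xi, rho_x, rho_t):
    """Covariance matrix of eta for n individuals.

    X: n rows of p genotype codes; T: n rows of q covariate values.
    Assumed form: xi^2 * exp(-sum_k rho_x[k]^2 (x_ik - x_i'k)^2
                             -sum_j rho_t[j]^2 (t_ij - t_i'j)^2).
    """
    n = len(X)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            d = sum((r * (a - b)) ** 2 for r, a, b in zip(rho_x, X[i], X[j]))
            d += sum((r * (a - b)) ** 2 for r, a, b in zip(rho_t, T[i], T[j]))
            S[i][j] = S[j][i] = xi ** 2 * math.exp(-d)
    return S

# Three backcross individuals, two markers coded 1 / -1, one covariate.
X = [[1, 1], [1, -1], [-1, -1]]
T = [[0.2], [0.2], [0.2]]
# rho_x[0] = 0 makes marker 1 irrelevant; marker 2 is relevant.
S = gp_covariance(X, T, xi=1.0, rho_x=[0.0, 0.5], rho_t=[1.0])
```

In this toy case, individuals 2 and 3 differ only at the irrelevant marker 1, so their covariance equals $\xi^2 = 1$, whereas individuals 1 and 2 differ at marker 2 and their covariance drops to $e^{-1}$.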
The original articles on the Gaussian process (Neal 1997; MacKay 1998) did not view this method as an approach for variable selection and imposed an inverse Gamma prior on the $\rho^2$ parameters. Though $\rho_{xk}$ does provide information about the relevance of any QTL k, with values near zero indicating an irrelevant QTL [similar to the parameters $\beta_k$ in the parametric linear QTL model (1)], determining which $\rho_{xk}$'s are significantly nonzero is challenging. It is convenient to represent the hyperparameters in terms of their reciprocals, defined to be $\tau_{xk} = 1/\rho_{xk}^2$ and $\tau_{tj} = 1/\rho_{tj}^2$. We perform Bayesian variable selection by imposing Gamma mixture priors on the parameters $\tau_{xk}$ and $\tau_{tj}$. We introduce the latent variables $\gamma_{xk}$ ($\gamma_{xk}$ = 0 or 1) and $\gamma_{tj}$ ($\gamma_{tj}$ = 0 or 1). Then the Gamma mixture priors for the QTL associated parameters are represented as
(4) |
Here $\mathrm{Be}(p \mid a, b)$ represents the Beta density $p^{a-1}(1 - p)^{b-1}/B(a, b)$, $\mathrm{Ga}(\tau \mid a, b)$ represents the Gamma density $\tau^{a-1}\exp(-b\tau)\,b^{a}/\Gamma(a)$, with E(τ) = μ and , and ($\alpha_{x0}$, $\mu_{x0}$, $\alpha_{x1}$, $\mu_{x1}$, $a_{x\gamma}$, $b_{x\gamma}$) are hyperparameters to be specified or inferred. Similarly, for the nongenetic covariate associated parameters, we assume the mixture priors
(5) |
Note that here μx0, μx1, μt0, and μt1 are the means of the two Gamma distributions in (4) and in (5), respectively. Setting μx0 (as well as μt0) to a large value ensures that if γxk = 0, then ρxk will take on very small values, and thus the corresponding variable is irrelevant. In contrast, setting μx1 (as well as μt1) to a smaller value ensures that if γxk = 1, then a nonzero value of ρxk will be included in the model.
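Define $\tau_\xi = 1/\xi^2$ and $\tau_e = 1/\sigma_e^2$, where $\sigma_e^2$ is the error variance, and let the prior distributions of the two parameters be Gamma and given by
$$\tau_\xi \sim \mathrm{Ga}(\tau_\xi \mid \alpha_\xi, \mu_\xi), \tag{6}$$
$$\tau_e \sim \mathrm{Ga}(\tau_e \mid \alpha_e, \mu_e). \tag{7}$$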
Values for the parameters (αξ, μξ, αe, μe) are chosen prior to analysis.
The Gaussian process was originally proposed for modeling curves with continuous covariates, where the smoothness assumption on Gaussian process guarantees the smoothness of the estimated curves. However, in QTL mapping and other similar genetics analysis (Kwee et al. 2008), the primary goal is to map genes. The violation of the continuity assumption may be highly influential on the QTL effect estimation, but since the QTL effect estimation is only a secondary task in QTL mapping, the discreteness of genetic variables is less of a concern. As one extreme example, when only one gene is included, the Gaussian process model is equivalent to the random effect model where the genetic effect is treated as random with a normal distribution.
MCMC algorithm for posterior computation:
Inference is based on the joint posterior distribution of the unknown parameters (τx1, …, τxp, τt1, …, τtq, τξ, τe) and the unknown function vector η, conditional on the phenotype (y), covariate (t), and marker (x) data. One potential problem in working with this joint posterior arises due to the discrete nature of the marker data: When the number of significant markers (i.e., markers with distinctly nonzero ρxk) is small, then the covariance matrix of η, Σ, may become (nearly) singular, because multiple individuals will share the same genotype configuration at these few markers. In this case the performance of the method deteriorates, and we therefore prefer to work with the joint posterior of the unknown parameters, or the joint posterior marginalized with respect to η. Because of the normal prior on η, this marginalization is equivalent to substituting the likelihood function of y conditional on η by the unconditional likelihood of y, or
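$$y \mid \tau_{x1}, \ldots, \tau_{xp}, \tau_{t1}, \ldots, \tau_{tq}, \tau_\xi, \tau_e \sim N(0, \Sigma_y), \qquad \Sigma_y = \Sigma + \sigma_e^2 I_n, \tag{8}$$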
where Σy is nonsingular even if Σ is singular. We compared inferences based on the joint posterior of the unknown parameters and η vs. the joint posterior of the parameters (using the same simulated data as presented below) and found the latter to provide clearly superior results. Therefore, from here on we consider only the marginalized posterior.
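The marginalized likelihood can be evaluated with a Cholesky factorization of the covariance of y. The following is a minimal sketch, assuming a zero-mean model so that y is normal with covariance equal to Σ plus the error variance on the diagonal. The function names and the two-individual example are hypothetical, and the plain-Python Cholesky routine is for illustration only; a practical implementation would use a numerical linear algebra library.

```python
import math

def cholesky(A):
    """Lower-triangular L with L L^T = A for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def marginal_loglik(y, Sigma, sigma2_e):
    """log N(y | 0, Sigma + sigma2_e * I), with eta integrated out."""
    n = len(y)
    Sy = [[Sigma[i][j] + (sigma2_e if i == j else 0.0) for j in range(n)]
          for i in range(n)]
    L = cholesky(Sy)
    # Forward substitution: solve L z = y, so that y' Sy^{-1} y = z'z.
    z = []
    for i in range(n):
        z.append((y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + logdet + sum(v * v for v in z))

# Two individuals with identical genotypes: Sigma is singular, Sigma_y is not.
Sigma = [[1.0, 1.0], [1.0, 1.0]]
ll = marginal_loglik([0.5, -0.5], Sigma, sigma2_e=1.0)
```

The example mirrors the situation described above: the two rows of Σ are identical, so Σ cannot be inverted, but adding the error variance to the diagonal yields a well-defined likelihood.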
Most of the posterior computation is quite straightforward, and details can be found in supporting information, File S1. Below we describe an efficient sampling scheme, hybrid MCMC, that is essential for dealing with the large-scale QTL mapping data.
Let υ = (log(τx1), …, log(τxp), log(τt1), …, log(τtq), log(τξ), log(τe)). Due to the complexity of the covariance form (3), one cannot sample from the full conditional posterior distributions of υ directly. The Metropolis–Hastings algorithm could be used with some proposal distribution, but it would explore the region of high probability by an inefficient random walk. To overcome this problem, the hybrid Monte Carlo method was proposed for sampling the hyperparameters in Gaussian process regression (Neal 1993, 1996; Rasmussen 1996; Barber and Williams 1997), and we adopt this approach here. The hybrid Monte Carlo method merges the Metropolis–Hastings algorithm with sampling techniques based on dynamics simulation.
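To sample the p + q + 2 elements of vector υ from their posterior distribution $p(\upsilon \mid y, \theta_{-\upsilon})$, we consider a physical system including p + q + 2 particles with the coordinate of the ith particle being $\upsilon_i$. The potential energy $E(\upsilon)$ of this system is defined in such a way that $p(\upsilon \mid y, \theta_{-\upsilon}) \propto \exp\{-E(\upsilon)\}$. To allow the use of the dynamic method, we introduce a “momentum” variable, φ, which has p + q + 2 real-valued components, $\phi_i$, in one-to-one correspondence with the components of υ. The kinetic energy of this system is defined as $K(\phi) = \frac{1}{2}\sum_i \phi_i^2$. Therefore, sampling υ from $p(\upsilon \mid y, \theta_{-\upsilon})$ is equivalent to sampling (υ, φ) from $p(\upsilon, \phi \mid y, \theta_{-\upsilon})$ and simply ignoring the momentum φ. The canonical distribution over (υ, φ) is defined to be $p(\upsilon, \phi \mid y, \theta_{-\upsilon}) \propto \exp\{-H(\upsilon, \phi)\}$, where $H(\upsilon, \phi) = E(\upsilon) + K(\phi)$ is the “Hamiltonian” function, which gives the total energy of the system. It is well known in physics that the evolution of such a canonical dynamical system through fictitious time s is governed by the following differential equations:
$$\frac{d\upsilon_i}{ds} = \frac{\partial H}{\partial \phi_i} = \phi_i, \qquad \frac{d\phi_i}{ds} = -\frac{\partial H}{\partial \upsilon_i} = -\frac{\partial E}{\partial \upsilon_i}. \tag{9}$$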
By simulating this dynamical system, the transitions of the Markov chain in the hybrid Monte Carlo method take place as follows:
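- Starting from the current state (υ(s), φ(s)), perform L steps on the basis of the discretized Equation 9 with step size ɛ, resulting in the state (υ*, φ*) = (υ(s + Lɛ), φ(s + Lɛ)). A single step from s to s + ɛ can be explicitly written as
$$\phi_i\!\left(s + \tfrac{\epsilon}{2}\right) = \phi_i(s) - \frac{\epsilon}{2}\,\frac{\partial E}{\partial \upsilon_i}\big(\upsilon(s)\big), \tag{10}$$
$$\upsilon_i(s + \epsilon) = \upsilon_i(s) + \epsilon\,\phi_i\!\left(s + \tfrac{\epsilon}{2}\right), \tag{11}$$
$$\phi_i(s + \epsilon) = \phi_i\!\left(s + \tfrac{\epsilon}{2}\right) - \frac{\epsilon}{2}\,\frac{\partial E}{\partial \upsilon_i}\big(\upsilon(s + \epsilon)\big), \tag{12}$$
where E denotes the potential energy.
- With probability $\min\{1, \exp[-(H(\upsilon^*, \phi^*) - H(\upsilon, \phi))]\}$, where H is the Hamiltonian (total energy), accept the new state (υ, φ) = (υ*, φ*); otherwise reject the new state and retain the old state with negated momentum (υ, φ) = (υ, −φ).
- Update the total energy of the system by perturbing the momenta according to $\phi_i \leftarrow c\,\phi_i + (1 - c^2)^{1/2} z_i$ for all i, where $z_i$ is drawn randomly from the standard normal distribution. The momentum causes the particle to continue in a consistent direction until a region of high energy (low probability) is encountered. Following Rasmussen (1996), we set $\epsilon = 0.5\,n^{-1/2}$ and c = 0.95.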
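The three steps above can be sketched as a single transition function. This is a minimal illustration under stated assumptions, not the implementation used for the analyses: the target is a toy standard normal with energy equal to half the sum of squared coordinates (in the QTL setting the energy would be the negative log of the marginalized posterior of υ), the discretization is the usual leapfrog scheme, and the names `hybrid_mc_step`, `energy`, and `grad` as well as the step size 0.2 are hypothetical choices.

```python
import math
import random

def hybrid_mc_step(v, phi, energy, grad, eps, n_leapfrog, c, rng):
    """One hybrid Monte Carlo transition with partial momentum refreshment."""
    h_old = energy(v) + 0.5 * sum(m * m for m in phi)
    v_new, phi_new = list(v), list(phi)
    for _ in range(n_leapfrog):
        g = grad(v_new)
        phi_new = [m - 0.5 * eps * gi for m, gi in zip(phi_new, g)]  # half step, momentum
        v_new = [x + eps * m for x, m in zip(v_new, phi_new)]        # full step, position
        g = grad(v_new)
        phi_new = [m - 0.5 * eps * gi for m, gi in zip(phi_new, g)]  # half step, momentum
    h_new = energy(v_new) + 0.5 * sum(m * m for m in phi_new)
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        v, phi = v_new, phi_new      # accept the proposed state
    else:
        phi = [-m for m in phi]      # reject: keep position, negate momentum
    # Perturb the momenta: phi_i <- c * phi_i + sqrt(1 - c^2) * z_i.
    s = math.sqrt(1.0 - c * c)
    phi = [c * m + s * rng.gauss(0.0, 1.0) for m in phi]
    return v, phi

# Toy target (an assumption for illustration): standard normal, E(v) = sum(v_i^2) / 2.
energy = lambda v: 0.5 * sum(x * x for x in v)
grad = lambda v: list(v)

rng = random.Random(1)
v, phi = [3.0, -3.0], [rng.gauss(0.0, 1.0) for _ in range(2)]
draws = []
for _ in range(5000):
    v, phi = hybrid_mc_step(v, phi, energy, grad, eps=0.2, n_leapfrog=1, c=0.95, rng=rng)
    draws.append(v[0])
mean = sum(draws) / len(draws)
```

With c close to 1, successive momenta are highly correlated, so the chain keeps moving in a consistent direction instead of performing a random walk; a rejection reverses that direction.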
SIMULATION AND REAL DATA ANALYSIS
Simulation of multiple-QTL models with or without higher-order interactions:
We simulated a backcross population with 200 individuals and a single chromosome with 151 evenly spaced markers at 5-cM intervals. To investigate the ability of the nonparametric Bayesian multiple-QTL analysis based on (2) to map higher-order interacting QTL that have no main effects, we simulated four interacting QTL without main effects and without lower-order interactions. The four simulated QTL are located at markers 9, 39, 69, and 99, respectively. The simulated function is $\eta(x_{i1}, \ldots, x_{i151}) = x_{i9}x_{i39}x_{i69}x_{i99}$, where the $x_{ik}$, k = 9, 39, 69, 99, are the genotype codes (1 and −1) of the four simulated QTL of individual i, and the error variance is $\sigma_e^2 = 1$. The total heritability of this model is 50%.
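A data set of this kind can be generated with a few lines of code. The sketch below is only an illustration of the simulation design: it assumes the Haldane map function (no crossover interference) to convert the 5-cM spacing into a recombination fraction and an error variance of 1, and the function names and random seed are hypothetical.

```python
import math
import random

def simulate_backcross(n, n_markers, spacing_cm, rng):
    """Genotype codes (1 / -1) for n backcross individuals on one chromosome."""
    # Haldane map function: recombination fraction between adjacent markers.
    r = 0.5 * (1.0 - math.exp(-2.0 * spacing_cm / 100.0))
    X = []
    for _ in range(n):
        g = [rng.choice((1, -1))]
        for _ in range(n_markers - 1):
            g.append(-g[-1] if rng.random() < r else g[-1])
        X.append(g)
    return X

def simulate_phenotypes(X, qtl, sigma2_e, rng):
    """y_i = product of genotype codes at the QTL markers + N(0, sigma2_e) noise."""
    y = []
    for g in X:
        eta = math.prod(g[k - 1] for k in qtl)  # markers are numbered from 1
        y.append(eta + rng.gauss(0.0, math.sqrt(sigma2_e)))
    return y

rng = random.Random(2010)
X = simulate_backcross(n=200, n_markers=151, spacing_cm=5.0, rng=rng)
y = simulate_phenotypes(X, qtl=(9, 39, 69, 99), sigma2_e=1.0, rng=rng)
```

Because the product of four codes takes the values 1 and −1, the genetic variance is close to 1, which together with an error variance of 1 gives the heritability of about 50%.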
For the analysis, we set αx0 = αt0 = αx1 = αt1 = 1, αξ = αe = 0.5, C = 100, and μξ = μe = 400. We also set axγ = atγ = 0.95 and bxγ = btγ = 0.05, so that the prior probabilities that each variable (QTL or nongenetic covariate) is relevant or irrelevant for the phenotype are 0.05 and 0.95, respectively. Figure 1a provides a plot of the posterior mean estimate of the hyperparameter γxk for each marker k vs. the marker position from the general model (2). As we hoped, the estimates of the hyperparameters associated with the true QTL markers are much larger than the estimates of the hyperparameters associated with the irrelevant markers, and all four, purely interacting QTL were identified on the basis of the marginal posterior probability of inclusion >0.5. Selecting all variables with marginal posterior probability of inclusion >0.5 produces the median probability model that is known to frequently correspond to the optimal predictive model while often differing from the highest probability model.
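The median probability model is obtained from the MCMC output by averaging the sampled inclusion indicators. A minimal sketch follows; the function name and the four toy draws are hypothetical.

```python
def median_probability_model(gamma_draws, threshold=0.5):
    """Posterior inclusion probabilities and the indices exceeding the threshold.

    gamma_draws: list of MCMC draws, each a list of 0/1 indicators
    (one indicator per variable).
    """
    n_draws = len(gamma_draws)
    n_vars = len(gamma_draws[0])
    incl = [sum(d[k] for d in gamma_draws) / n_draws for k in range(n_vars)]
    selected = [k for k, pk in enumerate(incl) if pk > threshold]
    return incl, selected

# Four hypothetical draws for three variables.
draws = [[1, 0, 0], [1, 0, 1], [1, 0, 0], [0, 0, 1]]
incl, selected = median_probability_model(draws)
print(incl, selected)  # [0.75, 0.0, 0.5] [0]
```

Only the first variable is selected here, because the rule requires an inclusion probability strictly greater than 0.5.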
For comparison, we also ran R/qtlbim (www.qtlbim.org/), a popular software for Bayesian multiple-QTL mapping developed by Yandell et al. (2007). R/qtlbim is an extensible, interactive environment for parametric Bayesian analysis of multiple interacting QTL models for experimental crosses (limited to two-way interactions). The results are summarized in Figure 1b. In the R/qtlbim manual (Banerjee et al. 2008), the following criteria are suggested for judging the significance of QTL: weak support if the Bayes factor (BF) falls between 3 and 10, moderate support if the BF falls between 10 and 30, strong support if the BF > 30, and no support if BF < 3. According to these criteria, R/qtlbim fails to detect any QTL simulated.
To further test the method, we then simulated data sets containing QTL that have only main effects (η(xi1,…, xi151) = 0.25(xi21 + xi51 + xi81 + xi111)) or main and two-way interaction effects (η(xi1, …, xi151) = 0.25 · (xi21 + xi81 + xi21xi51 + xi81xi111)). These are situations that R/qtlbim was specifically designed for. All other simulation parameters remained the same as in the previous simulation. Both models have a total heritability of 20% (∼5% heritability for each QTL). We used the same priors as in the previous simulation for the nonparametric method, and as before we used the default priors of R/qtlbim. Figure 2 summarizes the results for the additive model. Our nonparametric method detects three of the four QTL on the basis of the marginal posterior probability of inclusion (>0.5) and misses one QTL. Similarly, R/qtlbim detects the same three QTL with weak support (3 < BF < 10). For the model with the two-way interactions, results were very similar and are therefore not shown.
Our method and R/qtlbim use different criteria (median inclusion probability vs. BF-based selection) for the selection of a relevant subset of QTL. This difference is confounded with the comparison between the nonparametric and the linear parametric method in terms of their ability to detect existing QTL correctly. To overcome this problem, we varied the cutoffs imposed on the inclusion probability and BF, respectively, for declaring the significance of QTL, and we generated receiver operating characteristic (ROC) curves. For each scenario (four-way interaction, additive, and additive plus two-way interaction models as above), we ran 100 simulations. Instead of fixing the positions of the four simulated QTL, we generated their positions uniformly at random subject to the restriction that any pair of QTL had to be at least 10 cM apart. We divided the whole genome into nonoverlapping 10-cM-wide intervals. For a given cutoff (on inclusion probability or BF), a significant interval was defined as an interval that contains at least one marker whose significance measure exceeds the cutoff. A significant interval is defined as a true positive if it includes one of the simulated QTL. Otherwise, it is called a false positive. We defined true positive rate = (no. of significant, true intervals)/(no. of true intervals) and false positive rate = (no. of significant, false intervals)/(no. of false intervals), where a true interval is one that contains a simulated QTL and a false interval is one that does not. The ROC curves up to a false positive rate of 0.1 are given in Figure 3 for all three models simulated. For the four-way interaction model, our nonparametric method performed much better than R/qtlbim, which essentially failed to detect any QTL. It is interesting to see that our method appears to perform essentially as well as R/qtlbim for the model with both main and two-way interactions. It is even more interesting to find that our method is superior to R/qtlbim for the main effects model. 
This is because we ran R/qtlbim by searching for both main effects and two-way interactions simultaneously, even when analyzing the data generated under the pure main effects (additive) model.
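One point on such an interval-based ROC curve can be computed as sketched below. The sketch uses the conventional ROC denominators (number of intervals that contain a simulated QTL and number of intervals that do not) and half-open intervals, so that a marker on a boundary belongs to the interval starting there; both conventions, the function name, and the toy scores are assumptions made for illustration.

```python
def interval_rates(scores, positions_cm, qtl_cm, cutoff, width_cm=10.0):
    """True and false positive rates over nonoverlapping intervals.

    scores: per-marker significance measure (inclusion probability or BF).
    positions_cm: marker positions; qtl_cm: positions of the simulated QTL.
    """
    n_int = int(max(positions_cm) // width_cm) + 1
    significant = {int(pos // width_cm)
                   for s, pos in zip(scores, positions_cm) if s > cutoff}
    true_int = {int(q // width_cm) for q in qtl_cm}
    n_false = n_int - len(true_int)
    tpr = len(significant & true_int) / len(true_int)
    fpr = len(significant - true_int) / n_false
    return tpr, fpr

# Hypothetical scores for 8 markers at 5-cM spacing; QTL at 10 cM and 30 cM.
positions = [5.0 * k for k in range(8)]  # 0, 5, ..., 35 cM
scores = [0.1, 0.0, 0.9, 0.2, 0.7, 0.1, 0.3, 0.0]
tpr, fpr = interval_rates(scores, positions, qtl_cm=[10.0, 30.0], cutoff=0.5)
print(tpr, fpr)  # 0.5 0.5
```

Repeating the computation over a grid of cutoffs and averaging over simulation replicates traces out the full curve.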
Real data analysis:
In addition to the simulation, we tested our method on a real mouse study on obesity, a major risk factor for type II diabetes. To genetically dissect a polygenic mouse model of obesity-driven type II diabetes, Reifsnyder et al. (2000) outcrossed the obese, diabetes-prone, New Zealand obese (NZO)/HlLt strain to the relatively lean nonobese nondiabetic (NON)/Lt strain and then reciprocally backcrossed obese F1 mice to the lean NON/Lt parental strain. They measured the body weights of 187 backcross males. In addition, inguinal, gonadal, retroperitoneal, and mesenteric fat pad weights were also measured. Stylianou et al. (2006) studied the fat pad weights using F2 progeny between the SM/J and NZB/BINJ inbred mouse strains. They identified several QTL associated with the gonadal fat pad weight after adjusting for the total lean body weight (LBWT). Following Stylianou et al. (2006), we first calculated the total fat pad weight as the mesenteric fat pad weight plus twice the sum of the inguinal, gonadal, and retroperitoneal fat pad weights. Then the LBWT was obtained as the difference between the total body weight and the total weight of the fat pads. We applied our nonparametric Bayesian variable selection method to the Reifsnyder et al. (2000) data. The results are presented in Figure 4. Clearly, 2 among the 86 predictors (85 markers plus the continuous covariate LBWT) are selected. The first ranked predictor is the covariate LBWT, and the second ranked predictor is marker D4Mit311 located on chromosome 4. Figure 4 strongly indicates a QTL on chromosome 4 while other QTL in the genome are (much) less likely. Further studies based on this observation can be done by investigating the relationship between the phenotype and these two variables in more detail. For each genotype of the QTL identified on chromosome 4, we estimated the weight curve function on LBWT, and the results are reported in Figure 5. 
From the two estimated curves, there is no clear evidence for an interaction between the QTL and LBWT.
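For reference, the lean body weight covariate described above follows from simple arithmetic on the measured weights. The sketch below restates that calculation; the function name and the example weights are hypothetical.

```python
def lean_body_weight(body_wt, inguinal, gonadal, retroperitoneal, mesenteric):
    """LBWT = total body weight - total fat pad weight.

    The inguinal, gonadal, and retroperitoneal pads are doubled, following
    the text; the mesenteric pad is counted once.
    """
    total_fat = mesenteric + 2.0 * (inguinal + gonadal + retroperitoneal)
    return body_wt - total_fat

# Hypothetical weights in grams for one mouse.
lbwt = lean_body_weight(45.0, inguinal=1.0, gonadal=1.5,
                        retroperitoneal=0.5, mesenteric=1.0)
print(lbwt)  # 38.0
```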
DISCUSSION
In this article, we have proposed a novel nonparametric QTL mapping method where the genetic as well as nongenetic factors are modeled via a function η, whose form is unspecified. The advantage of our approach is that it models all potential genetic and nongenetic effects, including main effects and all interaction effects of any order, nonexplicitly. It determines only which of the genetic and nongenetic factors are important, on their own through main effects and/or in combination with other factors. This was achieved by combining the Gaussian process prior for the unknown function with variable selection. Although in this article we assumed that all putative QTL are located at the marker positions, it is straightforward to extend the method to consider any candidate QTL in between marker positions as in Wang et al. (2005) and Huang et al. (2010). A similar nonparametric variable selection procedure has been proposed for computer experiments by Linkletter et al. (2006). These authors mainly focused on identifying active factors having nonlinear relationships with the response variable. However, mapping multiple interacting QTL is our main purpose, and our article appears to be the first one to propose modeling the joint action of multiple QTL with an unknown function having a Gaussian process prior, which accommodates any multiway interactions. Moreover, Linkletter et al. (2006) consider only a relatively small (<50) number of continuous covariates while in our article and in QTL linkage and association mapping in general, there are a large number of discrete marker covariates (hundreds or thousands) in addition to a small number of environmental, continuous covariates or discrete factors. Therefore, an efficient sampling scheme, such as the hybrid MCMC described in this article, is essential for dealing with these large-scale data sets.
While the linear parametric method in R/qtlbim may have little or no power to detect QTL acting through higher-order interactions, computationally it is fast, and it can handle large numbers of individuals and markers. We do not recommend replacing the linear parametric analysis with the nonparametric method, but rather using it as an additional or preliminary tool to screen the genome for QTL acting through higher-order interactions, which existing QTL mapping methods fail to detect. Once important factors have been identified with the nonparametric method, they then can and should be further analyzed with a detailed parametric model to elucidate the mode of action of the identified QTL (and environmental factors). Application of a detailed parametric method on a genome-wide scale to search for all possible main and interaction effects would dramatically increase the multiple-testing problem, in addition to the computational burden, while the nonparametric method can identify all these effects with a single parameter per candidate QTL (and environmental factor).
Our current research focuses on further improving the computational feasibility of our nonparametric method. Our current implementation of the Bayesian Gaussian process prior method, with the mixture priors on the variable selection parameters (ρ's) and using the hybrid Monte Carlo method, allows us to analyze data sets with up to several hundred individuals and several hundred markers, in hours rather than in minutes as with R/qtlbim. A major reason for this increase in computing time is the need to compute the inverse of an n × n matrix in each MCMC cycle to sample υ. This is particularly a problem for genome-wide association studies (GWAS), for which our nonparametric method is also potentially useful. GWAS typically require a larger sample size than linkage studies (on the order of thousands or tens of thousands) and several hundred thousand markers (tag SNPs). Further, in this article, we propose a simple Gibbs sampler for the latent binary variables that code for inclusion of a marker in the covariance function. For QTL mapping with only hundreds of markers, this algorithm works well. For very large p, the algorithm may not mix properly over the huge sample space, a legitimate concern when we apply the method directly to GWAS data where hundreds of thousands of SNPs are available. Berger and Molina (2005) propose an approach to search for important models through a large model space without visiting every model. Their approach provides a promising alternative, which deserves further investigation for GWAS data. Besides alternative sampling schemes, we are currently investigating shrinkage priors to replace the mixture priors and increase computational efficiency, and we are exploring deterministic algorithms to replace MCMC sampling, in particular a conjugate gradient optimization technique to compute the maximum a posteriori estimates of the parameters (Rasmussen 1996).
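To make the per-cycle cost discussed above concrete, the following is a minimal sketch, not our implementation, of one evaluation of a Gaussian process log marginal likelihood. It assumes a covariance function of the product form used by Linkletter et al. (2006), with one parameter ρ per marker and ρ = 1 excluding that marker; the names `gp_covariance`, `scale`, and `noise` are hypothetical and chosen only for illustration. The Cholesky factorization is the O(n³) step that has to be repeated whenever the covariance parameters change.

```python
import math

def gp_covariance(X, rho, scale=1.0, noise=0.1):
    """Assumed covariance: scale * prod_k rho[k] ** (x_ik - x_jk)**2,
    with `noise` added on the diagonal. rho[k] == 1 makes marker k inert."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            c = scale
            for k, r in enumerate(rho):
                c *= r ** ((X[i][k] - X[j][k]) ** 2)
            K[i][j] = K[j][i] = c
        K[i][i] += noise
    return K

def cholesky(K):
    """Lower-triangular L with L L^T = K; this is the O(n^3) step."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_marginal_likelihood(y, K):
    """log N(y | 0, K), computed from the Cholesky factor of K."""
    n = len(y)
    L = cholesky(K)
    # Forward substitution solves L z = y, so that z^T z = y^T K^{-1} y.
    z = []
    for i in range(n):
        z.append((y[i] - sum(L[i][m] * z[m] for m in range(i))) / L[i][i])
    quad = sum(v * v for v in z)
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (quad + logdet + n * math.log(2.0 * math.pi))

# Toy data: 3 individuals, 2 markers coded 0/1/2 (illustrative values only).
X = [[0, 1], [1, 1], [2, 0]]
y = [0.1, 0.5, -0.3]
K = gp_covariance(X, rho=[0.5, 1.0])  # second marker excluded via rho = 1
loglik = log_marginal_likelihood(y, K)
```

In this sketch, setting a marker's ρ to 1 yields the same covariance matrix as dropping that marker from X, which is why a single parameter per marker can carry the variable selection. A factorization of this kind is needed in every cycle, so the cost grows cubically with the number of individuals.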
A genome-wide data set may first be analyzed with the deterministic implementation to screen out many variables (predictors) that are clearly irrelevant. Then the selected, promising subset of predictors (markers, genomic regions) may be reanalyzed by full MCMC, which provides much more information than a deterministic mode-finding algorithm. With an initial implementation of shrinkage priors and the conjugate gradient optimization technique we have been able to analyze a data set in a candidate gene association study with ∼900 participants and 2500 tag SNPs.
Selection of a subset of QTL can be performed on the basis of the estimated marginal posterior probabilities of inclusion, with the cutoff determined by the median probability model or the Bayesian false discovery rate. Alternatively, we may add pseudonull variable(s) into the model and use the posterior distribution of their γ's to guide the variable selection. Linkletter et al. (2006) suggested adding a single pseudonull variable but running the analysis many times (say 100). This approach is feasible for their smaller problems but is computationally very demanding or infeasible in the QTL mapping context. Furthermore, adding a single pseudonull variable would not work well for QTL mapping because marker (null) variables are correlated due to linkage. Wu et al. (2007) proposed a similar idea for variable selection in linear regression models using a set of pseudonull variables. Their method does not require the repeated analyses of Linkletter et al. (2006) and can also incorporate the linkage structure of the observed markers into the generation of the pseudonull variables. We are planning to extend the method of Wu et al. (2007) to our Gaussian process-based QTL selection methodology.
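The two cutoff rules mentioned at the start of this paragraph can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used for our analyses: the function names and the example probabilities are hypothetical, the median probability model is taken to retain every candidate with posterior inclusion probability of at least 0.5, and the Bayesian false discovery rate of a selected set is taken to be the average of (1 − inclusion probability) over that set.

```python
def median_probability_model(pip):
    """Indices of candidates whose posterior inclusion probability is >= 0.5."""
    return [k for k, p in enumerate(pip) if p >= 0.5]

def bayesian_fdr_selection(pip, alpha=0.05):
    """Largest set of top-ranked candidates whose estimated Bayesian FDR,
    the mean of (1 - pip) over the selected set, does not exceed alpha."""
    order = sorted(range(len(pip)), key=lambda k: pip[k], reverse=True)
    selected, expected_false = [], 0.0
    for k in order:
        new_expected_false = expected_false + (1.0 - pip[k])
        # Candidates are visited in decreasing pip, so the running mean can
        # only grow; stop at the first candidate that would push it past alpha.
        if new_expected_false / (len(selected) + 1) > alpha:
            break
        selected.append(k)
        expected_false = new_expected_false
    return sorted(selected)

# Hypothetical posterior inclusion probabilities for five candidate QTL.
pip = [0.99, 0.20, 0.95, 0.60, 0.01]
mpm = median_probability_model(pip)           # candidates 0, 2 and 3
fdr = bayesian_fdr_selection(pip, alpha=0.05)  # candidates 0 and 2
```

In this toy example the Bayesian FDR rule is the stricter of the two: candidate 3, with inclusion probability 0.60, passes the median probability cutoff but would raise the estimated FDR of the selected set above 0.05.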
Much work has been done recently on sparse signal detection in (generalized) linear regression models, where sparsity priors fall into two groups: shrinkage (one-group) priors and mixture (two-group or multiple-group) priors. Here we have employed a mixture prior for the parameters related to variable selection. Our current and future work focuses on further studies and modifications of this mixture prior and of alternative shrinkage priors. The goal of our present article was to demonstrate convincingly that the nonparametric Bayesian analysis based on the Gaussian process prior is indeed able to detect QTL irrespective of whether they act on the trait of interest through main effects, any order of interaction among QTL, or interactions of QTL with environmental factors.
Acknowledgments
The authors thank the editor, the associate editor, and referees for their helpful comments and suggestions, which led to a great improvement of this article. This research was partially supported by National Institutes of Health grant GM074175.
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.113688/DC1.
References
- Abrahamsen, P., 1997. A review of Gaussian random fields and correlation functions. Technical Report 917. Norwegian Computing Center, Oslo.
- Akaike, H., 1974. A new look at the statistical model identification. IEEE Trans. Automat. Contr. 19 716–723.
- Banerjee, S., B. S. Yandell, W. W. Neely and N. Yi, 2008. QTL Analysis Using Bayesian Interval Mapping. University of Birmingham, Birmingham, AL. http://www.ssg.uab.edu/qtlbim/assets/docs/qtlbim.overview.pdf
- Barber, D., and C. K. I. Williams, 1997. Gaussian processes for Bayesian classification via hybrid Monte Carlo, pp. 340–346 in Advances in Neural Information Processing Systems 9, edited by M. C. Mozer, M. I. Jordan and T. Petsche. MIT Press, Cambridge, MA.
- Berger, J. O., and G. Molina, 2005. Posterior model probabilities via path-based pairwise priors. Stat. Neerl. 59 3–15.
- Broman, K. W., and T. P. Speed, 2002. A model selection approach for the identification of quantitative trait loci in experimental crosses (with discussion). J. R. Stat. Soc. Ser. B 64 641–656, 737–775.
- Chen, H., 1988. Convergence rates for parametric components in a partly linear model. Ann. Stat. 16 136–146.
- Cuzick, J., 1992. Semiparametric additive regression. J. R. Stat. Soc. Ser. B 54 831–843.
- Doerge, R. W., Z.-B. Zeng and B. S. Weir, 1997. Statistical issues in the search for genes affecting quantitative traits in experimental populations. Stat. Sci. 12 195–219.
- George, E. I., and R. E. McCulloch, 1993. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 88 881–889.
- Godsill, S. J., 2001. On the relationship between Markov chain Monte Carlo methods for model uncertainty. J. Comput. Graph. Stat. 10 230–248.
- Godsill, S. J., 2003. Proposal Densities, and Product Space Methods, in Highly Structured Stochastic Systems. Oxford University Press, London/New York/Oxford.
- Green, P. J., 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 711–732.
- Hastie, T. J., and C. Loader, 1993. Local regression: automatic kernel carpentry (with discussion). Stat. Sci. 8 120–143.
- Heckman, N., 1986. Spline smoothing in a partly linear model. J. R. Stat. Soc. Ser. B 48 244–248.
- Huang, H., H. Zhou, F. Cheng, I. Hoeschele and F. Zou, 2010. Gaussian process based Bayesian semiparametric quantitative trait loci interval mapping. Biometrics 66 222–232.
- Jansen, R. C., and P. Stam, 1994. High resolution of quantitative traits into multiple quantitative trait in line crosses using flanking markers. Heredity 69 315–324.
- Kwee, L. C., D. Liu, X. Lin, D. Ghosh and M. P. Epstein, 2008. A powerful and flexible multilocus association test for quantitative traits. Am. J. Hum. Genet. 82 386–397.
- Lander, E., and D. Botstein, 1989. Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121 185–199.
- Linkletter, C., D. Bingham, N. Hengartner, D. Higdon and K. Q. Ye, 2006. Variable selection for Gaussian process models in computer experiments. Technometrics 48 478–490.
- MacKay, D. J., 1998. Introduction to Gaussian processes, pp. 133–166 in Neural Networks and Machine Learning (NATO Asi Series, Vol. 168. Series F, Computer and Systems Sciences), edited by C. M. Bishop. Springer-Verlag, Berlin/Heidelberg, Germany/New York.
- Marchini, J., P. Donnelly and L. R. Cardon, 2005. Genome-wide strategies for detecting multiple loci influencing complex diseases. Nat. Genet. 37 413–417.
- Menzefricke, U., 2000. Hierarchical modeling with Gaussian processes. Commun. Stat. 29 1089–1108.
- Müller, P., G. Parmigiani and K. Rice, 2006. FDR and Bayesian Multiple Comparisons Rules (Working Paper 115). Department of Biostatistics Working Papers, Johns Hopkins University, Baltimore. http://www.bepress.com/jhubiostat/paper115
- Neal, R. M., 1993. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93–1. Department of Computer Science, University of Toronto, Toronto.
- Neal, R. M., 1996. Bayesian Learning for Neural Networks. Springer-Verlag, New York.
- Neal, R. M., 1997. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report No. 9702. Department of Statistics, University of Toronto, Toronto.
- O'Hagan, A., 1978. On curve fitting and optimal design for regression. J. R. Stat. Soc. B 40 1–42.
- Park, T., and G. Casella, 2008. The Bayesian lasso. J. Am. Stat. Assoc. 103 681–686.
- Rasmussen, C. E., 1996. Evaluation of Gaussian processes and other methods for non-linear regression. Ph.D. Thesis, University of Toronto, Toronto.
- Reifsnyder, P. C., G. A. Churchill and E. H. Leiter, 2000. Maternal environment and genotype interact to establish diabesity in mice. Genome Res. 10 1568–1578.
- Schwarz, G., 1978. Estimating the dimension of a model. Ann. Stat. 6 461–464.
- Speckman, P., 1988. Kernel smoothing in partial linear models. J. R. Stat. Soc. B 50 413–436.
- Stylianou, I. M., R. Korstanje, R. Li, S. Sheehan, B. Paigen et al., 2006. Quantitative trait locus analysis for obesity reveals multiple networks of interacting loci. Mamm. Genome 17 22–36.
- Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58 267–288.
- Wahba, G., 1984. Cross validated spline methods for the estimation of multivariate functions from data on functionals, pp. 205–235 in Statistics, an Appraisal, Proceedings of the Iowa State University Statistical Laboratory 50th Anniversary Conference, edited by H. A. David and H. T. David. Iowa State University Press, Ames, IA.
- Wang, H., Y. M. Zhang, X. Li, G. L. Masinde, S. Mohan et al., 2005. Bayesian shrinkage estimation of quantitative trait loci parameters. Genetics 170 465–480.
- Williams, C. K. I., and D. Barber, 1998. Bayesian classification with Gaussian processes. IEEE Trans. Patt. Anal. Mach. Intell. 20 1342–1351.
- Wu, Y., D. D. Boos and L. A. Stefanski, 2007. Controlling variable selection by the addition of pseudovariables. J. Am. Stat. Assoc. 102 235–243.
- Yandell, B. S., T. Mehta, S. Banerjee, D. Shriner, R. Venkataraman et al., 2007. R/qtlbim: QTL with Bayesian interval mapping in experimental crosses. Bioinformatics 23 641–643.
- Yi, N., 2004. A unified Markov chain Monte Carlo framework for mapping multiple quantitative trait loci. Genetics 167 967–975.
- Yi, N., and S. Xu, 2008. Bayesian LASSO for QTL mapping. Genetics 179 1045–1055.
- Yi, N., D. Shriner, S. Banerjee, T. Mehta, D. Pomp et al., 2007. An efficient Bayesian model selection approach for interacting quantitative trait loci models with many effects. Genetics 176 1865–1877.
- Zeng, Z. B., 1994. Precision mapping of quantitative traits loci. Genetics 136 1457–1468.