Abstract
We develop a novel Bayesian method to select important predictors in regression models with multiple responses of diverse types. A sparse Gaussian copula regression model is used to account for the multivariate dependencies between any combination of discrete and/or continuous responses and their association with a set of predictors. We utilize the parameter expansion for data augmentation strategy to construct a Markov chain Monte Carlo algorithm for the estimation of the parameters and the latent variables of the model. Based on a centered parametrization of the Gaussian latent variables, we design a fixed-dimensional proposal distribution to update jointly the latent binary vectors of important predictors and the corresponding non-zero regression coefficients. For Gaussian responses and for outcomes that can be modeled as a dependent version of a Gaussian response, this proposal leads to a Metropolis-Hastings step that allows an efficient exploration of the predictors’ model space. The proposed strategy is tested on simulated data and applied to real data sets in which the responses consist of low-intensity counts, binary, ordinal and continuous variables.
Keywords: Gaussian copula, Mixed data, Multiple-response regression models, Sparse covariance matrix, Variable selection
1. Introduction
The identification of important predictors in linear and non-linear regression models is one of the most frequently studied questions in statistical theory. In Bayesian statistics, this problem is known as Bayesian variable selection (BVS) and, for Gaussian responses, there is an extensive literature on the efficient detection of important predictors in both single- and multi-response regression models. In the case of a single response, a list of relevant papers includes, but is not limited to, George and McCulloch (1997), Liang et al. (2008), Guan and Stephens (2011) and Ročková and George (2018), while Brown et al. (1998) and Holmes et al. (2002) deal with the same problem in the multi-response case. BVS in single-response non-linear regression models has also received great attention. In Dellaportas et al. (2002) and Forster et al. (2012) the corresponding methods are reviewed and advances are proposed.
More recently, there has been an increasing interest in the joint analysis of outcomes of diverse types, for instance, continuous, binary, categorical and count data, given their availability from studies involving multivariate data, see for example Hoff (2007), Murray et al. (2013), Zhang et al. (2015) and Bhadra et al. (2018). In regression analysis, the most popular model used to account for the multivariate dependencies between any combination of discrete and/or continuous responses is the Gaussian copula regression (GCR) model (Song et al., 2009) in which each response is associated with a (potentially different) set of predictors. When only Gaussian responses are considered, this is known as the Seemingly Unrelated Regression (SUR) model (Zellner, 1962). Recent contributions for sparse Bayesian SUR models include, for instance, Wang (2010) and Deshpande et al. (2019). Bayesian methods for the estimation of the regression coefficients of the GCR model with a fixed set of predictors have also been proposed (Pitt et al., 2006). Despite the growing Bayesian literature regarding an efficient selection of important predictors, to the best of our knowledge, variable selection for the GCR model has not been attempted. We propose here the first fully Bayesian approach for model selection in regression models with multiple responses of diverse types.
The main obstacle to the application of BVS in single-response non-linear models, as well as in the SUR model and the GCR model with responses of diverse types, is the non-tractability of the marginal likelihood. To overcome this problem, Markov chain Monte Carlo (MCMC) algorithms are based either on Laplace approximation, see for example Bové and Held (2011), or on a Metropolis-Hastings (M-H) step in which the dimension of the proposal distribution is not fixed at each iteration (Forster et al., 2012). The latter is an application of the Reversible Jump algorithm (Green, 1995) which is known to experience a low acceptance rate when the trans-dimensional proposal distribution is not devised carefully, resulting in MCMC samplers with poor mixing (Brooks et al., 2003). In addition, in current applications, the number of predictors is often very large and any MCMC algorithm for BVS in both linear and non-linear regression has to be designed carefully in order to successfully explore the ultra-high dimensional model space consisting of all the possible subsets of predictors (Bottolo and Richardson, 2010; Lamnisos et al., 2009).
The main contribution of this paper is the development of a Bayesian approach for the joint update of the latent binary vector of important predictors and the corresponding vector of non-zero regression coefficients for each response of the GCR model. By using the proposed strategy, we perform BVS without any approximation. We also avoid the Reversible Jump algorithm by utilizing the Gaussian latent variables of the GCR model to construct a proposal distribution defined on a fixed-dimensional space. For Gaussian responses and for the outcomes that can be modeled as a dependent version of a Gaussian response, e.g., the Probit model for binary data, the designed proposal distribution allows the “implicit marginalisation” of the non-zero regression coefficients in the M-H acceptance probability (Holmes and Held, 2006), favoring the exploration of the predictors model space in a more efficient manner.
Modeling the dependence amongst the responses is another key aspect in GCR. Until recently devising an efficient MCMC algorithm for a structured (constrained) covariance matrix that includes the identifiability conditions for the non-linear responses has been a difficult task. Here, we follow the solution proposed by Talhouk et al. (2012) and specify a conjugate prior on the correlation matrix. We utilize the parameter expansion for data augmentation (Liu and Wu, 1999; Van Dyk and Meng, 2001) to expand the correlation into a covariance matrix. We also adopt the idea of covariance selection to obtain a parsimonious representation of the dependence amongst the responses. Based on the theory of decomposable Gaussian graphical models (Lauritzen, 1996), we use the hyper-inverse Wishart distribution as the prior density for the covariance matrix. This prior allows some of the off-diagonal elements of the inverse covariance matrix to be identical to zero and to estimate the conditional dependence pattern of the observations (Webb and Forster, 2008).
We tested the performance of our model, Bayesian Variable Selection for Gaussian Copula Regression (BVSGCR), in a comprehensive simulation study and compared the performance of our new approach with conventional Bayesian methods for the selection of important predictors in single-response (linear and non-linear) regression models, see Holmes and Held (2006), Frühwirth-Schnatter et al. (2009) and Dvorzak and Wagner (2016).
We also applied the proposed method to two real data sets. The first data set includes a combination of nine continuous, binary and ordered categorical responses. These responses are phenotypic traits of a rare disorder called Ataxia-Telangiectasia. We analyzed one of the largest cohorts of patients, consisting of 46 individuals affected by the disease (Schon et al., 2019). Our model borrows information across the outcomes in order to identify important associations between the responses and a set of genetic and immunological predictors that have been collected in the same study. The second data set consists of four counts and one ordered categorical response which are measured in 122 individuals suffering from Temporal Lobe Epilepsy (Johnson et al., 2016). Our interest lies in the identification of associations between the responses and 162 correlated genes that have been identified in a recent gene-network analysis related to cognition abilities and epilepsy (Johnson et al., 2016). Finally, in both real data sets, we compared the predictive ability of the BVSGCR model with widely-used single-response linear and non-linear sparse regression models.
The rest of the paper is organized as follows. In Section 2 we provide a brief presentation of the GCR model and the prior distributions on the regression coefficients and the correlation structure. In Section 3 we describe the novel MCMC algorithm that we propose for BVS when a combination of discrete and/or continuous responses is considered. Section 4 presents the results of the simulation study and in Section 5 we apply the proposed model to two real data sets with missing values in the outcome variables, which leads to a straightforward modification of the designed MCMC algorithm. Finally, in Section 6 we conclude with a short discussion.
2. Gaussian copula regression model
In the following, all vectors, in bold font, are understood as column vectors and the superscript “T” is used to denote the transpose of a vector or a matrix. Matrices are also indicated in bold font. The lower-case notation will be reserved for the observations with the corresponding random variables in capital letters.
2.1. Gaussian copulas
An m-variate function C(u1, …, um), where C : [0, 1]m → [0, 1], is called a copula if it is a continuous distribution function and each marginal is a uniform distribution function on [0, 1]. Sklar (1959) proved that any joint cumulative distribution function (cdf) of continuous random variables can be completely specified by its marginal distributions and a unique copula C. If F1(·), …, Fm(·) are the marginal cdfs of a combination of m continuous and discrete random variables Y1, …, Ym, their joint cdf can be specified through a specific copula function C as

F(y1, …, ym) = C(F1(y1), …, Fm(ym)).
A copula function which is commonly used for modeling the dependence structure of any combinations of continuous and discrete variables is the Gaussian copula, see for example Hoff (2007) and Murray et al. (2013). The Gaussian copula C is specified through the function
C(u1, …, um; R) = Φm(Φ−1(u1), …, Φ−1(um); R), | (1) |
where Φm(·; R) is the cdf of an m-variate Gaussian distribution with zero mean vector and correlation matrix R and Φ−1(·) is the inverse of the univariate standard normal cdf. Thus, taking uk = Fk(yk) in eq. (1), for each k = 1, …, m, we specify the cdf of Y = (Y1, …, Ym) to be the Gaussian copula function. Song (2000) proves that the density of the Gaussian copula is
c(u1, …, um; R) = |R|−1/2 exp{−zT(R−1 − Im)z/2}, | (2) |
where z = (z1, …, zm)T is an m-dimensional vector and zk = Φ−1(Fk(yk)), k = 1, …, m, is known as the normal score which follows marginally the standard Gaussian distribution. Then, Z = (Z1, …, Zm)T is an m-variate Gaussian random vector with zero mean vector and correlation matrix R, i.e., Z ∼ Nm(0, R). If all the marginal distributions are continuous, the matrix R can be interpreted as the correlation matrix of the elements of Y and zeros in its inverse imply the conditional independence among the corresponding elements of Y. However, in the presence of discrete random variables, the notion of conditional independence has to be interpreted with care; zeros in R−1 imply that the observed variables are independent conditionally only on the latent variables (Webb and Forster, 2008). Note also that if R is the identity matrix the elements of Y can be considered to be independent despite the presence of discrete variables, see Song (2000), Song et al. (2009) and Talhouk et al. (2012) for a detailed discussion.
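As a concrete illustration of the construction in eqs. (1)-(2), the following stdlib-only Python sketch draws from a bivariate Gaussian copula with correlation r and then maps the uniform scores through two arbitrary margins, one continuous (exponential) and one discrete (Bernoulli). The function name and the choice of margins are ours, for illustration only, and are not part of the paper.

```python
import math
import random
from statistics import NormalDist

std_normal = NormalDist()

def rgaussian_copula(n, r, seed=0):
    """Draw n pairs (y1, y2) whose dependence is a bivariate Gaussian
    copula with correlation r; y1 has an Exp(1) margin, y2 a Bern(0.5) margin."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = e1                                       # normal scores with corr(z1, z2) = r
        z2 = r * e1 + math.sqrt(1 - r ** 2) * e2
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)   # uniform marginals
        y1 = -math.log(1.0 - u1)                      # F1^{-1}: exponential margin
        y2 = int(u2 > 0.5)                            # F2^{-1}: Bernoulli margin
        out.append((y1, y2))
    return out
```

Even though y1 is continuous and y2 binary, a positive r induces positive association between them, which is the mechanism the GCR model exploits.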
2.2. Regression model
Let Y = (y1, …, yn)T be the (n × m)-dimensional matrix of observed data where each yi = (yi1, …, yim) consists of any combination of m discrete and/or continuous responses. Moreover, let xik be the pk-dimensional vector of predictors for the ith sample and the kth response. To model the marginal distribution of each yik, we specify its cdf Fk(yik; xik, βk, θk) to be the cdf of any parametric distribution. Our notation emphasizes the dependence of each yik on the pk-dimensional response-specific vector of predictors xik, the associated regression coefficients βk and the response-specific parameters θk.
The GCR model is described by the transformation
z̃ik = Φ−1(Fk(yik; xik, βk, θk)), | (3) |
where z̃i = (z̃i1, …, z̃im)T are realizations from Nm(0, R), for each i = 1, …, n and k = 1, …, m.
By assuming that each Fk(·) is a member of the exponential family, we obtain the multivariate Generalized Linear Model presented in Song et al. (2009) as the multivariate extension of the well-known single-response Generalized Linear Model (McCullagh and Nelder, 1989) with θk the specific vector of parameters for the kth response. The SUR model is a special case of eq. (3) when all margins are univariate Gaussian with mean xikTβk and variance θk and R is the correlation matrix. The multi-response Probit regression model of Chib and Greenberg (1998) is obtained from eq. (3) by specifying each margin to be the cdf of a Bernoulli random variable with probability of success Φ(xikTβk). Finally, setting Fk(c; xik, βk, θk) = Φ(θkc − xikTβk), c = 1, …, Ck − 1, with Ck the number of categories, eq. (3) becomes a regression model for a combination of binary and ordinal observations and θk = (θk1, …, θk,Ck−1) consists of the cut-points for the kth ordinal observation (McCullagh, 1980). In both multi-response Probit and ordinal regression models the matrix R is in the correlation form for identifiability conditions (Chib and Greenberg, 1998).
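The ordinal probit margin above maps a linear predictor and a set of cut-points to category probabilities. A minimal Python sketch of this mapping (the function name is ours; the cut-point values in the usage are hypothetical):

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal cdf

def ordinal_probs(eta, cutpoints):
    """Category probabilities implied by the ordinal probit margin
    F(c) = Phi(theta_c - eta), with C = len(cutpoints) + 1 categories.
    `eta` is the linear predictor x'beta; `cutpoints` must be increasing."""
    cdf = [Phi(t - eta) for t in cutpoints] + [1.0]   # F(1), ..., F(C)
    return [cdf[0]] + [cdf[c] - cdf[c - 1] for c in range(1, len(cdf))]
```

For example, `ordinal_probs(0.3, [-0.5, 0.5, 1.2])` returns four probabilities summing to one; pushing `eta` up shifts mass towards the highest category, which is how the regression coefficients act on an ordinal response.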
Eq. (3) implies that the joint likelihood function of the observations Y conditional on the correlation matrix R, the regression coefficients B = vec(β1, …, βm) and the parameters Θ = vec(θ1, …, θm) is an intractable function of non-computable high-dimensional integrals (Song et al., 2009).
2.3. Prior distributions
In this section, we specify the prior distributions on the regression coefficients B, the associated sparsity prior on the inclusion probability and the correlation matrix R of the model. By noting that for different choices of the marginal cdfs Fk(·), k = 1, …, m, we have different vectors of parameters θk, we will assign their prior distributions differently in each simulated and real data example. See Supp. Mat. Section S.1 for details.
2.3.1. Variable selection
We use a hierarchical non-conjugate model to assign a prior distribution on the regression coefficients of the GCR model defined in eq. (3). A point mass at zero is specified on the regression coefficients of the unimportant predictors, whereas a Gaussian distribution is assigned to the non-zero regression coefficients (George and McCulloch, 1997). By utilizing the binary latent vector γk = (γk1, …, γkpk)T, k = 1, …, m, where γkj is 1 if, for the jth predictor and the kth response, the regression coefficient is different from zero and 0 otherwise, we assume that for each k
βkj | γkj ∼ (1 − γkj)δ0 + γkjN(0, υ), j = 1, …, pk, | (4) |
γkj | πk ∼ Bern(πk), πk ∼ Beta(ak, bk), | (5) |
where δ0 denotes a point mass at zero and υ is a fixed value. It is common practice to standardize the predictor variables, taking υ = 1 in order to place appropriate prior mass on reasonable values of the non-zero regression coefficients (Hans et al., 2007). Integrating out πk, it is readily shown that marginally
p(γk) = Beta(ak + |γk|, bk + pk − |γk|)/Beta(ak, bk), | (6) |
where |γk| = Σj γkj is the number of non-zero regression coefficients for the kth response. The hyper-parameters ak and bk can be chosen using prior information about the number of important covariates associated with the kth response and its variance. See Kohn et al. (2001) for further details on the elicitation of the hyper-parameters ak and bk in eq. (5).
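The beta-binomial marginal prior obtained by integrating out πk can be evaluated with log-beta functions. A stdlib-only Python sketch (function names are ours, not from the paper):

```python
import math

def log_beta(a, b):
    """log Beta(a, b) via log-gamma, numerically stable."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_prior_gamma(gamma, a_k, b_k):
    """Log of the marginal prior p(gamma_k) obtained by integrating the
    Bernoulli inclusion probability pi_k against its Beta(a_k, b_k) prior:
    p(gamma_k) = Beta(a_k + |gamma_k|, b_k + p_k - |gamma_k|) / Beta(a_k, b_k)."""
    p = len(gamma)
    s = sum(gamma)  # |gamma_k|, the number of included predictors
    return log_beta(a_k + s, b_k + p - s) - log_beta(a_k, b_k)
```

Summing the prior over all 2^pk binary vectors gives one, and with ak < bk a model with few included predictors is a priori favoured over a saturated one, which is the sparsity mechanism used by the M-H step of Section 3.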
The sparsity prior (5) is a common choice in single-response BVS as well as in the sparse SUR model (Wang, 2010). When the predictors are common to all responses, the primary inferential question may shift to the identification of key predictors that exert their influence on a large fraction of responses at the same time. When a large number of responses are regressed independently on the same set of predictors, Richardson et al. (2010) proposed to decompose the a priori “cell” inclusion probability into its marginal effects, i.e., πkj = πk × ρj, πkj ∈ [0, 1]. The idea behind this decomposition is to control the level of sparsity for each response k through a suitable choice of the hyperparameters ak and bk as in (5), while ρj ∼ Ga(c, d) captures the relative “propensity” of predictor j to influence several responses at the same time. For a detailed discussion regarding the effect of the prior p(ρj) on BVS in multiple-response regression models, see Ruffieux et al. (2020).
In the following, for ease of notation, we denote by βγk the vector of all non-zero elements of βk and, analogously, we indicate by xik,γk the elements of xik corresponding to those elements of γk equal to 1. Let Xk be the n × pk matrix of all available predictors for the kth response. Similarly to the vector case, Xγk denotes the n × |γk| matrix in which the columns are selected according to the latent binary vector γk. Finally, we refer to Γ = (γ1, …, γm) as the latent binary matrix.
2.3.2. Correlation matrix R and adjacency matrix G
To specify a prior distribution on the correlation matrix R, we follow Talhouk et al. (2012) and use the parameter expansion for data augmentation strategy (Liu and Wu, 1999) to expand the correlation matrix into a covariance matrix. This choice allows us to specify conjugate prior distributions on the resulting covariance matrix. In particular, we utilize the hyper-inverse Wishart (Dawid and Lauritzen, 1993) to allow for zero entries in the inverse of the covariance matrix. Details can be summarized as follows.
First, to expand R into a covariance matrix, we define the transformation W = Z̃D, where Z̃ = (z̃1, …, z̃n)T is the n × m matrix of the Gaussian latent variables and D is an m × m diagonal matrix with elements δk, k = 1, …, m. Then,
W | Σ ∼ MNn×m(0, In, Σ), | (7) |
where Σ = DRD and In is a diagonal matrix of dimension n that encodes the independence assumption amongst the observations. Then, a conjugate prior distribution can be assigned on Σ and updated at each iteration of the MCMC algorithm before it is projected back to R using the inverse transformation R = D−1ΣD−1. We also utilize the theory of decomposable models to perform a conjugate analysis of the covariance structure of the model since the hyper-inverse Wishart distribution is a conjugate prior distribution for the covariance matrix Σ with respect to the adjacency matrix G of a decomposable graph G. The diagonal elements in the adjacency matrix are always restricted to be 1 to ensure the positive definiteness of G.
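The expansion Σ = DRD and its inverse R = D−1ΣD−1 amount to elementwise rescaling by the δk. A minimal stdlib-only Python sketch of the two maps (function names are ours):

```python
import math

def corr_to_cov(R, delta):
    """Sigma = D R D with D = diag(delta): sigma_ij = delta_i * r_ij * delta_j."""
    m = len(R)
    return [[delta[i] * R[i][j] * delta[j] for j in range(m)] for i in range(m)]

def cov_to_corr(Sigma):
    """Inverse map R = D^{-1} Sigma D^{-1} with delta_k = sqrt(sigma_kk)."""
    m = len(Sigma)
    d = [math.sqrt(Sigma[k][k]) for k in range(m)]
    return [[Sigma[i][j] / (d[i] * d[j]) for j in range(m)] for i in range(m)]
```

The round trip `cov_to_corr(corr_to_cov(R, delta))` recovers R with unit diagonal for any positive delta, which is why the MCMC sampler can work with the unconstrained covariance Σ and project back to the identifiable correlation R at each iteration.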
Second, we assign the following prior structure on G and D. Let Gkk′, 1 ≤ k′ < k ≤ m, be the binary indicator for the presence of the (k, k′)th off-diagonal edge in the lower triangular part of the adjacency matrix G of the decomposable graph 𝒢. We assume that
| (8) |
For a detailed discussion regarding compatible priors on G for decomposable graphs, see Bornn et al. (2011).
We denote by p(G) the induced marginal prior on the adjacency matrix and define the joint distribution of D, R and G as p(D, R, G) = p(D | R, G)p(R | G)p(G), where
δk2 | R, G ∼ Inv-Gamma((m + 1)/2, rkk/2), k = 1, …, m, | (9) |
with rkk the kth diagonal element of R−1. Assuming a uniform prior for R|G, it can be shown that, for a decomposable graph 𝒢, the implied prior on Σ is hyper-inverse Wishart (Talhouk et al., 2012).
3. MCMC sampling strategy
We are interested in sampling from the joint posterior distribution p(B, Γ, Θ, Z̃, D, R, G | Y). To draw samples from the specified model, we design a novel MCMC algorithm which proceeds as follows. We first update the regression coefficients B, the latent binary matrix Γ, the parameters Θ and the Gaussian latent variables Z̃ in m blocks. In particular, for each k = 1, …, m, we sample jointly (βk, γk), then we update θk and, finally, we draw realizations z̃k of the Gaussian latent variables. In the last step, we draw the correlation matrix R and the adjacency matrix G from their full conditional distributions.
Algorithm 1 presents the pseudo code of the designed MCMC algorithm where the subscript “−k” implies that the corresponding matrix (or vector) consists of all the elements except those that are related to the kth response.
For the large majority of responses, drawing (βk, γk) in step 4 is complicated due to the unavailability of the full conditional distributions in closed-form expression and a M-H step is therefore required. However, because of the uncertainty associated with the latent binary vector γk, the proposal distribution requires a different dimension at each MCMC iteration. In this framework, a commonly used tool is the Reversible Jump algorithm (Green, 1995), although it may experience a low acceptance rate, resulting in a MCMC sampler with poor mixing (Brooks et al., 2003; Lamnisos et al., 2009).
Algorithm 1 MCMC algorithm for sampling from the joint distribution p(B, Γ, Θ, Z̃, D, R, G | Y).
1: Set the number of iterations S.
2: for s = 1, …, S do
3: for k = 1, …, m do
4: Sample (βk, γk) from p(βk, γk | yk, θk, Z̃−k, D, R)
5: Sample θk from p(θk | yk, βk, γk, Z̃−k, D, R)
6: Sample z̃k from p(z̃k | yk, βk, γk, θk, Z̃−k, D, R)
7: end for
8: Sample (D, R) from p(D, R | Y, Z̃, G)
9: Sample G from p(G | Y, Z̃, D, R)
10: end for
To avoid the Reversible Jump algorithm, we design a proposal distribution defined on a fixed-dimensional space that can be used for the joint update of βk and γk. It is worth noticing that we conduct the update of (βk, γk) by first integrating out the latent variables z̃k. This accelerates the convergence of the proposed MCMC algorithm since conditioning on z̃k induces many restrictions on the admissible values of (βk, γk), see Pitt et al. (2006). Drawing samples from the posterior distributions of the remaining parameters and the Gaussian latent variables of the model can be conducted by using standard MCMC algorithms which we also describe briefly in this section.
3.1. Proposal distribution for variable selection
To sample jointly the regression coefficients βk and the latent binary vector γk for each k = 1, …, m (step 4 Algorithm 1), we design a M-H step that targets the distribution with density
p(βk, γk | yk, θk, z̃−k, D, R) ∝ p(yk | βk, γk, θk, z̃−k, D, R)p(βk | γk)p(γk), | (10) |
where p(βk|γk) and p(γk) are the prior densities on βk and γk as defined in eq. (4) and eq. (6), respectively.
In the case of a continuous response yk, we have from eq. (2) that
p(yk | βk, γk, θk, z̃−k, D, R) = ∏i=1n φ(z̃ik; μik, ψk2)fk(yik; xik, βk, θk)/φ(z̃ik), | (11) |
where we condition on z̃k with an abuse of notation (Pitt et al., 2006) since z̃ik = Φ−1(Fk(yik; xik, βk, θk)) is a shorter expression of the transformation in eq. (3), and fk(·; xik, βk, θk) denotes the probability density function of Yik.
If yk is a discrete response, we have that
p(yk | βk, γk, θk, z̃−k, D, R) = ∏i=1n {Φ((uik − μik)/ψk) − Φ((lik − μik)/ψk)}, | (12) |
where lik = Φ−1(Fk(yik − 1; xik, βk, θk)) and uik = Φ−1(Fk(yik; xik, βk, θk)), with μik and ψk2 the conditional mean and variance of Z̃ik given z̃i,−k. Thus, the conditional density in eq. (10) will be intractable for the majority of the continuous and for all the discrete distributions that can be used for the marginal modeling of the kth response.
Instead of relying on the Reversible Jump algorithm or the Laplace approximation to sample from eq. (10), we utilize a M-H step with a fixed-dimensional proposal distribution obtained by re-parameterizing the Gaussian latent variables z̃ik. More precisely, for each i = 1, …, n and k = 1, …, m, we set
Zik = xikTβk + σkZ̃ik, | (13) |
where Zik is marginally normally distributed with mean xikTβk and variance σk2 equal to θk (if continuous) or 1 (if discrete). Thus, the vector of realizations zk is a sufficient statistic for (βk, γk), whereas z̃k is ancillary (Yu and Meng, 2011). Irrespective of the type of the response, eq. (13) creates a link between the Gaussian Copula model, where the Gaussian latent variables are non-centered, and BVS that requires a centered parametrization. With this transformation, we also aim to keep the advantages of using (data augmentation) centered auxiliary variables when performing BVS (Holmes and Held, 2006). In particular, the fact that the vector zk retains information about the likelihood is key since, regardless of the response’s type, it allows an effective update of βk given a change in the predictors set.
To construct a proposal distribution for a M-H step that targets eq. (10), we replace the intractable density p(yk | βk, γk, θk, z̃−k, R) with the Gaussian density of the latent variable Zk conditioned on z̃−k. Exploiting the fact that βk is quadratic in p(zk | βk, γk, z̃−k, D, R), given a candidate value γk* of the latent binary vector, we propose the non-zero regression coefficients from the distribution with density
| (14) |
where ψk2 is the diagonal element of the conditional covariance matrix of Z̃k given z̃−k and μ̃k is the corresponding conditional mean. The proposal distribution in eq. (14) has two important features: (i) in the covariance matrix, the precision term is rescaled by the conditional variance ψk2 and (ii) the proposal mean is shifted by the rescaled conditional mean σkμ̃k, where the factor σk brings zk and z̃k on the same scale. Thus, the realizations zk of the Gaussian latent variables in eq. (14) are recentered to remove the effect of conditioning on z̃−k, i.e., zk − σkμ̃k, see Supp. Mat. Section S.3 for details.
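The conditional mean and variance of one Gaussian score given the others, which enter the proposal distribution above, follow from standard multivariate-Gaussian conditioning: μ = Rk,−kR−k,−k−1z−k and the variance is 1 − Rk,−kR−k,−k−1R−k,k. A stdlib-only Python sketch (the helper names `solve` and `conditional_normal` are ours, not from the paper):

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def conditional_normal(R, k, z_minus_k):
    """Mean and variance of Z_k | Z_{-k} = z_{-k} for Z ~ N_m(0, R)."""
    idx = [j for j in range(len(R)) if j != k]
    A = [[R[i][j] for j in idx] for i in idx]   # R_{-k,-k}
    b = [R[k][j] for j in idx]                  # R_{k,-k}
    w = solve(A, b)                             # R_{-k,-k}^{-1} R_{-k,k}
    mu = sum(wi * zi for wi, zi in zip(w, z_minus_k))
    var = R[k][k] - sum(wi * bi for wi, bi in zip(w, b))
    return mu, var
```

In the bivariate case with correlation r the formulas reduce to the familiar μ = rz2 and variance 1 − r2, which is also a quick sanity check for the code.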
Remark 1
If R is the identity matrix, then the proposal distribution (14) becomes
| (15) |
since the conditional variance is equal to one and the conditional mean is equal to zero. When no information is shared across responses, the proposal distribution in eq. (15) coincides with the full conditional distribution of the non-zero regression coefficients used in the single-response Probit regression model (Albert and Chib, 1993; Holmes and Held, 2006). It is worth noticing that, regardless of the similarities with the Probit model, this proposal density has been obtained as a special case of eq. (14).
Remark 2
For some continuous and discrete data, including Gaussian, binary, ordered and nominal categorical responses, the proposal density in eq. (14) allows the “implicit marginalisation” of the regression coefficients when the joint update of (βk, γk) in the M-H step is performed. For these types of responses, the acceptance probability α of the joint move is
| (16) |
Similarly to Holmes and Held (2006), in eq. (16) the current and proposed values of the regression coefficients (βk, βγk*) do not appear. For details, see Supp. Mat. Section S.4.
Finally, since the main focus of this paper is to provide a proposal distribution for the regression coefficients given the latent binary vector, any proposal distribution for γk* can be used. Here, we use a modified version of the proposal distribution designed by Guan and Stephens (2011). This proposal makes efficient use of the inexpensive evaluation of the marginal association between each response and the predictors in order to propose a new latent vector γk*. See Supp. Mat. Section S.2 for a detailed description.
3.2. Sampling the Gaussian latent variables
Next, we describe the sampling strategy for the Gaussian latent variables z̃k (step 6 Algorithm 1). If the kth response is continuous then the transformation (3) is a one-to-one transformation. In this case z̃ik is updated deterministically by setting z̃ik = Φ−1(Fk(yik; xik, βk, θk)) for each i = 1, …, n and k = 1, …, m. For a discrete response yk, we have that
p(z̃ik | yik, βk, γk, θk, z̃i,−k, D, R) ∝ φ(z̃ik; μik, ψk2)1(lik < z̃ik ≤ uik), | (17) |
where lik, uik, μik and ψk2 are defined in eq. (12). Therefore, each z̃ik has to be sampled from the Gaussian distribution truncated on the interval (lik, uik].
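A draw from a univariate Gaussian truncated to an interval can be obtained by inverting the cdf on the corresponding uniform sub-interval. A stdlib-only Python sketch of this standard device (the function name is ours; the inverse-cdf approach is adequate away from extreme tails but is not the paper's implementation):

```python
import random
from statistics import NormalDist

def rtruncnorm(mu, sigma, lower, upper, rng=random):
    """Draw from N(mu, sigma^2) truncated to (lower, upper) by mapping a
    uniform draw on (Phi(a), Phi(b)) back through the inverse cdf."""
    nd = NormalDist()
    a = nd.cdf((lower - mu) / sigma)
    b = nd.cdf((upper - mu) / sigma)
    u = rng.uniform(a, b)          # uniform on the truncated probability range
    return mu + sigma * nd.inv_cdf(u)
```

In step 6 of Algorithm 1 this would be applied with the conditional mean and variance of the latent score and the bounds implied by the observed discrete value.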
3.3. Sampling the matrix D, the correlation matrix R and the adjacency matrix G
To update (D, R) and G in the designed MCMC Algorithm 1 (steps 8 and 9), we follow Talhouk et al. (2012) and work in the space of the scaled Gaussian latent variables W = Z̃D. In practice, we obtain samples from p(Σ | W, G) and transform them using the inverse transformation R = D−1ΣD−1 where D is sampled from its prior distribution in eq. (9). To sample the adjacency matrix G from p(G | Y, Z̃, D, R), we target the distribution with density
p(G | W) = p(W | G)p(G) / Σ𝒢′ p(W | G′)p(G′), | (18) |
where the summation in the denominator is over all the decomposable graphs 𝒢′, p(G) is defined through eq. (8) and the marginal likelihood p(W | G) can be computed analytically due to the tractability of the hyper-inverse Wishart distribution, see Supp. Mat. Section S.5 for further details. We sample from eq. (18) by using a M-H step in which, conditionally on the current adjacency matrix G, a new graph is proposed by adding or deleting an edge between two vertices whose indices have been chosen randomly amongst the vertices that belong to a decomposable graph. The proposed graph is then accepted or rejected using the accept/reject mechanism of the M-H step which targets the density in eq. (18). Finally, conditionally on W and the updated adjacency matrix G, we sample Σ from its conditional distribution p(Σ | W, G).
Remark 3
The choice of a decomposable graphical model can be relaxed to include non-decomposable graphs, although at a higher computational cost. Within our framework, posterior samples of the adjacency matrix can be obtained by using the BDgraph algorithm proposed by Mohammadi and Wit (2019) directly on the space of the scaled Gaussian latent variables, where the problem of the intractable normalizing constants that appear for non-decomposable graphs is circumvented by using the solutions proposed by Wang and Li (2012) and Lenkoski (2013). As a note of caution, by specifying a non-decomposable graphical model (all else unchanged), the induced prior on R for the incomplete prime components of the graph may not be uniform. See Supp. Mat. Section S.5 for a detailed discussion.
3.4. Sampling the response-specific parameters
Sampling the response-specific parameters Θ depends on the marginal cdfs of the BVSGCR model. If Fk(·) is the cdf of a normal distribution, the posterior samples of θk are obtained as a by-product of the procedure described above. If the margins are ordinal or negative binomial, as in the real examples considered below, the response-specific parameters are sampled using a M-H step.
4. Simulation study
In this section, we compare the performance of the proposed Bayesian variable selection for Gaussian Copula Regression (BVSGCR) model with widely-used methods for BVS in single-response linear and non-linear regression models. We tested our multivariate method in two simulated data sets consisting of a combination of Gaussian, binary and ordinal responses (Section 4.2) and of Gaussian and count responses (Section 4.3), respectively.
We used the marginal posterior probability of inclusion (MPPI) (George and McCulloch, 1997) to assess the predictor-response association, defined as the frequency with which a particular predictor is included in a model during the MCMC exploration. To illustrate the performance of the different methods, we used the Receiver Operating Characteristic (ROC) curve. For a given response, the ROC curve plots the proportion of correctly detected important predictors (true positive rate - TPR) against the proportion of misidentified predictors (false positive rate - FPR) over a range of specified thresholds for the MPPI. To take into account the Monte Carlo error, we reported the mean of TPR and FPR over the simulated replicates for each scenario considered, along with the corresponding averaged areas under the curve and their standard deviations. We also assessed the accuracy of the non-zero regression coefficients’ posterior credible intervals by using a modified version of the interval score described in Gneiting and Raftery (2007).
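The MPPI is simply the column mean of the binary inclusion indicators over the retained MCMC draws. A minimal Python sketch (the function name is ours):

```python
def mppi(inclusion_draws):
    """Marginal posterior probability of inclusion: column means of the
    S x p binary matrix whose rows are the sampled gamma_k vectors."""
    S = len(inclusion_draws)
    p = len(inclusion_draws[0])
    return [sum(draw[j] for draw in inclusion_draws) / S for j in range(p)]
```

Thresholding the resulting probabilities at a grid of values yields the (TPR, FPR) pairs that trace the ROC curves reported in this section.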
4.1. Data generation
To generate the correlated predictors, we followed Rothman et al. (2010) and simulated, for each i = 1, …, n and k = 1, …, m, the vector xik from a pk-variate Gaussian distribution with zero mean and covariance matrix S whose (j, j′)th element, j, j′ = 1, …, pk, implies the same unit marginal variance. We also assumed that, for all responses, we had the same set of available predictors.
We simulated a sparse vector of regression coefficients B and a sparse inverse correlation matrix R−1 according to the structure described in Section 2.3. More precisely, we first constructed the p × m matrix B = Γ1 ⊙ Γ2 ⊙ B3, where Γ1 and Γ2 are binary p × m matrices, B3 is a real p × m matrix and ⊙ denotes the Hadamard (element-wise) matrix product, as follows. Γ1 has independent Bernoulli entries with success probability π1, Γ2 has rows that are either all ones or all zeros and B3 consists of independent draws from N(b, s2). The decision regarding the zero rows in Γ2 is made using p independent Bernoulli variables with probability of success π2. As noted by Rothman et al. (2010), using this simulation scheme, (1 − π2)p predictors are expected to be irrelevant for all the responses and each relevant predictor will be associated on average with π1m responses. We set βk to be the kth column of B. The choice of the parameters π1, π2, b and s2 is different for each simulated scenario and is summarized in Table 1. Finally, we used the correlation matrix of the autoregressive model of order one to simulate z̃i for each i = 1, …, n, setting the (k, k′)th element of R, k, k′ = 1, …, m, accordingly, which implies a tri-diagonal sparse inverse correlation matrix. Supp. Mat. Figure S.1 shows the graphs implied by the non-zero pattern of R−1 used in the simulation study.
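The simulation scheme of Rothman et al. (2010) described above can be sketched in a few lines of stdlib Python (the function name and fixed seed are ours, for illustration):

```python
import random

def simulate_B(p, m, pi1, pi2, b, s2, seed=0):
    """Sparse p x m coefficient matrix B = G1 * G2 * B3 (entrywise):
    G1 has iid Bernoulli(pi1) entries, G2 has all-one or all-zero rows
    chosen with probability pi2, and B3 has iid N(b, s2) entries."""
    rng = random.Random(seed)
    row_ind = [int(rng.random() < pi2) for _ in range(p)]  # rows of G2
    B = [[0.0] * m for _ in range(p)]
    for j in range(p):
        for k in range(m):
            g1 = int(rng.random() < pi1)
            B[j][k] = g1 * row_ind[j] * rng.gauss(b, s2 ** 0.5)
    return B, row_ind
```

As in the text, on average (1 − π2)p rows of B are entirely zero (predictors irrelevant for all responses), while each remaining predictor is associated with roughly π1m responses.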
Table 1. Values of the parameters used to simulate the predictors and response variables described in Section 4.1 and 4.2.
| | Response | n | pk | π1 | π2 | b | s2 |
|---|---|---|---|---|---|---|---|
| Scenario I & III | Gaussian | 50 | 30 | 0.15 | 0.95 | 1 | 1 |
| | Discrete | | | | | 0.5 | 0.2 |
| Scenario II & IV | Gaussian | 100 | 100 | 0.05 | 0.95 | 1 | 1 |
| | Discrete | | | | | 0.5 | 0.2 |
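The construction of the sparse coefficient matrix described in Section 4.1 can be sketched as follows (a Python illustration of the stated scheme; the random seed and function name are our own):

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed

def simulate_B(p, m, pi1, pi2, b, s2):
    """Sparse p x m coefficient matrix B = G1 * G2 * B3 (Hadamard products),
    following the simulation scheme of Rothman et al. (2010) described above."""
    G1 = rng.binomial(1, pi1, size=(p, m))        # entry-wise inclusion
    rows = rng.binomial(1, pi2, size=p)           # which rows are non-zero
    G2 = np.repeat(rows[:, None], m, axis=1)      # rows all ones or all zeros
    B3 = rng.normal(b, np.sqrt(s2), size=(p, m))  # effect sizes
    return G1 * G2 * B3

# Scenario I parameter values from Table 1
B = simulate_B(p=30, m=6, pi1=0.15, pi2=0.95, b=1.0, s2=1.0)
```

The kth column of the returned matrix gives βk, and on average (1 − π2)p rows are identically zero.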
4.2. Mixed Gaussian, binary and ordinal responses
We tested the proposed model in two different scenarios. Both scenarios consist of m = 6 responses with three Gaussian, one binary and two ordinal (with three and four categories) variables. In Scenario II, we generated 20 replicates with n = 100 samples and pk = 100 predictors, k = 1, …, m, whereas in Scenario I we simulated the same number of replicates with n = 50 and pk = 30. We constructed the sparse vector of regression coefficients as described in Section 4.1, setting the values of the parameters π1, π2, b and s2 as shown in Table 1. With this choice of π1 and π2, in each scenario we simulated on average between four and five predictors associated with each response, with a small probability of the same predictors being shared across responses since π1m < 1.
To simulate realizations from the correlated responses, for k = 1, 2, 3, we set Fk(·; xik, βk, θk) to be the cdf of the Gaussian distribution with mean depending on xik and βk and variance θk = 3. For k = 4, we set Fk(·; xik, βk, θk) to be the cdf of the Bernoulli distribution with probability of success depending on xik and βk and θk = 0, and for k = 5, 6, we specified Fk(·; xik, βk, θk) so as to simulate ordinal responses with C5 = 3 and C6 = 4 categories and cut-points θkc, c = 1, …, Ck − 1, drawn from a Unif(0, 1) and a Unif(1, 2), respectively.
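A simplified sketch of this copula-based generation of mixed responses is given below. The regression means are set to zero, the Bernoulli success probability to 0.5 and the correlation value ρ = 0.7 is purely illustrative, since these values are not reported in this excerpt; only the general mechanism (latent Gaussian draw, probability integral transform, marginal quantile functions) follows the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def ar1_corr(m, rho=0.7):
    # AR(1)-type correlation matrix; rho = 0.7 is illustrative
    idx = np.arange(m)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate_mixed(n, R, gauss_means, cuts):
    """Mixed Gaussian/binary/ordinal responses through a Gaussian copula:
    z ~ N(0, R), u = Phi(z), y_k = F_k^{-1}(u_k) for each margin."""
    z = rng.multivariate_normal(np.zeros(R.shape[0]), R, size=n)
    u = stats.norm.cdf(z)
    y = np.empty_like(z)
    # three Gaussian margins with variance 3, as in the text
    y[:, :3] = stats.norm.ppf(u[:, :3], loc=gauss_means, scale=np.sqrt(3.0))
    # one Bernoulli margin; success probability 0.5 here for illustration only
    y[:, 3] = (u[:, 3] > 0.5).astype(float)
    # two ordinal margins: category = number of cut-point cdf values below u
    for k, theta in zip((4, 5), cuts):
        y[:, k] = np.searchsorted(stats.norm.cdf(theta), u[:, k])
    return y

R = ar1_corr(6)
y = simulate_mixed(200, R, np.zeros((200, 3)),
                   cuts=(np.array([0.3, 0.8]), np.array([1.0, 1.4, 1.9])))
```

The two cut-point vectors give C5 = 3 and C6 = 4 ordered categories, matching the simulated design.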
We used the MCMC sampler presented in Algorithm 1 to obtain posterior samples of the parameters and the latent variables of the BVSGCR model, as well as to compute the MPPI for each predictor-response association. To estimate the parameters and the corresponding MPPIs for the single-response regression models, we employed widely-used MCMC algorithms for sparse linear Gaussian (George and McCulloch, 1997) and non-linear (Holmes and Held, 2006) regression models with the same proposal distribution for the selection of important predictors as described in Section 3.1. In all MCMC algorithms we chose the hyper-parameters ak and bk in eq. (5) following Kohn et al. (2001), where the mean and the variance of the beta distribution are matched with the a priori expected number of important predictors associated with each response (E(γk) = 5) and its variance (Var(γk) = 9). We also set υ = 1 in eq. (4) since all predictors have been simulated with the same unit marginal variance. Finally, Supp. Mat. Section S.1 presents the prior distribution on the response-specific parameters θk in the Gaussian and ordinal cases. We ran each MCMC algorithm for 30,000 iterations, using the first 10,000 as a burn-in period and storing the output every 20 iterations to obtain 1,000 posterior samples for each model.
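The Kohn et al. (2001) moment matching can be made explicit under the common assumption that the inclusion indicators are conditionally Bernoulli with a beta-distributed inclusion probability, so that the prior number of included predictors is beta-binomial. Under that assumption (the exact prior is in eq. (5), not reproduced here), the matching has a closed form:

```python
def match_hyperparams(p, target_mean, target_var):
    """Closed-form moment matching: with gamma_j | w ~ Bernoulli(w) and
    w ~ Beta(a, b), the number of included predictors is beta-binomial with
    mean p*pi and variance p*pi*(1 - pi)*(1 + (p - 1)*rho), where
    pi = a/(a + b) and rho = 1/(a + b + 1)."""
    pi = target_mean / p
    binom_var = p * pi * (1.0 - pi)          # variance if w were fixed at pi
    inflation = target_var / binom_var       # must exceed 1 for a valid prior
    s = (p - 1.0) / (inflation - 1.0) - 1.0  # s = a + b
    return pi * s, (1.0 - pi) * s

# E(#included) = 5 and Var(#included) = 9 with p = 30 gives (a, b) = (4, 20)
ak, bk = match_hyperparams(p=30, target_mean=5.0, target_var=9.0)
```

The requested variance must lie between the binomial variance (a + b → ∞) and the overdispersed limit (a + b → 0) for the solution to be admissible.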
Table 2 displays the average area under the ROC curve (standard errors in brackets) for both Scenario I and II, whereas Supp. Mat. Figure S.2 presents the average (over 20 replicates) ROC curves for each of the m = 6 responses simulated in Scenario II. Taken together, they indicate that for the analysis of correlated Gaussian, binary and ordinal responses, the BVSGCR model achieves a higher sensitivity for the Gaussian and ordinal variables, and for the latter irrespective of the number of simulated categories. Similarly to the sparse SUR model with covariance selection, this is due to the ability of our model to account for the correlation between the responses, which can otherwise induce false positive results when the responses are analyzed only marginally. In addition, the proposal distribution for the non-zero regression coefficients in eq. (14) is tailored to take advantage of the estimated sparse inverse correlation structure, resulting in a more efficient algorithm for BVS.
Table 2.
Area under the ROC curves for the BVSGCR model and for independent single-response regression models in the simulated Scenario I and II. Results are averaged over 20 replicates with standard errors in brackets. Within each response, the best performance is highlighted in bold.
| Response | Regression model | Scenario I | Scenario II |
|---|---|---|---|
| n = 50 & pk = 30 | n = 100 & pk = 100 | ||
| Gaussian | BVSGCR | 0.86 (0.08) | 0.88 (0.09) |
| Single-response | 0.77 (0.09) | 0.78 (0.10) | |
| Gaussian | BVSGCR | 0.82 (0.10) | 0.90 (0.07) |
| Single-response | 0.78 (0.11) | 0.81 (0.12) | |
| Gaussian | BVSGCR | 0.80 (0.13) | 0.91 (0.06) |
| Single-response | 0.72 (0.13) | 0.82 (0.10) | |
| Binary | BVSGCR | 0.83 (0.10) | 0.81 (0.07) |
| Single-response | 0.83 (0.10) | 0.81 (0.08) | |
| Ordinal (3 categories) | BVSGCR | 0.76 (0.10) | 0.82 (0.07) |
| Single-response | 0.74 (0.09) | 0.77 (0.09) | |
| Ordinal (4 categories) | BVSGCR | 0.85 (0.07) | 0.87 (0.07) |
| Single-response | 0.77 (0.09) | 0.82 (0.07) |
For the binary response, the performance of the BVSGCR model and the single-response regression model is almost identical. This is in keeping with the findings of Chib and Greenberg (1998) and Talhouk et al. (2012) that the estimates of the regression coefficients in multi-response Probit regression models are robust to the specification of the correlation structure, including the case R = Im which corresponds, in our framework, to the single-response Probit model.
We also investigated the effect of the “implicit marginalisation” of the regression coefficients in eq. (16) when the joint update of (βk, γk) in the M-H step is performed. To do so, we used the proposal density in eq. (15) that does not account for the correlation between responses and, more importantly, does not allow the marginalization of the regression coefficients in the M-H step when Gaussian, binary and ordinal marginal distributions are jointly considered. Supp. Mat. Figure S.4 shows that a better performance is achieved across all responses, and in particular in the Gaussian and ordinal cases, when the proposal density in eq. (14) is used.
To assess the effect of the covariance selection procedure, we present in Supp. Mat. Figure S.6 the ROC curves obtained by a specialized version of the proposed algorithm that does not allow any element of the inverse correlation matrix to be identically zero. Interestingly, the displayed ROC curves suggest that the Gaussian graphical model for covariance selection is crucial for an efficient identification of the important predictors. In particular, for the subset of Gaussian responses, the BVSGCR model with full R−1 is preferable to single-response linear regression models but it is not better than a model with sparse R−1. More importantly, in the case of discrete data, single-response regression models perform better in variable selection than the BVSGCR model when R−1 is a full matrix. A closer inspection of the MCMC output reveals that the selection of important predictors is affected by the difficulty of estimating the inverse correlation matrix when the sample size is small and a full R−1 is enforced. With a larger sample size (n = 1,000, data not shown), results are less affected by the specification of the covariance structure.
Finally, we also evaluated the estimation of the regression coefficients obtained by the BVSGCR model and compared it with single-response regression models by using a modified version of the scoring rule presented in Gneiting and Raftery (2007). In our set-up, the interval score rewards narrow posterior credible intervals and incurs a penalty, which depends on the significance level of the interval, if the simulated non-zero regression coefficient is not covered; see also Supp. Mat. Section S.7 for its formal definition. Figure 1 displays the boxplots (over 20 replicates) of the average interval scores for the 95% credible intervals of the non-zero simulated regression coefficients in Scenario II for the BVSGCR model and single-response regression models. It is apparent that by using the proposed model, we obtained a more accurate estimation of the non-zero regression coefficients for all the responses except, unsurprisingly, for the binary case.
Figure 1.
Boxplot (over 20 replicates) of the average interval score for the 95% credible intervals of the non-zero simulated regression coefficients obtained by the BVSGCR model and by single-response regression models in the simulated Scenario II.
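For reference, the (unmodified) interval score of Gneiting and Raftery (2007) for a central (1 − α) interval can be computed as below; the paper's modification is defined in Supp. Mat. Section S.7 and is not reproduced here.

```python
import numpy as np

def interval_score(lower, upper, x, alpha=0.05):
    """Interval score of Gneiting and Raftery (2007) for a central
    (1 - alpha) credible interval [lower, upper] evaluated at the true
    value x: the width plus 2/alpha times the amount by which x falls
    outside the interval. Lower scores are better."""
    lower, upper, x = map(np.asarray, (lower, upper, x))
    width = upper - lower
    below = (2.0 / alpha) * np.clip(lower - x, 0.0, None)
    above = (2.0 / alpha) * np.clip(x - upper, 0.0, None)
    return width + below + above
```

With alpha = 0.05, a 95% interval that misses the true value is penalised by 40 times the distance to the nearest endpoint, so narrow but poorly calibrated intervals score badly.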
4.3. Mixed Gaussian and count responses
In this section, we present the results of applying the BVSGCR model to a simulated experiment in which the responses consist of a combination of one Gaussian and three count responses. We followed the same strategy described in Section 4.1 to generate the set of correlated predictors and the sparse vector of regression coefficients, choosing the parameters π1, π2, b and s2 as described in Table 1 for the Gaussian and discrete responses. We also used the same correlation matrix R as described in Section 4.1.
We considered two different scenarios. Both consist of m = 4 responses with one Gaussian, two negative-binomial and one binomial variable. In Scenario IV, we generated 20 replicates with n = 100 samples and pk = 100 predictors, k = 1, …, m, whereas in Scenario III we simulated the same number of replicates with n = 50 and pk = 30. Using eq. (3), we simulated the responses by specifying F1(·; xi1, β1, θ1) to be the cdf of the Gaussian distribution with mean depending on xi1 and β1 and variance θ1 = 1; for k = 2 and k = 3, Fk(·; xik, βk, θk) to be the cdf of the negative binomial distribution with mean depending on xik and βk and θ2 = θ3 = 0.5; and finally F4(·; xi4, β4, θ4) to be the cdf of the binomial distribution with θ4 = 10.
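The generation of mixed Gaussian and count responses through the copula can be sketched as below; the negative-binomial means, the binomial success probability and the correlation value are illustrative choices of ours, while the conversion from a (mean, dispersion) parameterization to scipy's (size, prob) parameterization is standard.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def simulate_counts(n, R, mu_gauss, mu_nb, size_nb=0.5, trials=10, p_binom=0.5):
    """Gaussian-copula draw of one Gaussian, two negative-binomial and one
    binomial response: z ~ N(0, R), u = Phi(z), y_k = F_k^{-1}(u_k).
    mu_nb gives the two negative-binomial means; all values are illustrative."""
    z = rng.multivariate_normal(np.zeros(4), R, size=n)
    u = stats.norm.cdf(z)
    y = np.empty((n, 4))
    y[:, 0] = stats.norm.ppf(u[:, 0], loc=mu_gauss, scale=1.0)
    for k, mu in zip((1, 2), mu_nb):
        # scipy parameterizes nbinom by (size, prob): prob = size / (size + mean)
        y[:, k] = stats.nbinom.ppf(u[:, k], size_nb, size_nb / (size_nb + mu))
    y[:, 3] = stats.binom.ppf(u[:, 3], trials, p_binom)
    return y

idx = np.arange(4)
R = 0.7 ** np.abs(idx[:, None] - idx[None, :])   # illustrative AR(1) correlation
y = simulate_counts(150, R, mu_gauss=0.0, mu_nb=(2.0, 3.0))
```

Applying the discrete quantile functions to the correlated uniforms is what induces the dependence between the count margins.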
To estimate the parameters and the Gaussian latent variables of the BVSGCR model, we utilized the MCMC sampler presented in Algorithm 1. We compared the results with those of the same MCMC algorithm used in Section 4.2 for the Gaussian response. For the count responses, we used the MCMC algorithm based on the representation of the negative binomial and binomial logistic regression models as a Gaussian regression model in auxiliary variables, implemented in the R-package pogit by Dvorzak and Wagner (2016). After setting E(γk) = 5 and Var(γk) = 9, the parameters ak and bk of the beta prior on the probability of predictor-response association were chosen as in Section 4.2 and we matched the moments of the pogit prior specification for γk with these values. Finally, Supp. Mat. Section S.1 presents the prior distributions on the response-specific parameters θk for the Gaussian and negative-binomial responses. We ran the MCMC algorithms for 30,000 iterations, storing the output every 20 iterations after a burn-in period of 10,000 iterations, to obtain 1,000 posterior samples for each model.
Table 3 displays the average area under the ROC curve for both Scenario III and IV, whereas Supp. Mat. Figure S.3 presents the average (over 20 replicates) ROC curves for each of the m = 4 simulated responses in Scenario IV. Overall, it is evident, especially in the simulated Scenario IV, that by employing the proposed model we achieved a better selection of important predictors compared to single-response regression models, apart from the binomial response. Similarly to the binary case, there does not seem to be a clear advantage of the BVSGCR model over single-response non-linear regression models when the marginal distribution is parameterized only by the probability of success.
Table 3.
Area under the ROC curves for the BVSGCR model and for independent single-response regression models in the simulated Scenario III and IV. Results are averaged over 20 replicates with standard errors in brackets. Within each response, the best performance is highlighted in bold.
| Response | Regression model | Scenario III | Scenario IV |
|---|---|---|---|
| n = 50 & pk = 30 | n = 100 & pk = 100 | ||
| Gaussian | BVSGCR | 0.82 (0.13) | 0.86 (0.10) |
| Single-response | 0.80 (0.13) | 0.80 (0.12) | |
| Negative Binomial | BVSGCR | 0.70 (0.16) | 0.76 (0.11) |
| Single-response | 0.68 (0.15) | 0.72 (0.11) | |
| Negative Binomial | BVSGCR | 0.72 (0.14) | 0.76 (0.10) |
| Single-response | 0.70 (0.15) | 0.71 (0.13) | |
| Binomial | BVSGCR | 0.91 (0.10) | 0.91 (0.07) |
| Single-response | 0.91 (0.09) | 0.91 (0.06) |
Figure 2 displays the boxplot (over 20 replicates) of the average interval scores for the 95% credible intervals of the non-zero simulated regression coefficients in Scenario IV for the BVSGCR model and single-response regression models. The boxplots indicate that our model delivers more accurate estimates of the simulated regression coefficients than those obtained by using single-response regression models across all responses, including the binomial case. Thus, while the ROC curve for the binomial response shows that the ranking of the predictors based on the estimated MPPI is the same between the proposed model and the pogit algorithm, BVSGCR attains on average narrower 95% credible intervals.
Figure 2.
Boxplot (over 20 replicates) of the average interval scores for the 95% credible interval of the non-zero simulated regression coefficients obtained by the BVSGCR model and by single-response regression models in the simulated Scenario IV.
For the simulated Scenario IV, Supp. Mat. Figure S.7 compares the performance of the BVSGCR model with its specialized version when R−1 is a full matrix. Interestingly, the performance is almost identical and both are better than single-response regression models, except for the binomial case. A closer look at the MCMC output reveals that, despite a sparse simulated inverse correlation structure and a small sample size (n = 100), the estimation of a full R−1 is still feasible in this scenario with four simulated responses.
There is also another possible explanation of these results, which is apparent in Supp. Mat. Figure S.5. With the exception of the Gaussian outcome, for the binomial and negative-binomial responses there is no “implicit marginalisation” of the regression coefficients in the M-H step. In this case, the proposal distribution in eq. (14) that takes into account the correlation between the responses performs no better than the proposal distribution in eq. (15) that does not make use of this information. It turns out that, when the “implicit marginalisation” is not possible, the acceptance probability of the joint update of (βk, γk) in the M-H step does not seem to take advantage of how accurately R−1 is estimated and used in the proposal distribution.
5. Real data applications
We illustrate the features of the proposed BVSGCR model by applying it to two real data sets, one on Ataxia-Telangiectasia disorder and one on individuals suffering from Temporal Lobe Epilepsy, which are typical examples of data routinely collected in clinical research, where a combination of discrete and/or continuous outcome variables is used to assess patients’ prognosis and disease progression. In the analysis of both real data sets our aim is twofold: (i) the identification of important associations between the outcome variables and the predictors that are either “response-specific” or “shared”, i.e., predictors that are linked with several responses at the same time, and (ii) the estimation of the correlation pattern between the responses in order to shed light on their conditional dependence not explained by the set of predictors considered. Finally, we also assessed the out-of-sample prediction accuracy of the BVSGCR model by employing the method proposed by Vehtari et al. (2017) to conduct approximate leave-one-out (LOO) cross-validation. In particular, we compared the predictive performance of the proposed model with that of single-response regression models by using the R-package loo (Vehtari et al., 2018).
5.1. MCMC details
The two real data sets have missing values in both the responses and the predictors. Missing values (completely at random or at random) are a frequent occurrence in clinical data, since the propensity for a data point to be missing is either completely random or linked to some characteristics of the observed data. To overcome this problem, missing values in the outcome variables were imputed by modifying the MCMC sampler presented in Algorithm 1 as suggested by Zhang et al. (2015) (see Supp. Mat. Section S.6). Missing values in the predictors were imputed using the median of the observed values for each variable. To produce the results presented in this section, we ran the MCMC algorithms for 70,000 iterations. We discarded the first 20,000 as burn-in and then stored the output every 50th iteration in order to obtain 1,000 (thinned) samples from the posterior distributions of interest.
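The median imputation of the missing predictor values is straightforward; a sketch (with NaN encoding missingness) is:

```python
import numpy as np

def impute_median(X):
    """Replace missing predictor values (encoded as NaN) with the per-column
    median of the observed entries, as done before running the MCMC."""
    X = np.asarray(X, dtype=float).copy()
    col_medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_medians[cols]
    return X
```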
5.2. Ataxia-Telangiectasia disorder
5.2.1. Data and BVSGCR model
We applied our model on a data set containing the measurements of 46 individuals suffering from Ataxia-Telangiectasia (A-T) disorder. A-T is a rare neurodegenerative disorder induced by mutations in the ATM gene. Our data set is a subset of a larger multicentric cohort of 57 patients presented in Schon et al. (2019).
The data set includes nine neurological responses (% missing values in brackets) and 13 predictors. In particular, four responses are continuous variables: the Scale for the Assessment and Rating of Ataxia (SARA) score (26%), the Ataxia-Telangiectasia Neurological Examination Scale Toolkit (A-T NEST) score (17%), Age at first Wheelchair use (0%) and Alpha-Fetoprotein (AFP) levels (28%). Two responses are binary variables indicating the presence of Malignancy (0%) and the presence of Peripheral Neuropathy (11%), a term for a group of conditions in which the peripheral nervous system is damaged. Finally, three responses are ordinal variables which measure the overall Severity of the disorder (2%), its Progression (2%) and the Eye Movements of the patients (2%). The set of predictors includes genetic characteristics (Genetic Group; Missense Mutation, i.e., a single-base mutation responsible for the production of a different amino acid from the usual one; number of Mild Mutations; ATM Protein levels; and Chromosomal Radiosensitivity, i.e., whether X-ray exposure induces chromosomal aberrations in individuals with A-T) and immunological characteristics of the patients (immunoglobulins IgM, IgG2, IgG, IgA and IgE, and CD4 and CD8 T-cell counts and CD19 B-cell counts). In addition, we used an intercept term (with a diffuse normal prior centered at zero) and three confounders (age, gender and age of onset) always included in the regression model. Both confounders and predictors were standardized and the continuous variables quantile-transformed before the analysis.
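As an illustration of this pre-processing step, one common reading of “quantile-transformed” is the rank-based inverse-normal transform sketched below; we flag that the exact transform used by the authors is not specified in this excerpt.

```python
import numpy as np
from scipy import stats

def standardise(x):
    """Centre and scale to unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def quantile_normalise(x):
    """Rank-based inverse-normal transform: map average ranks into (0, 1)
    and through the standard normal quantile function."""
    x = np.asarray(x, dtype=float)
    u = stats.rankdata(x) / (len(x) + 1.0)
    return stats.norm.ppf(u)
```

Such a transform makes the Gaussian marginal assumption of the copula model more plausible for skewed continuous variables like AFP levels.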
We modeled the responses by specifying the BVSGCR model as follows. For each i = 1, …, n we set: for k = 1, …, 4, Fk(·; xik, βk, θk) to be the cdf of the Gaussian distribution with mean depending on xik and βk and variance θk; for k = 5, 6, Fk(·; xik, βk, θk) to be the cdf of the Bernoulli distribution with probability of success depending on xik and βk and θk = 0; for k = 7, 8, 9, Fk(·; xik, βk, θk) to be the cdf of an ordinal variable with cut-points θkc and C7 = C9 = 3 and C8 = 4 categories, respectively. We used the prior distributions presented in Section 2.3 and in Supp. Mat. Section S.1. Finally, we set E(γk) = 3 and Var(γk) = 2 for all k. These values imply a priori a range of associations for each response between 0 and 8.
5.2.2. Results
Figure 3 displays the estimated MPPI for each predictor-response pair. Despite the small sample size, strong associations are detected for the SARA score, the A-T NEST score, AFP levels, Peripheral Neuropathy and Eye Movements, and some evidence of association for Malignancy, Severity and Progression. Amongst the genetic predictors, Missense Mutation seems to play an important role in predicting the disease status and its progression (Eye Movements, Severity and Progression) as well as Malignancy and Peripheral Neuropathy (Schon et al., 2019). The number of Mild Mutations appears to influence the A-T NEST score and AFP levels. Regarding the role of the immune system, it is well documented that patients suffering from A-T often have a weakened defence mechanism. We confirm this clinical finding and, in contrast to the genetic risk factors, the immunological predictors seem to be more “response-specific”.
Figure 3.
Detection of important predictors for the neurological responses in the A-T data set. Marginal posterior probabilities of inclusion (MPPI) measure the strength of the predictor-response association.
Given the small sample size and (potentially important) unmeasured covariates, we do not expect the genetic and immunological predictors to explain entirely the variability of the responses and their covariation. Therefore, it is important to model any source of extra variability that may induce false positive associations. Figure 4 shows the conditional dependence structure of the responses estimated by the BVSGCR algorithm with a decomposable model specification. Disease status and its progression (Severity and Progression) are closely linked with the A-T NEST score and Age at first Wheelchair use, the latter two also being strongly related to each other. Interestingly, Severity and Progression seem to capture different aspects of the disease since they are almost conditionally independent once the effect of Missense Mutation is accounted for (see Figure 3). The A-T NEST score is also important in predicting the level of the SARA score and Eye Movements. Finally, the SARA score seems to be a good proxy for Peripheral Neuropathy.
Figure 4.
Graph implied by the non-zero pattern of R−1 in the A-T data set and estimated by using the edge posterior probabilities of inclusion (EPPI) obtained by specifying a decomposable graphical model. Gray scale and edge thickness specify different levels of the EPPI (black thick line indicates large EPPI values).
We have also checked whether the assumption of decomposability is supported by the data. When a non-decomposable graphical model is specified, we found that the number of edges with a non-zero posterior probability of inclusion is higher than in the decomposable case, although there is a good agreement between the two instances of the BVSGCR algorithm regarding the most important edges, see Supp. Mat. Figure S.13. Interestingly, the posterior mass is almost equally split between the two graphical model specifications, with a small advantage for the non-decomposable case (53%). The full results are reported in Supp. Mat. Section S.8.7.
Table 4 presents, for each response, the estimated difference in the expected log-pointwise predictive density (ELPPD) (Vehtari et al., 2017) between the BVSGCR model and single-response regression models. To ensure a fair comparison, we used the same prior specification for the latent binary vector and implemented the same search algorithm, see Supp. Mat. Section S.2. It is clear that the proposed model delivers more accurate predictions than those obtained by any single-response regression model. As expected from the simulation study, for the binary responses (Peripheral Neuropathy and Malignancy) the difference is less marked, with the lowest ELPPD difference for Malignancy, which is uncorrelated with the other responses and only mildly associated with Missense Mutation. Finally, Supp. Mat. Figure S.8 presents the associations detected by single-response regression models in the A-T data set.
Table 4.
Difference in the expected log-pointwise predictive density (ELPPD) between the BVS-GCR model and independent single-response regression models. A positive difference indicates that the proposed model has better predictive performance (standard errors in brackets).
| Type | Response | Difference in ELPPD |
|---|---|---|
| Gaussian | SARA | 23.3 (6.4) |
| Gaussian | A-T NEST | 23.0 (6.7) |
| Gaussian | Age Wheelchair | 25.9 (7.1) |
| Gaussian | Alpha-Fetoprotein | 19.4 (5.4) |
| Binary | Peripheral Neuropathy | 4.1 (1.8) |
| Binary | Malignancy | 0.1 (0.6) |
| Ordinal (3 categories) | Severity | 17.1 (4.5) |
| Ordinal (4 categories) | Progression | 35.6 (7.6) |
| Ordinal (3 categories) | Eye Movements | 15.1 (2.7) |
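For intuition on the quantities reported in Table 4, the ELPPD difference and its standard error, as returned by the loo R-package, can be computed from the pointwise log predictive densities of the two models as follows (a sketch of the Vehtari et al. (2017) quantities, not the package's implementation):

```python
import numpy as np

def elppd_difference(lpd_a, lpd_b):
    """Difference in expected log-pointwise predictive density between two
    models, with its standard error computed from the pointwise differences
    (as in Vehtari et al., 2017). lpd_a, lpd_b: arrays of pointwise log
    predictive densities on the same n observations."""
    d = np.asarray(lpd_a, dtype=float) - np.asarray(lpd_b, dtype=float)
    return d.sum(), np.sqrt(len(d) * d.var(ddof=1))
```

A positive difference favours the first model, and differences within roughly two standard errors of zero, as for Malignancy in Table 4, are inconclusive.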
5.3. Patients with Temporal Lobe Epilepsy
5.3.1. Data and BVSGCR model
We also applied the BVSGCR model to a second data set consisting of m = 5 responses and pk = 162, k = 1, …, m, common predictors. We investigated the relationship between human cognition and epilepsy based on recent data collected by Johnson et al. (2016). The authors used n = 122 fresh-frozen whole-hippocampus samples, surgically resected from patients with Temporal Lobe Epilepsy (TLE), to determine whether genes belonging to their inferred gene-regulatory networks are related to human memory abilities and to the number of seizures measured on the same individuals. More precisely, the responses (% of missing values in brackets) comprise the average number of self-reported daily Seizures for each patient (10%, as we excluded some extremely large observations likely due to errors in the self-reported number of Seizures), the memory category to which the patients were assigned after assessment by a neurologist (14%) and the results (Learning (15%), Post-Interference (15%) and Delayed Recall (15%)) of the Verbal Learning Test (Thiel et al., 2016) that quantifies human cognition abilities. We also considered the following confounding predictors: sex, age of manifestation of epilepsy, age at neurological assessment, anti-epileptic drug load, handedness and laterality (brain lobe) of TLE. The 162 correlated genes were obtained from a network analysis described in Johnson et al. (2016). Both confounders and gene expression predictors were standardized to have unit variance.
We modeled the number of self-reported seizures with a negative binomial distribution and we used an ordinal variable for the memory categories in which the patients have been assigned by the neurologist. Finally, we assumed that the number of correct words that each patient recalls in each one of the three tasks of the Verbal Learning Test is distributed as a binomial random variable with 15 trials. To set the hyper-parameters of the prior that controls the level of sparsity, we followed the same procedure used in the simulation study with E(γk) = 5 and Var(γk) = 9 for all k. The prior distributions on θk for the negative binomial and the ordered categorical responses are presented in Section 2.3.
5.3.2. Results
Figure 5 displays the estimated MPPI of the associations between the 162 correlated genes and cognition abilities, the number of seizures and the memory classification. The most striking finding is the ubiquitous role of the RBFOX1 gene in Learning, Post-Interference and Delayed Recall as well as in Memory Category. It has been shown that mutations in this gene lead to neurodevelopmental disorders, and it has also recently been implicated in cognitive functions (Davies et al., 2018). Regarding the Learning task, animal model studies have revealed critical functions of the GABRB1 gene in the maintenance of functioning circuits in the adult brain (Gehman et al., 2011). The genetic regulation of Delayed Recall is more complex, in line with the more difficult memory task that individuals were asked to perform.
Figure 5.
Detection of important genes that predict human cognition abilities, number of Seizures and memory classification in 122 individuals with TLE. Marginal posterior probabilities of inclusion (MPPI) measure the strength of the predictor-response association. For each response, only associations with MPPI > 0.05 are highlighted.
In contrast to the cognition abilities, the associations with the number of Seizures are weaker. This phenomenon can be explained by the quality of the self-reported data: before the analysis, we removed 10 measurements that appeared to be outliers. Despite this, the association with the WNT3 gene seems interesting, since deregulation of WNT signalling has a fundamental role in the origin of neurological diseases (Oliva et al., 2013).
We conclude the description of the association results by comparing the outcome of the proposed BVSGCR model with single-response regression models, see Supp. Mat. Figure S.9. To conduct BVS for the ordinal response, we used the method proposed by Holmes et al. (2002), and for the negative binomial and binomial responses we utilized the auxiliary mixture sampling method of Frühwirth-Schnatter et al. (2009) implemented in the R-package pogit (Dvorzak and Wagner, 2016). For the latter, we matched the moments of the prior on γk with the hyper-parameters of the BVSGCR sparsity prior. From the comparisons, we noticed that for the binomial responses the number of associations identified by the single-response regression models is either too large (Learning and Delayed Recall) or almost nil (Post-Interference). This may depend on the Gibbs sampling search algorithm implemented in the R-package pogit, which does not perform well when a large number of correlated predictors is considered (Bottolo and Richardson, 2010). In contrast, our proposal distribution for γk allows the quick detection of relevant predictors that explain a large fraction of the responses’ variability, see Supp. Mat. Figure S.11.
Figure 6 presents the conditional independence graph for the group of responses considered when a decomposable graphical model is assumed. Similarly to the A-T disorder, we do not expect to capture the whole of the responses’ variability and covariation given the set of predictors considered and the (potentially important) unmeasured covariates. Interestingly, the responses of the Verbal Learning Test are all connected, with Memory Category linked only with Delayed Recall, suggesting that the neurologist’s classification of patients strongly reflects the Delayed Recall score. Finally, the number of Seizures is conditionally independent of the memory tasks of the Verbal Learning Test. This can be explained either by the self-reported quality of the data or by the fact that we removed the effect of age of onset, which is known to be negatively correlated with both Seizures and cognition abilities.
Figure 6.
Graph implied by the non-zero pattern of R−1 in the TLE data set and estimated by using the edge posterior probabilities of inclusion (EPPI) obtained by specifying a decomposable graphical model. Gray scale and edge thickness specify different levels of the EPPI (black thick line indicates large EPPI values).
We also checked whether the assumption of decomposability of the graphical model is realistic. The support for non-decomposable graphs is overwhelming, with 96% of the posterior mass concentrated on them. There is a general agreement regarding the detection of important edges between the two instances of the BVSGCR model, although for non-decomposable graphs the neurologist’s memory classification of the patients seems to be based, more reasonably, on the whole set of results of the Verbal Learning Test, see Supp. Mat. Figure S.15.
Finally, Table 5 presents, for each response, the estimated difference in the expected log-pointwise predictive density (Vehtari et al., 2017) between the BVSGCR model and single-response regression models. The differences are all largely positive, indicating that the proposed model has better predictive performance, apart from the number of Seizures, which is weakly associated with the set of genes and conditionally independent of the other responses. In any case, the small difference and the large standard error make it difficult to draw any clear conclusion about the best predictive model for this trait.
Table 5.
Difference in the expected log-pointwise predictive density (ELPPD) between the BVS-GCR model and independent single-response regression models. A positive difference indicates that the proposed model has better predictive performance (standard errors in brackets).
| Type | Response | Difference in ELPPD |
|---|---|---|
| Binomial | Learning | 51.1 (9.6) |
| Binomial | Post-Interference | 49.4 (10.4) |
| Binomial | Delayed Recall | 93.1 (14.5) |
| Neg. Binomial | Seizures | -8.8 (10.4) |
| Ordinal (5 categories) | Memory Category | 59.3 (12.9) |
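The ELPPD differences and standard errors of the kind reported in Table 5 follow the recipe of Vehtari et al. (2017): sum the pointwise log-predictive-density differences over held-out observations, and scale the sample standard deviation of those pointwise differences by the square root of the sample size. A minimal sketch (the function name and input arrays are ours, for illustration only):

```python
import numpy as np

def elppd_difference(lpd_model_a, lpd_model_b):
    """Difference in expected log-pointwise predictive density between
    two models, with its standard error (Vehtari et al., 2017).

    lpd_model_a, lpd_model_b: arrays of pointwise log predictive
    densities, one entry per held-out observation, under each model."""
    diff = np.asarray(lpd_model_a) - np.asarray(lpd_model_b)
    n = diff.size
    # total difference, and se = sqrt(n * sample variance of pointwise diffs)
    return diff.sum(), np.sqrt(n * diff.var(ddof=1))
```

A difference that is large relative to its standard error (as for the three Binomial responses in Table 5) supports the model with the higher ELPPD; the Seizures row illustrates the opposite, inconclusive case.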
6. Discussion
In this paper, we have presented a novel approach for BVS when a combination of discrete and/or continuous responses is jointly considered. The proposed method allows the exploration of the model space consisting of all possible subsets of predictors while estimating the conditional dependence structure among the responses, and vice versa.
We have shown that for some continuous and discrete outcomes, including Gaussian, binary and ordinal responses, the regression coefficients can be “implicitly marginalised” in the acceptance probability of the M-H step for the joint update of (βk, γk). This reduces the posterior correlation between these two quantities and thus improves the mixing of the designed sampler. The “implicit marginalisation” of the non-zero regression coefficients also holds when an unordered categorical variable is considered. In contrast to binary or ordinal responses, where only one latent variable is required, in the unordered categorical case the state space is expanded by C − 1 latent variables, where C is the number of categories. Combined with an effective proposal distribution for the latent binary vector, based on the marginal screening of important predictors for each response, our approach allows an efficient exploration of the ultra-high-dimensional predictors’ model space.
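To convey the intuition behind marginalising the regression coefficients in the acceptance probability, the following sketch shows the simplest analogue: a single Gaussian response with a g-prior, where β is integrated out analytically and the M-H step updates only the binary inclusion vector γ. This is an illustration of the marginalisation idea, not the BVSGCR sampler itself; the function names, the flip proposal and the prior settings are ours.

```python
import numpy as np

def log_marg_lik(y, X, gamma, g=100.0):
    """Log marginal likelihood of a Gaussian linear model with the
    coefficients of the selected predictors integrated out under a
    g-prior (up to an additive constant). gamma is a boolean mask."""
    n = y.size
    Xg = X[:, gamma]
    rss = y @ y
    if Xg.shape[1] > 0:
        # g/(1+g) shrinkage factor arises from the g-prior marginalisation
        coef, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        rss -= (g / (1.0 + g)) * ((Xg @ coef) @ y)
    return -0.5 * Xg.shape[1] * np.log(1.0 + g) - 0.5 * n * np.log(rss)

def mh_flip(y, X, gamma, rng, prior_incl=0.1, g=100.0):
    """One M-H step: flip a uniformly chosen entry of gamma and accept
    with the ratio of marginalised likelihoods times the prior ratio."""
    j = rng.integers(gamma.size)
    prop = gamma.copy()
    prop[j] = ~prop[j]
    log_prior = (np.log(prior_incl / (1.0 - prior_incl))
                 if prop[j] else np.log((1.0 - prior_incl) / prior_incl))
    log_a = (log_marg_lik(y, X, prop, g)
             - log_marg_lik(y, X, gamma, g) + log_prior)
    return prop if np.log(rng.uniform()) < log_a else gamma
```

Because β never appears in the acceptance ratio, the chain moves through model space without the proposal for β and the current γ having to agree, which is exactly the source of the improved mixing discussed above.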
We tested the proposed method on simulated and real data sets and compared it with widely used sparse Bayesian single-response linear and non-linear regression models. In all examples considered, except the binary case, BVSGCR outperformed existing BVS algorithms in terms of the selection of important predictors and/or the estimation of the non-zero regression coefficients. In the simulation study, we have also demonstrated that covariance selection is key when the sample size is small and the number of responses is large.
We conclude with some final remarks on directions for future research. As demonstrated in the simulation study, in the case of multiple-response count data the proposed BVSGCR method performs better than single-response regression models because it exploits the Gaussian copula model. However, in the same simulated example, the proposal distribution that takes the correlation between responses into account performs no better than the one that ignores it. When the “implicit marginalisation” is not possible, the M-H step does not seem to benefit from how accurately the residual correlations between the responses are estimated, even when this information is used in the proposal distribution. It is therefore important to specify the cdf of the marginal distribution so that it allows the marginalisation of the regression coefficients. For count data, this may be accomplished by using the generalized ordered-response Probit (GORP) model presented in Castro et al. (2012), which can be expressed as a function of the Gaussian cdf. An interesting avenue for future work is the assessment of the similarities between the GORP model and Generalized Linear Models, to fully exploit the benefits of marginalising the regression coefficients and estimating the residual correlations between the responses in the analysis of multivariate count data.
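The appeal of the GORP construction is that each count probability is a difference of Gaussian cdfs evaluated at latent thresholds. The sketch below shows one simple instance, where the thresholds are chosen so that a zero latent mean reproduces a Poisson marginal; this particular threshold choice is ours for illustration, whereas Castro et al. (2012) allow considerably more flexible threshold functions.

```python
import math
from statistics import NormalDist

_std = NormalDist()

def _poisson_cdf(k, lam):
    """F(k) of a Poisson(lam), by direct summation."""
    if k < 0:
        return 0.0
    return sum(math.exp(-lam) * lam**i / math.factorial(i)
               for i in range(k + 1))

def _threshold(c):
    """Gaussian quantile of a cumulative probability, with end-points."""
    if c <= 0.0:
        return -math.inf
    if c >= 1.0:
        return math.inf
    return _std.inv_cdf(c)

def gorp_pmf(k, mu, lam):
    """P(Y = k) when a latent Gaussian with mean mu is sliced at
    thresholds psi_k = Phi^{-1}(F_Poisson(k; lam)), so that mu = 0
    recovers a Poisson(lam) marginal (illustrative choice)."""
    upper = _threshold(_poisson_cdf(k, lam))
    lower = _threshold(_poisson_cdf(k - 1, lam))
    return _std.cdf(upper - mu) - _std.cdf(lower - mu)
```

Since the pmf is built entirely from Gaussian cdfs of a latent variable with mean μ = β′x, the same latent-Gaussian machinery used for binary and ordinal responses, and hence the marginalisation of β, could in principle carry over to counts.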
In summary, our new BVSGCR algorithm is tailored to jointly analyze correlated responses of diverse types with missing observations and a large set of predictors. Besides the application to clinically relevant data, the proposed method can also be used in the analysis of other problems for which sparse regression algorithms for combinations of discrete and/or continuous responses are required.
Supplementary Material
Acknowledgments
The authors gratefully acknowledge the MRC grant MR/M013138/1, The BHF-Turing Cardiovascular Data Science Awards 2017 and The Alan Turing Institute under the Engineering and Physical Sciences Research Council grant EP/N510129/1. The authors are thankful to Katherine Schon and Michael Johnson for providing the data of the Ataxia-Telangiectasia disorder and Temporal Lobe Epilepsy examples and to Hélène Ruffieux and Verena Zuber for insightful comments. The authors are also grateful to the editor, associate editor and two anonymous referees for their valuable comments that greatly improved the presentation of the paper.
Contributor Information
A. Alexopoulos, Department of Statistical Science, University College London
L. Bottolo, Department of Medical Genetics, University of Cambridge, The Alan Turing Institute, London, MRC Biostatistics Unit, University of Cambridge
References
- Albert JH, Chib S. Bayesian analysis of binary and polychotomous response data. J Am Stat Assoc. 1993;88(422):669–679.
- Bhadra A, Rao A, Baladandayuthapani V. Inferring network structure in non-normal and mixed discrete-continuous genomic data. Biometrics. 2018;74(1):185–195. doi: 10.1111/biom.12711.
- Bornn L, Caron F, et al. Bayesian clustering in decomposable graphs. Bayesian Anal. 2011;6(4):829–846.
- Bottolo L, Richardson S. Evolutionary stochastic search for Bayesian model exploration. Bayesian Anal. 2010;5(3):583–618.
- Bové DS, Held L. Hyper-g priors for generalized linear models. Bayesian Anal. 2011;6(3):387–410.
- Brooks SP, Giudici P, Roberts GO. Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions. J R Stat Soc Series B Stat Methodol. 2003;65(1):3–39.
- Brown PJ, Vannucci M, Fearn T. Multivariate Bayesian variable selection and prediction. J R Stat Soc Series B Stat Methodol. 1998;60(3):627–641.
- Castro M, Paleti R, Bhat CR. A latent variable representation of count data models to accommodate spatial and temporal dependence: Application to predicting crash frequency at intersections. Transport Res B-Meth. 2012;46(1):253–272.
- Chib S, Greenberg E. Analysis of multivariate probit models. Biometrika. 1998;85(2):347–361.
- Davies G, Lam M, Harris SE, Trampush JW, et al. Study of 300,486 individuals identifies 148 independent genetic loci influencing general cognitive function. Nat Commun. 2018;9(1):2098. doi: 10.1038/s41467-018-04362-x.
- Dawid AP, Lauritzen SL. Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann Stat. 1993;21(3):1272–1317.
- Dellaportas P, Forster JJ, Ntzoufras I. On Bayesian model and variable selection using MCMC. Stat Comput. 2002;12(1):27–36.
- Deshpande SK, Ročková V, George EI. Simultaneous variable and covariance selection with the multivariate spike-and-slab lasso. J Comput Graph Stat. 2019;28(4):1–11.
- Dvorzak M, Wagner H. Sparse Bayesian modelling of underreported count data. Stat Model. 2016;16(1):24–46.
- Forster JJ, Gill RC, Overstall AM. Reversible jump methods for generalised linear models and generalised linear mixed models. Stat Comput. 2012;22(1):107–120.
- Frühwirth-Schnatter S, Frühwirth R, Held L, Rue H. Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data. Stat Comput. 2009;19(4):479–492.
- Gehman LT, Stoilov P, Maguire J, Damianov A, et al. The splicing regulator RBFOX1 (A2BP1) controls neuronal excitation in the mammalian brain. Nature Genet. 2011;43(7):706. doi: 10.1038/ng.841.
- George EI, McCulloch RE. Approaches for Bayesian variable selection. Stat Sin. 1997;7(2):339–373.
- Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc. 2007;102(477):359–378.
- Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82(4):711–732.
- Guan Y, Stephens M. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann Appl Stat. 2011;5(3):1780–1815.
- Hans C, Dobra A, West M. Shotgun stochastic search for “large p” regression. J Am Stat Assoc. 2007;102(478):507–516.
- Hoff PD. Extending the rank likelihood for semiparametric copula estimation. Ann Appl Stat. 2007;1(1):265–283.
- Holmes C, Denison DT, Mallick B. Accounting for model uncertainty in seemingly unrelated regressions. J Comput Graph Stat. 2002;11(3):533–551.
- Holmes CC, Held L. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Anal. 2006;1(1):145–168.
- Johnson MR, Shkura K, Langley SR, Delahaye-Duriez A, et al. Systems genetics identifies a convergent gene network for cognition and neurodevelopmental disease. Nat Neurosci. 2016;19(2):223–232. doi: 10.1038/nn.4205.
- Kohn R, Smith M, Chan D. Nonparametric regression using linear combinations of basis functions. Stat Comput. 2001;11(4):313–322.
- Lamnisos D, Griffin JE, Steel MF. Transdimensional sampling algorithms for Bayesian variable selection in classification problems with many more variables than observations. J Comput Graph Stat. 2009;18(3):592–612.
- Lauritzen SL. Graphical Models. Clarendon Press; 1996.
- Lenkoski A. A direct sampler for G-Wishart variates. Stat. 2013;2(1):119–128.
- Liang F, Paulo R, Molina G, Clyde MA, Berger JO. Mixtures of g-priors for Bayesian variable selection. J Am Stat Assoc. 2008;103(481):410–423.
- Liu JS, Wu YN. Parameter expansion for data augmentation. J Am Stat Assoc. 1999;94(448):1264–1274.
- McCullagh P. Regression models for ordinal data. J R Stat Soc Series B Stat Methodol. 1980;42(2):109–142.
- McCullagh P, Nelder JA. Generalized Linear Models. CRC Press; 1989.
- Mohammadi A, Wit EC. BDgraph: An R package for Bayesian structure learning in graphical models. J Stat Softw. 2019;89(3):1–30.
- Murray JS, Dunson DB, Carin L, Lucas JE. Bayesian Gaussian copula factor models for mixed data. J Am Stat Assoc. 2013;108(502):656–665. doi: 10.1080/01621459.2012.762328.
- Oliva CA, Vargas JY, Inestrosa NC. Wnts in adult brain: from synaptic plasticity to cognitive deficiencies. Front Cell Neurosci. 2013;7:224. doi: 10.3389/fncel.2013.00224.
- Pitt M, Chan D, Kohn R. Efficient Bayesian inference for Gaussian copula regression models. Biometrika. 2006;93(3):537–554.
- Richardson S, Bottolo L, Rosenthal J. Bayesian models for sparse regression analysis of high dimensional data. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M, editors. Bayesian Statistics. Vol. 9. Oxford University Press; 2010. pp. 539–568.
- Ročková V, George EI. The spike-and-slab lasso. J Am Stat Assoc. 2018;113(521):431–444.
- Rothman AJ, Levina E, Zhu J. Sparse multivariate regression with covariance estimation. J Comput Graph Stat. 2010;19(4):947–962. doi: 10.1198/jcgs.2010.09188.
- Ruffieux H, Davison AC, Hager J, Inshaw J, Fairfax BP, Richardson S, Bottolo L, et al. A global-local approach for detecting hotspots in multiple-response regression. Ann Appl Stat. 2020;14(2):905–928. doi: 10.1214/20-AOAS1332.
- Schon K, van Os NJ, Oscroft N, Baxendale H, et al. Genotype, extrapyramidal features, and severity of variant ataxia-telangiectasia. Ann Neurol. 2019;85(2):170–180. doi: 10.1002/ana.25394.
- Sklar M. Fonctions de répartition à n dimensions et leurs marges. Publ Inst Statist Univ Paris. 1959;10(8):229–231.
- Song PX-K. Multivariate dispersion models generated from Gaussian copula. Scand J Stat. 2000;27(2):305–320.
- Song PX-K, Li M, Yuan Y. Joint regression analysis of correlated data using Gaussian copulas. Biometrics. 2009;65(1):60–68. doi: 10.1111/j.1541-0420.2008.01058.x.
- Talhouk A, Doucet A, Murphy K. Efficient Bayesian inference for multivariate Probit models with sparse inverse correlation matrices. J Comput Graph Stat. 2012;21(3):739–757.
- Thiel CM, Özyurt J, Nogueira W, Puschmann S. Effects of age on long term memory for degraded speech. Front Hum Neurosci. 2016;10:473. doi: 10.3389/fnhum.2016.00473.
- Van Dyk DA, Meng X-L. The art of data augmentation. J Comput Graph Stat. 2001;10(1):1–50.
- Vehtari A, Gabry J, Yao Y, Gelman A. LOO: Efficient leave-one-out cross-validation and WAIC for Bayesian models. R package version 2.0.0. 2018.
- Vehtari A, Gelman A, Gabry J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat Comput. 2017;27(5):1413–1432.
- Wang H. Sparse seemingly unrelated regression modelling: Applications in finance and econometrics. Comput Stat Data Anal. 2010;54(11):2866–2877.
- Wang H, Li SZ. Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electron J Stat. 2012;6:168–198.
- Webb EL, Forster JJ. Bayesian model determination for multivariate ordinal and binary data. Comput Stat Data Anal. 2008;52(5):2632–2649.
- Yu Y, Meng X-L. To center or not to center: That is not the question—an Ancillarity–Sufficiency Interweaving Strategy (ASIS) for boosting MCMC efficiency. J Comput Graph Stat. 2011;20(3):531–570.
- Zellner A. An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. J Am Stat Assoc. 1962;57(298):348–368.
- Zhang X, Boscardin WJ, Belin TR, Wan X, He Y, Zhang K. A Bayesian method for analyzing combinations of continuous, ordinal, and nominal categorical data with missing values. J Multivar Anal. 2015;135:43–58.