Abstract
We introduce variational Bayes methods for fast approximate inference in functional regression analysis. Both the standard cross-sectional and the increasingly common longitudinal settings are treated. The methodology allows Bayesian functional regression analyses to be conducted without the computational overhead of Monte Carlo methods. Confidence intervals for the model parameters are obtained using both the approximate variational approach and nonparametric resampling of clusters. The latter approach is possible because our variational Bayes functional regression approach is computationally efficient. A simulation study indicates that variational Bayes is highly accurate in estimating the parameters of interest and in approximating the Markov chain Monte Carlo-sampled joint posterior distribution of the model parameters. The methods apply generally, but are motivated by a longitudinal neuroimaging study of multiple sclerosis patients. Code used in simulations is made available as a web-supplement.
Keywords and phrases: Approximate Bayesian inference, Markov chain Monte Carlo, penalized splines
1. Introduction
Due to ever-expanding methods for the acquisition and storage of information, functional data is often encountered in scientific applications. A common problem in the field of functional data analysis is determining the relationship between a scalar outcome Y and a densely observed functional predictor X(t) [18, 22, 9]. Increasingly, this problem is longitudinal in that both the functional predictors and scalar outcomes are observed at several visits for each subject. Bayesian approaches to cross-sectional and longitudinal functional regression possess a number of advantages, including the ability to jointly model the observed functions and scalar outcomes and easily constructed credible intervals [5, 7]. However, these approaches require computationally expensive Markov chain Monte Carlo (MCMC) simulations of joint posterior distributions. The goal of this paper is to introduce a fast and scalable alternative to accommodate new types of data sets.
Variational approximations, now regularly used in computer science, are a collection of techniques for deriving approximate solutions to inference problems [10, 11, 25]. They have a growing visibility in the statistics literature [16, 19, 24]. In the Bayesian context, these methods are useful in approximating intractable posterior density functions. While this approximation sacrifices some of MCMC’s accuracy, it provides large gains in terms of computational feasibility, especially in large-data settings.
In this article, we derive an iterative algorithm for approximate Bayesian inference in functional regression. Using this algorithm, inference on model parameters can be obtained several orders of magnitude faster than MCMC sampling methods. Importantly, the construction of credible intervals for the functional coefficient is straightforward. Moreover, this procedure retains the ability to jointly model the predictor process and the scalar outcome. The computational advantage conveyed by the variational methods also allows resampling techniques, such as the nonparametric bootstrap of subjects, to be used. Unlike MCMC sampling, the variational approach cannot be made arbitrarily accurate. However, simulations indicate that the quality of the approximation is high in our setting. Our variational method is not designed to replace MCMC sampling, but it is a useful additional inferential tool in that it provides near-instant and highly accurate approximate posterior distributions. This will become increasingly relevant as functional datasets become larger and more complex.
In particular, we develop variational Bayes methods for two functional regression models: the classic cross-sectional case, in which a single scalar outcome and functional predictor are observed for each subject; and the more recent longitudinal case, in which scalar outcomes and functional predictors are observed repeatedly for each subject. This methodology is based on a penalized approach to functional regression that is flexible and widely applicable [6]. Although variational techniques typically incur initial algebraic and implementation costs, the present article carries out the necessary derivations and provides code, alleviating this burden.
We apply the methods developed to a longitudinal neuroimaging study, in which multiple sclerosis patients undergo both tests of cognitive ability and a diffusion tensor imaging scan at each of several visits. From the diffusion tensor imaging scans, we construct functional predictors that provide detailed quantitative information about major white matter fiber bundles (see Figure 1). Because multiple sclerosis results in the degradation of cerebral white matter, researchers hope to use the functional predictors and cognitive disability measures to understand the progression of the disease. This study was previously analyzed in [7].
In Section 2 we introduce variational Bayes and a penalized approach to functional regression. Section 3 combines these ideas and develops a scalable iterative algorithm for approximate Bayesian inference in functional regression. The results of a simulation study are described in Section 4 and a real-data analysis is performed in Section 5. We conclude the main text with a discussion in Section 6. Appendices A and B contain algebraic derivations and expressions used in the construction of the iterative algorithm. All code used in the simulation study is available as a web-supplement to this article.
2. Background
In the following subsections we introduce variational approximations for Bayesian inference and an approach to functional regression which uses penalized B-splines to estimate the coefficient function.
2.1. Variational Bayes
Here we give an overview of variational Bayes; for a more complete treatment see [19] and [3], Chapter 10.
Bayesian inference is based on the posterior density function

p(θ | y) = p(y, θ)/p(y),
where θ ∈ Θ is the parameter vector, y is the observed data, p(y) is the marginal likelihood of the observed data, and p(y, θ) is the joint likelihood of the data and model parameters. The goal of the density transform approach is to approximate the posterior density p(θ|y) by a function q(θ) for which the q-specific lower bound on the marginal likelihood (defined below) is more tractable than the marginal likelihood itself. The first step is to restrict q to a more manageable class of densities and choose the element of that class with minimum Kullback-Leibler distance from p(θ|y).
More concretely, let q be an arbitrary density function over Θ. Then
log p(y) ≥ ∫ q(θ) log{p(y, θ)/q(θ)} dθ, (2.1)
with equality if and only if q(θ) = p(θ|y) almost everywhere [12]. It follows that p(y) ≥ p(y; q); we define the q-specific lower bound on the marginal likelihood as
p(y; q) ≡ exp ∫ q(θ) log{p(y, θ)/q(θ)} dθ. (2.2)
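Writing the marginal log-likelihood in terms of this bound gives the standard decomposition (stated here for completeness, using the definitions above):

log p(y) = ∫ q(θ) log{p(y, θ)/q(θ)} dθ + ∫ q(θ) log{q(θ)/p(θ|y)} dθ = log p(y; q) + KL{q(θ) ‖ p(θ|y)}.

Because log p(y) does not depend on q, increasing log p(y; q) necessarily decreases the Kullback-Leibler term.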
Thus, minimizing the Kullback-Leibler distance between q(θ) and p(θ|y) is equivalent to maximizing p(y; q). Stated generally, the following result holds.
Result 2.1
Let u and v be continuous random vectors with joint density p(u, v). Then the maximum over density functions q(u) of

p(v; q) ≡ exp ∫ q(u) log{p(u, v)/q(u)} du

is achieved by q*(u) = p(u|v).
Next, we restrict q to a class of functions for which p(y; q) is more tractable than p(y). While several restrictions are possible, here we focus on the product density transform: we assume that for some partition {θ1, …, θL} of θ it is possible to write q(θ) = q1(θ1) × ⋯ × qL(θL). In the functional regression setting, the posterior dependence of some subsets of the model parameters is weak and the assumption that q factorizes provides accurate approximate inference. In other settings where the posterior dependence of parameters is stronger, this assumption may lead to poor approximations and inference due to the failure to account for correlations between model parameters. There are three simple strategies to gain insight into which sets of parameters are a posteriori weakly correlated: 1) theoretical work on asymptotic posterior correlations; 2) Bayesian inference on smaller or simpler data sets; and 3) prior experience. If posterior correlation is potentially problematic, a more flexible component density ql that allows for this correlation could be used; however, this must be balanced against the simplification desired in the approximating class of functions. While none of these approaches is infallible, combined with variational approximations they provide a valuable and computationally tractable route to approximate Bayesian inference. The methods provided in this paper are intended as a complement to, not a replacement for, MCMC-based Bayesian computation.
Combining the assumption that q factorizes over a partition of θ with Result 2.1, we can derive explicit solutions for each factor ql(θl), 1 ≤ l ≤ L, in terms of the remaining factors. The explicit solution for each ql(θl) is derived as follows. Assuming that q is subject to the factorization restriction, it follows that

log p(y; q) = ∫ q1(θ1) ⋯ qL(θL) {log p(y, θ) − log q1(θ1) − ⋯ − log qL(θL)} dθ.

Define the joint density function p̃(y, θ1) to be

p̃(y, θ1) ∝ exp{E−θ1 log p(y, θ)},

so that

log p(y; q) = ∫ q1(θ1) log{p̃(y, θ1)/q1(θ1)} dθ1 + terms not involving q1.

Then, using Result 2.1, the optimal q1 is

q1*(θ1) = p̃(θ1 | y) ∝ exp{E−θ1 log p(y, θ)},

where E−θ1 log p(y, θ) denotes the expectation with respect to q2(θ2) ⋯ qL(θL). The same argument for l in 1, …, L yields optimal densities satisfying
ql*(θl) ∝ exp{E−θl log p(θl | rest)}, 1 ≤ l ≤ L, (2.3)
where rest ≡ {y, θ1, …, θl−1, θl+1, …, θL} is the collection of all remaining parameters and the observed data. Solving for each factor in terms of the others leads to an iterative algorithm for obtaining a solution for q. We update each factor in turn until the change in p(y; q) is negligible.
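Before specializing to functional regression, the product density scheme can be illustrated in a toy conjugate setting. The sketch below is not part of this article's web supplement; the model, data, and hyperparameter values are purely illustrative. It applies the coordinate updates implied by (2.3) to a Gaussian sample with a normal prior on the mean and an inverse gamma prior on the variance, cycling until the change between iterations is negligible.

```r
## Mean-field VB for y_i ~ N(mu, sigma2) with priors mu ~ N(0, s2_mu0) and
## sigma2 ~ Inverse-Gamma(A, B); a minimal sketch of the product density transform.
set.seed(1)
y <- rnorm(200, mean = 2, sd = 1.5)
n <- length(y)
A <- 0.01; B <- 0.01            # inverse gamma hyperparameters (illustrative values)
s2_mu0 <- 100                   # diffuse prior variance for mu

E_inv_sigma2 <- 1               # initialize E_q(1/sigma2)
for (iter in 1:100) {
  ## update q*(mu) = N(mu_q, s2_q), holding q(sigma2) fixed
  s2_q <- 1 / (n * E_inv_sigma2 + 1 / s2_mu0)
  mu_q <- s2_q * E_inv_sigma2 * sum(y)
  ## update q*(sigma2) = IG(A + n/2, B_q), holding q(mu) fixed
  B_q <- B + 0.5 * (sum((y - mu_q)^2) + n * s2_q)
  E_new <- (A + n / 2) / B_q
  if (abs(E_new - E_inv_sigma2) < 1e-8) break
  E_inv_sigma2 <- E_new
}
c(mu_q = mu_q, post_mean_sigma2 = B_q / (A + n / 2 - 1))
```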
2.2. Penalized functional regression
Next we introduce penalized approaches to cross-sectional and longitudinal functional regression [5, 6, 7].
In the cross-sectional case, we observe data of the form [Yi, Xi(t), zi] for subjects 1 ≤ i ≤ I, where Yi is a continuous outcome, Xi(t) is a functional covariate, and zi is a 1 × p vector of non-functional covariates. The linear functional regression model is given by [4, 21]
Yi = zi β + ∫ Xi(t) γ(t) dt + εi. (2.4)
We call the parameter γ(t) the coefficient function. In practice, the predictor functions Xi(t) are observed over a discrete grid, and often with error. That is, we observe {Wi(tij): tij ∈ [0, 1]} for 1 ≤ i ≤ I and 1 ≤ j ≤ Ji, where Wi(tij) = Xi(tij) + δij and the δij are independent mean-zero measurement errors. The sampling scheme on which the functional predictors are observed may take a variety of forms: points may be equally or unequally spaced, sparse or dense at the subject level, identical or different across subjects. For simplicity, we will assume that all subjects are observed over the same grid {t1, …, tN} and at an equal number of visits J. Extensions to different grids and different numbers of visits are straightforward, but come with a considerable increase in notational complexity.
To estimate the parameters in model (2.4), we use the following two-stage procedure. First, the predictor functions Xi(t) are expressed using a principal components (PC) decomposition. Second, the coefficient function γ(t) is estimated using penalized B-splines. Smoothness of γ̂(t) is explicitly induced via a mixed effects model. Specifically, let Σ̂X(s, t) be an estimator of the covariance operator Cov (Xi(s), Xi(t)) based on the available functional observations. Further, let Σ̂X(s, t) = Σk λk ψk(s) ψk(t) be the spectral decomposition of Σ̂X(s, t), where λ1 ≥ λ2 ≥ · · · are the non-increasing eigenvalues and ψ(t) = {ψk(t): k ∈ Z+} are the corresponding orthonormal eigenfunctions. An approximation for Xi(t), based on a truncated Karhunen-Loève expansion, is given by Xi(t) ≈ μ(t) + ci1 ψ1(t) + ⋯ + ciKx ψKx(t), where Kx is the truncation lag, the PC loadings cik are uncorrelated random variables with variance λk, and μ(t) is the mean function over all subjects and visits.
Next, we use a large cubic B-spline basis to smoothly estimate the coefficient function γ(t) using a mixed effects model. Let φ(t) = {φ1(t), …, φKg(t)} be a cubic B-spline basis of dimension Kg and write γ(t) = φ(t)g for a vector of spline coefficients g. Then the integral in model (2.4) can be written as ∫ Xi(t) γ(t) dt ≈ ∫ μ(t) γ(t) dt + ci M g, where ci = (ci1, …, ciKx) is the row vector of subject i's PC loadings, and M is a Kx × Kg matrix with (k, l)th entry ∫ ψk(t) φl(t) dt. Smoothness of γ̂(t) is enforced by assuming a modified first order random walk prior on the vector g [13]. That is, we assume gl − gl−1 ∼ N(0, σ²g) for 2 ≤ l ≤ Kg and let σ²g have an inverse gamma prior. These are standard assumptions in Bayesian P-splines modeling [23, 13]. Taken together, we jointly model the scalar outcome Yi and the functional exposure Xi(t) using the following model:
(2.5)
where β are treated as fixed parameters with diffuse priors, D is the covariance matrix induced by the first order random walk prior, and Λ = diag[λ1, …, λKx]. Inference for the functional regression model is based on the posterior density
(2.6)
where C is the matrix of PC loadings constructed by row-stacking the ci, and W is the matrix of observed predictor functions constructed by row-stacking the Wi(t). Because the functional predictors Xi(t) are observed with error, this model extends Bayesian inference for measurement error regression problems to the functional setting. A directed acyclic graph depicting model (2.5) is presented in Figure 2.
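To make the two-stage construction above concrete, the following sketch builds the PC loadings, a cubic B-spline basis, and the matrix M from curves observed on an equally spaced grid. It is illustrative only: the function name is ours, the covariance smoothing typically used in practice is omitted, and it is not the implementation distributed with this article.

```r
## Construct the design quantities for penalized functional regression:
## W is an I x N matrix of observed curves over the grid tgrid.
library(splines)

build_design <- function(W, tgrid, Kx = 10, Kg = 30) {
  dt <- diff(tgrid)[1]                        # assumes an equally spaced grid
  mu <- colMeans(W)                           # estimated mean function
  Wc <- sweep(W, 2, mu)                       # centered curves
  Sigma <- crossprod(Wc) / nrow(W)            # raw covariance estimate (no smoothing here)
  eg  <- eigen(Sigma, symmetric = TRUE)
  psi <- eg$vectors[, 1:Kx] / sqrt(dt)        # eigenfunctions on the grid (approx. orthonormal)
  C   <- (Wc %*% psi) * dt                    # estimated PC loadings c_i
  phi <- bs(tgrid, df = Kg, intercept = TRUE) # cubic B-spline basis phi_l(t)
  M   <- t(psi) %*% phi * dt                  # Kx x Kg matrix of inner products of psi and phi
  list(C = C, M = M, psi = psi, phi = phi, mu = mu)
}
## The design for the functional term is then C %*% M, so that the integral of
## X_i(t) gamma(t) is approximated (up to the mean-function term) by (C %*% M %*% g)[i].
```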
In the longitudinal case, we observe data of the form [Yij, Xij(t), zij] for 1 ≤ i ≤ I and 1 ≤ j ≤ Ji. Thus we observe a distinct functional predictor and scalar outcome for each subject over several visits, and again note that in place of the true functional predictors Xij(t) we often observe a measured-with-error function Wij(t). The longitudinal functional regression model is given by [7]
(2.7)
this differs from model (2.4) through the inclusion of subject-specific random effects, which account for correlation in the repeated outcomes at the subject level. Moreover, longitudinal data sets tend to be much larger than cross-sectional data sets because each subject contributes multiple visits.
Given the advent of multiple observational studies collecting dense functional data at multiple visits, the importance of longitudinal functional regression cannot be overstated. Unfortunately, with the exception of the work in [7], no other approach can currently accommodate the combination of subject-specific random effects and functional predictors necessary to capture the structure of the data. While a wide array of functional regression methods exist, we contend that the specific modeling choices described here made the extension not only possible, but seamless. Estimation of the parameters in the longitudinal setting extends naturally from the procedure outlined for the cross-sectional setting. Again, we express the functional predictors using a PC basis and use a penalized B-spline expansion for the coefficient function. The joint model for the outcome, Yij, and exposure, Xij(t), becomes
(2.8)
Again, inference is based on the posterior density
(2.9)
A directed acyclic graph of the longitudinal functional regression model appears in Figure 2.
3. Variational approximations for penalized functional regression
We now combine the ideas introduced above to develop a scalable iterative algorithm for approximate Bayesian inference in functional regression. We will focus on the longitudinal functional regression model (2.8); the cross-sectional case can be obtained as a special case by omitting the vector of subject-specific random effects b. We pause briefly to introduce the following useful notation: for a scalar random variable θ, let

μq(θ) ≡ Eq(θ) and σ²q(θ) ≡ Varq(θ)
be the mean and variance with respect to the q distribution. For a vector parameter θ, we use the analogously defined μq(θ) and Σq(θ).
As noted, inference in the longitudinal functional regression model is based on the posterior density (2.9). Using variational Bayes, we approximate this posterior density using
(3.1)
and by solving for each factor q(·) in terms of the remaining factors. The additional factorization
follows as a consequence of (2.3) and the structure of the current model as shown in Figure 2 [3, Sec. 10.2.5]. We take advantage of this induced factorization in deriving optimal densities for the variance components in the penalized functional regression model.
To provide an example of how optimal densities are constructed, we derive the optimal densities q*(g) and q*(σ²g); derivations of these and of the densities for the other parameters are provided in Appendix A. Recall the random walk prior on g with smoothing variance σ²g and the inverse gamma prior placed on σ²g in Section 2.2. According to (2.3), the optimal densities are given by
where rest includes both the observed data and all parameters not currently under consideration.
Using the full conditional distribution p(g | rest), the optimal density q*(g) is
where
(3.2)
Thus the optimal density q*(g) is N(μq(g), Σq(g)). Similarly, the optimal density q*(σ²g) is

Thus q*(σ²g) is an inverse gamma density with parameters given by
(3.3)
Note that, when q(σ²g) = q*(σ²g), the term μq(1/σ²g) appearing in (3.2) is equal to the ratio of the shape and rate parameters of q*(σ²g).
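The Gaussian updates above take a familiar ridge-type form. The helper below is a sketch of that generic form under assumed notation; it is not a transcription of (3.2), and the argument names are ours: X is the design for the coefficient block (for example CM for g), r is the working response after subtracting the expected contributions of all other terms, e_prec and g_prec are the current expected reciprocal error and penalty variances, and Pen is the prior precision (penalty) matrix.

```r
## Generic Gaussian coordinate update of the kind used for q*(g) (illustrative).
vb_gaussian_update <- function(X, r, e_prec, g_prec, Pen) {
  Sigma_q <- solve(e_prec * crossprod(X) + g_prec * Pen)        # approximate posterior covariance
  mu_q    <- as.vector(Sigma_q %*% (e_prec * crossprod(X, r)))  # approximate posterior mean
  list(mu = mu_q, Sigma = Sigma_q)
}
```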
Thus, the optimal densities q*(g) and q*(σ²g) belong to parametric families with parameters explicitly determined by the distributions of the remaining model parameters and the observed data. Derivations for the remaining optimal densities are given in Appendix A. Taken together, these solutions lead to Algorithm 1 for approximate Bayesian inference in the functional linear regression setting.
Algorithm 1.
Initialize: the expected reciprocals of the variance components, μq(C) = 0, μq(g) = 0, μq(β) = 0, Σq(g) = I, Λq = I.

Cycle: update the parameters of each optimal density q* in turn, using the expressions derived above and in Appendix A,

until the increase in p(Y, W; q) is negligible.
Further, as shown in Appendix B, the q-specific lower bound on the marginal log-likelihood has the form
where const. is an additive constant that remains unchanged in the iterations of Algorithm 1. All parameters denoted A and B and indexed by a subscript are hyperparameters of the inverse gamma prior distributions of the variance components. The quantity log p(Y, W; q) is typically monitored for convergence in place of p(Y, W; q). Note that, because several substitutions are made to simplify the expression, this form for log p(Y, W; q) is only valid at the end of each iteration of Algorithm 1, and only if the parameters are updated in the order given.
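For orientation, the overall structure of Algorithm 1 can be written as the following skeleton. The update_* and lower_bound functions are placeholders standing in for the expressions derived above and in Appendices A and B; they are not functions supplied with this article.

```r
## Schematic of the coordinate ascent in Algorithm 1 with lower-bound monitoring.
fit_vb <- function(state, tol = 1e-8, maxit = 500) {
  logml_old <- -Inf
  for (iter in 1:maxit) {
    state <- update_g(state)          # q*(g): multivariate normal
    state <- update_b(state)          # q*(b): random effects (longitudinal case)
    state <- update_C(state)          # q*(C): PC loadings
    state <- update_beta(state)       # q*(beta): fixed effects
    state <- update_variances(state)  # inverse gamma factors for the variance components
    logml <- lower_bound(state)       # log p(Y, W; q), valid at the end of a full cycle
    if (logml - logml_old < tol) break
    logml_old <- logml
  }
  state
}
```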
Finally, posterior credible intervals are readily obtained for all model parameters. However, variational approximations in effect fit a parametric distribution to a mode of the posterior density, which may have consequences when the posterior is multi-modal or more diffuse than the approximating parametric distribution; in such cases one could expect that credible intervals from variational Bayes and MCMC sampling may not agree. This was not a problem in our simulations, where the agreement between the approximate and MCMC-sampled posterior distribution is high, although in our application the variational credible intervals are slightly narrower than those from MCMC sampling.
4. Simulations
In this section we undertake simulation exercises with two goals. First, we evaluate our approach's overall ability to accurately estimate all coefficients in a functional regression model. Second, we compare the individual approximate posterior densities q*(θl) to those given by Markov chain Monte Carlo (MCMC) sampling in order to examine the quality of the variational approximation in the functional regression setting. We conduct separate simulations for the cross-sectional and longitudinal situations. The MCMC sampling was executed in WinBUGS and the variational Bayes approach was implemented in R.
4.1. Cross-sectional functional regression
We generate samples from the model
(4.1)
Here we assume I ∈ {100, 500} subjects and generate zi ~ Unif [−5, 5]. We take β2 = 3 and γ(t) = cos(2πt).
To generate our simulated functional predictors, we use the functional predictors from our scientific application in the following way. First, we compute a functional principal components decomposition of the observed application predictors, with eigenfunctions ψ1(t), ψ2(t), … and corresponding eigenvalues λ1, λ2, …. Recall that the application predictors can be approximated using μ(t) + ci1 ψ1(t) + ⋯ + ciKx ψKx(t), where μ(t) is a population mean function, Kx is the truncation lag and the cik are uncorrelated random variables with variance λk. Using this, we construct simulated regressors of the same form, with newly generated PC loadings.
This parametric construction of the simulated functional predictors is related to the application predictors through the mean function μ(t), the eigenfunctions ψk(t), and the variance components λk. As in our application, the simulated predictors are observed on a grid of length 93.
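Simulated data of this kind can be generated along the following lines. The sketch uses stand-in eigenfunctions, eigenvalues, mean function, intercept, and error variances in place of the application-derived quantities, and does not reproduce the exact scalar-covariate specification of (4.1); only β2 = 3, γ(t) = cos(2πt), zi ~ Unif[−5, 5], and the grid of length 93 follow the text.

```r
## Generate one simulated data set for the cross-sectional design (illustrative).
set.seed(2)
I  <- 100
tg <- seq(0, 1, length.out = 93)              # observation grid of length 93
Kx <- 4
psi    <- sapply(1:Kx, function(k) sqrt(2) * sin(k * pi * tg))  # stand-in eigenfunctions
lambda <- 0.5^(1:Kx)                                            # stand-in eigenvalues
Cload  <- sapply(lambda, function(l) rnorm(I, 0, sqrt(l)))      # PC loadings c_ik
X   <- Cload %*% t(psi)                       # true predictor functions (stand-in mean = 0)
W   <- X + matrix(rnorm(I * length(tg), sd = 0.25), I, length(tg))  # observed with error
gam <- cos(2 * pi * tg)                       # true coefficient function
z   <- runif(I, -5, 5)                        # scalar covariate
dt  <- diff(tg)[1]
Y   <- 1 + 3 * z + as.vector(X %*% gam * dt) + rnorm(I)  # intercept and error sd assumed
```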
We generate 100 such datasets for I = 100 and I = 500 and fit model (4.1) using both MCMC simulation and the variational approximation approach. For the MCMC simulation, we use chains of length 2500 with the first 1000 as burn-in. Representative examples of the MCMC model fits were inspected using trace and autocorrelation plots to ensure that the posterior samples were reasonable and that the comparison with variational Bayes was fair. To evaluate the ability of the proposed approach to estimate the functional coefficient γ(t) we use the mean squared error MSE = N⁻¹ Σj {γ̂(tj) − γ(tj)}², with the sum taken over the grid points t1, …, tN. A comparison of MSEs for the variational approach with the more computationally intensive MCMC sampling is given in Table 1. To provide context for this table, in the left panel of Figure 3 we plot the estimated coefficient function corresponding to the median MSE for I = 100.
Table 1.
|         |      | γ(t) | β1   | β2   |
|---------|------|------|------|------|
| I = 100 | VB   | .050 | .071 | .051 |
|         | MCMC | .054 | .071 | .051 |
| I = 500 | VB   | .046 | .008 | .001 |
|         | MCMC | .120 | .008 | .001 |
Interestingly, when I = 500 the MSEs for MCMC sampling contain several large outliers, which raises the average MSE for the coefficient function in Table 1. Upon inspection, it was found that these large values corresponded to model fits in which the chains for g were bimodal. A large primary mode surrounded the true parameter value but a smaller, more diffuse mode corresponded to a model overfit. We refit these models using as initial parameter values the estimates provided by the variational approach, which caused the bimodal behavior to disappear and brought the MSEs (and average MSE) down to levels similar to the remaining model fits.
We also quantify the quality of the variational approximation to the MCMC-sampled posterior by computing the accuracy for each parameter in the model using 1 − ½ ∫ |q*(θ) − p(θ|y)| dθ, where p(θ|y) is estimated from the MCMC samples; scores near 1 indicate a high level of agreement between the two densities. Due to the large number of parameters in the model, we present only a subset of the average accuracies in Table 2. As with the MSE, context for this table is given in Figure 3. The accuracy of g is affected by the presence of outliers, attributable to the same bimodal MCMC samples that caused the very large MSEs appearing in Table 1.
Table 2.
| Accuracy | g5   | g20  | c1,1 | c1,10 | λ1,1 | λ1,10 |      |
|----------|------|------|------|-------|------|-------|------|
| I = 100  | 96.3 | 95.1 | 98.3 | 98.0  | 96.9 | 97.2  | 95.0 |
| I = 500  | 86.7 | 82.6 | 97.6 | 97.8  | 97.6 | 88.3  | 96.3 |
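The accuracy measure above can be computed by comparing each variational factor with a kernel density estimate built from the corresponding MCMC draws. The sketch below assumes a normal variational factor and uses illustrative grid and bandwidth defaults; it is not code from this article's supplement.

```r
## Accuracy of a normal variational factor q*(theta) = N(mu_q, s2_q) relative to the
## MCMC-sampled posterior, approximating 1 - 0.5 * integral |q - p|.
vb_accuracy <- function(mcmc_draws, mu_q, s2_q) {
  p_hat <- density(mcmc_draws, n = 1024)                # kernel estimate of the posterior
  q_hat <- dnorm(p_hat$x, mean = mu_q, sd = sqrt(s2_q)) # variational density on the same grid
  dx    <- diff(p_hat$x)[1]
  1 - 0.5 * sum(abs(q_hat - p_hat$y)) * dx
}
```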
Due to the substantial decrease in computation time using variational Bayes over MCMC methods, we are able to construct 95% bootstrap confidence intervals by sampling subjects with replacement and refitting model (4.1) using the variational approach. While credible intervals provided by MCMC or by a single variational fit are overly conservative, the bootstrap intervals are on average .46 times as wide and, averaged over the domain, have coverage probability 93.4% for I = 100 and 93.6% for I = 500. The far right panel of Figure 3 displays the coverage probabilities of the various credible and confidence intervals for I = 500.
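A sketch of this cluster bootstrap is given below; fit_vb_fun is a hypothetical placeholder for a routine that fits the variational approximation and returns the estimated coefficient function on the grid, and is not a function provided with this article. In the longitudinal setting, the same idea applies with subjects resampled together with all of their visits.

```r
## Nonparametric bootstrap of subjects with variational refitting; returns pointwise
## 95% percentile intervals for gamma(t) over the observation grid.
bootstrap_gamma <- function(Y, W, z, fit_vb_fun, B = 400) {
  I <- length(Y)
  est <- replicate(B, {
    idx <- sample(I, replace = TRUE)                   # resample subjects (clusters)
    fit_vb_fun(Y[idx], W[idx, , drop = FALSE], z[idx]) # refit and return gamma-hat on the grid
  })
  apply(est, 1, quantile, probs = c(0.025, 0.975))     # 2 x N matrix of pointwise limits
}
```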
As demonstrated in Table 2 and Figure 3, the variational approximation performs well in this functional regression setting, both in terms of low MSEs and in terms of agreement with the MCMC-sampled posterior density. This stems from the low posterior dependence between the parameters, consistent with the factorization assumed in the density transform approach. Additionally, the use of the bootstrap allows the construction of confidence intervals that are not overly conservative.
Importantly, even in this simulation the computational burden is greatly reduced through the use of variational approximations. For I = 100, the MCMC sampling took on average 315 seconds, while the approximation was computed in on average 0.04 seconds (Dual Core 3.06GHz Processor; 4 GB RAM; OS X 10.6.4). For I = 500, the respective times were 1614 and 0.2 seconds. Constructing the bootstrap confidence intervals, based on 400 bootstrap samples, took on average an additional 20 and 76 seconds for I = 100 and I = 500, respectively.
4.2. Longitudinal functional regression
Next, we generate samples from the model
(4.2)
We take I = 100 subjects with J = 3 visits per subject; the random effects b are generated as independent mean-zero normal random variables. Again, we generate zi ~ Unif [−5, 5], take β2 = 3, and select γ(t) = cos(2πt). The functional predictors are constructed as above; we take cij ~ N(0, diag(λ1, …, λKx)) so that the PC loadings are not correlated within subjects.
We fit model (2.8) for 100 simulated datasets. As in the cross-sectional case, we use chain lengths of 2500, with 1000 as burn-in, for the MCMC sampling. In Table 3 we display the average MSE of the estimated functional and scalar parameters and the subject-specific random effects. Again, the variational approximation performs as well as the MCMC sampling with a substantial difference in computation time: the MCMC sampling took on average 973 seconds, while the approximation was calculated in .2 seconds on average.
Table 3.
|                |      | γ(t) | β1    | β2    |
|----------------|------|------|-------|-------|
| I = 100, J = 3 | VB   | .026 | .0003 | .0002 |
|                | MCMC | .030 | .0002 | .0002 |
5. Application
In our scientific application, we analyze the association between measures of intracranial white matter and cognitive decline in multiple sclerosis patients. White matter is made up of myelinated axons, the long fibers used to transmit electrical signals in the brain, and is organized into bundles, or tracts. Major examples of white matter tracts are the corpus callosum, the corticospinal tracts, and the optic radiations. Here we focus on the corpus callosum, a collection of white-matter fibers which connects the two hemispheres of the brain.
Myelin, the fatty insulation surrounding white matter fibers, allows electrical signals to be propagated at high speeds along white matter tracts. Multiple sclerosis is a demyelinating autoimmune disease that causes lesions in the white matter. These lesions disrupt electrical signals and, over time, result in severe disability in affected patients. To measure cognitive disability, we use the Paced Auditory Serial Addition Test (PASAT), which assesses auditory processing speed and calculation ability. In this test, a proctor reads aloud a sequence of 60 numbers at three-second intervals, while the subject provides the sum of the two most recently spoken numbers. Scores range between 0 and 60, indicating the number of correct sums provided by the subject; a score of 60 indicates the highest level of cognitive ability [8].
To quantify white matter, we use diffusion tensor imaging, a magnetic resonance imaging modality that measures the diffusivity of water in the brain. Because white matter is organized in bundles, water tends to diffuse anisotropically along the tracts, which makes their reconstruction from MRI possible. By measuring diffusivity along several gradients, diffusion tensor imaging is able to produce detailed images of intracranial white matter [1, 2, 14, 17]. Moreover, continuous summaries of individual white matter tracts, parameterized by distance along the tract and called tract profiles, can be constructed from diffusion tensor images. Here we study the fractional anisotropy tract profile of the right corticospinal tract; this gives a measure of how anisotropic diffusion is along the tract.
Our study consists of 100 multiple sclerosis patients with between two and eight visits each; a total of 334 visits were observed. Study participants had ages between 21 and 71 years, and 63% were women. We fit model (2.8), using age and gender as non-functional covariates and the mean diffusivity tract profile of the corpus callosum as a functional predictor. We include subject-specific random intercepts to account for the repeated observations at the subject level. The model was fit using both the variational approximation and MCMC sampling; the results are shown in Figure 4.
Previous studies have linked damage in the corpus callosum to cognitive decline as measured by PASAT and other tests [15, 20]. However, these studies lacked the spatial information present in the functional treatment here, which proves to be important. From the estimated coefficient function and bootstrapped confidence intervals, we see that the region from roughly 0 to .2 is negatively associated with the PASAT outcome – that is, subjects with above-average mean diffusivity in these regions tend to have lower PASAT scores. A second region, from .65 to .8, is positively associated with the outcome. We base inference on the bootstrapped intervals due to the overly conservative coverage of the MCMC and variational Bayes credible intervals; however, there is broad agreement between all intervals regarding the location of regions of interest. Note the interpretation of the coefficient function is marginal, rather than conditional on a subject’s random intercept. The random intercepts are an important component of the model: a model including only random intercepts explains roughly 80% of the outcome variability, while adding functional and nonfunctional covariates raises this to 89%. Finally, age and gender were not found to be statistically significant, but were retained as scientifically important covariates. Their inclusion did not meaningfully affect the shape or significance of the functional predictor.
There is broad agreement between the variational Bayes and MCMC model fits: the point estimates of the coefficient and the random intercepts are very similar, and the credible intervals indicate the same regions of significance. On the other hand, the credible interval using MCMC is wider than that using variational Bayes. As noted above, the variational method can result in narrower confidence intervals if the approximating density is less diffuse than the MCMC-sampled posterior, which appears to be the case here. In this application, we posit that the lesser importance of the functional predictors in comparison to the random intercepts leads to increased posterior variability in the estimated functional coefficient. Indeed, when we fit a model without the subject-specific random intercepts, the confidence intervals from the Bayesian and variational Bayes approximations became indistinguishable.
Also shown in Figure 4 as a grey band is the 95% bootstrap confidence interval, constructed by nonparametrically resampling subjects and fitting the longitudinal functional regression model using variational Bayes. Inference for the coefficient function is largely unchanged based on the bootstrap interval except in the region from .2 to .4, which does not appear to be significantly associated with the outcome. Although the credible intervals using variational Bayes are likely too narrow, the computational gain and accurate point estimates provided by this method allow for the construction of bootstrap confidence intervals, which performed much better in our simulations.
6. Discussion
The variational Bayes approach to functional regression was motivated by a pressing need for computationally feasible Bayesian inference in a large-data setting. We have developed iterative algorithms for approximate inference in both the cross-sectional and longitudinal regression settings, and analyzed a longitudinal neuroimaging study. The methods developed: 1) flexibly estimate all parameters in the cross-sectional and longitudinal functional regression models; 2) accurately approximate the posterior distributions of all model parameters; 3) retain the advantages of Bayesian inference, including the ability to jointly model the functional predictors and scalar outcomes and easily constructed credible bands; 4) require orders of magnitude less computational effort than MCMC techniques; and 5) allow the construction of nonparametric bootstrap confidence intervals, which seem to have good coverage probabilities.
A few limitations of the variational Bayes method are apparent. While our simulations indicate that variational techniques can be used with confidence in the functional regression setting, the approximation cannot be made more accurate by increasing computation time. Additionally, the iterative algorithms are based on involved algebraic derivations; those needed for functional regression have been carried out here, but additional work may be needed to adapt these algorithms to specific scientific settings. Lastly, the performance of credible intervals approximated using variational Bayes may not be satisfactory if the posterior distribution is multimodal or more diffuse than the approximating distribution, although the use of the nonparametric bootstrap can alleviate this issue.
Future work may proceed in several directions. The adaptation of the approach to non-Gaussian outcomes will expand the class of applications in which variational Bayes may be used for functional regression. Very large gains in computation time may be found in functional magnetic resonance imaging or other studies where the predictors are sampled at thousands or tens of thousands of points. More generally, variational Bayes has potential applications in several functional data analysis topics, including function-on-function regression and the decomposition of populations of functions.
Appendix A: Derivations
In this appendix we derive the optimal densities q* for approximate Bayesian inference in the longitudinal functional regression model. For the cross-sectional case, one may omit the random effects b. We recall that, given a partition {θ1, …, θL} of the parameter space θ, the explicit solution for q(θl), 1 ≤ l ≤ L, has the form
ql*(θl) ∝ exp{E−θl log p(θl | rest)}, 1 ≤ l ≤ L, (A.1)
where rest ≡ {y, θ1, …, θl−1, θl+1, …, θL} is the collection of all remaining parameters and the observed data.
A.1. Optimal densities for g and σ²g
Recall the prior specification for g and σ²g given in Section 2.2. According to (A.1), the optimal densities are given by
The full conditional distribution p(g|rest) appearing above is given by
Therefore the optimal density q*(g) is
where
(A.2)
Thus the optimal density q*(g) is N(μq(g), Σq(g)).
Further, the full conditional is given by
so that the optimal density is
Thus is where
(A.3)
Note that, when , the term appearing in (A.2) is equal to .
A.2. Optimal densities for b and the random effects variance
Recall the prior specifications for the random effects b and the associated variance component.
The full conditional distribution p(b|rest) is given by
Therefore, by (A.1), the optimal density q*(b) is
After taking the expectation above, the optimal density q*(b) is N(μq(b), Σq(b)) where
(A.4)
Further, the full conditional is given by
so that the optimal density is
Thus is where
(A.5)
Note that, when , the term appearing in (A.4) is equal to .
A.3. Optimal densities for C and λj
Recall that cij ~ N (0, Λ), where Λ = diag(λ1, …, λKx) and λk ~ IG (Aλ, Bλ) for 1 ≤ k ≤ Kx. In the following, we continue to use cij as the PC loadings for subject i at visit j and C as the matrix constructed by row-stacking the cij. We additionally use μq(c),ij as the expected value of cij with respect to the q(C) distribution and μq(C) as the matrix constructed by row-stacking the μq(c),ij. Finally, let
The full conditional distribution p(C|rest) is given by
Therefore, by (A.1), the optimal density q*(C) is
where
(A.6)
Thus the optimal density q*(C) is a product of Normally distributed random vectors sharing a common covariance matrix and with means the rows of μq(C).
In the derivation of the optimal density q*(λk), 1 ≤ k ≤ Kx, we let Ck denote the kth column of C and denote the kth column of μq(C). Further, we let (Σq(C))kk denote the (k, k)th element of Σq(C). The full conditional p(λk|rest) is given by
so that the optimal density q*(λk) is
Thus q*(λk) is IG(Aλ + (nJ)/2, Bq(λk)) where
(A.7)
Note that, when q(λk) = q*(λk), the term (k, k)th entry of appearing in (A.6) is equal to .
A.4. Optimal density for β
Recall the diffuse prior placed on β.
The full conditional distribution p(β|rest) is given by
Therefore, by (A.1), the optimal density q*(β) is
After taking the expectation above, the optimal density q*(β) is N(μq(β), Σq(β)) where
(A.8)
A.5. Optimal density for the measurement error variance
Recall that the functional predictors are observed over a grid of length N. The full conditional is given by
so that the optimal density is
Thus is where
(A.9)
Note that, when , the term appearing in (A.6) is equal to .
A.6. Optimal density for the outcome error variance
Finally, the full conditional is given by
so that the optimal density is
Next, we see that
(A.10)
Thus is where
(A.11)
Note that, when , the term appearing regularly above is equal to .
Appendix B: Expression for p(Y, W; q)
In this appendix we derive an expression for the lower bound of the log likelihood. This quantity is used to monitor convergence in Algorithm 1, and its derivation takes advantage of the order of updates in the algorithm to simplify the expression.
We have that log p(Y, W; q) = Eq{log p(Y, W, θ)} − Eq{log q(θ)}. Now,
(B.1)
The first term appearing in (B.1) is
The second term is
Next, we have
The fourth term is given by
Further, we have
The sixth term in (B.1) is
Next,
Additionally, the eighth term in (B.1) is
Next, we have
The tenth term is
Finally, for 1 ≤ k ≤ Kx
We combine the above factors noting that many terms cancel. For example, the terms and appear in and respectively. Moreover, we can make substitutions for terms appearing in the updates given in Algorithm 1 and again simplify the expression. An example is to combine and and substitute for ; this term cancels with another appearing in . Thus we have
Using const. to represent an additive constant that is not affected by updates in Algorithm 1, we have
Contributor Information
Jeff Goldsmith, Email: jgoldsmi@jhsph.edu, Johns Hopkins Bloomberg School of Public Health Department of Biostatistics 615 North Wolfe Street Baltimore, Maryland 21205, USA.
Matt P. Wand, Email: Matt.Wand@uts.edu.au, School of Mathematical Sciences University of Technology, Sydney P.O. Box 123 Broadway, 2007, Australia.
Ciprian Crainiceanu, Email: ccrainic@jhsph.edu, Johns Hopkins Bloomberg School of Public Health Department of Biostatistics 615 North Wolfe Street Baltimore, Maryland 21205, USA.
References
1. Basser P, Mattiello J, LeBihan D. MR Diffusion Tensor Spectroscopy and Imaging. Biophysical Journal. 1994;66:259–267. doi:10.1016/S0006-3495(94)80775-1.
2. Basser P, Pajevic S, Pierpaoli C, Duda J. In vivo fiber tractography using DT-MRI data. Magnetic Resonance in Medicine. 2000;44:625–632. doi:10.1002/1522-2594(200010)44:4<625::aid-mrm17>3.0.co;2-o.
3. Bishop CM. Pattern Recognition and Machine Learning. New York: Springer; 2006.
4. Cardot H, Ferraty F, Sarda P. Functional Linear Model. Statistics and Probability Letters. 1999;45:11–22.
5. Crainiceanu CM, Goldsmith J. Bayesian Functional Data Analysis using WinBUGS. Journal of Statistical Software. 2010;32:1–33.
6. Goldsmith J, Bobb J, Crainiceanu CM, Caffo B, Reich D. Penalized Functional Regression. Journal of Computational and Graphical Statistics. To appear. doi:10.1198/jcgs.2010.10007.
7. Goldsmith J, Crainiceanu CM, Caffo B, Reich D. A Case Study of Longitudinal Association Between Disability and Neuronal Tract Measurements. Under review; 2011.
8. Gronwall DMA. Paced auditory serial-addition task: A measure of recovery from concussion. Perceptual and Motor Skills. 1977;44:367–373. doi:10.2466/pms.1977.44.2.367.
9. James GM, Wang J, Zhu J. Functional Linear Regression That's Interpretable. Annals of Statistics. 2009;37:2083–2108.
10. Jordan MI. Graphical models. Statistical Science. 2004;19:140–155.
11. Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK. An Introduction to Variational Methods for Graphical Models. Machine Learning. 1999;37:183–233.
12. Kullback S, Leibler RA. On Information and Sufficiency. The Annals of Mathematical Statistics. 1951;22:79–86.
13. Lang S, Brezger A. Bayesian P-splines. Journal of Computational and Graphical Statistics. 2004;13:183–212.
14. LeBihan D, Mangin J, Poupon C, Clark C. Diffusion Tensor Imaging: Concepts and Applications. Journal of Magnetic Resonance Imaging. 2001;13:534–546. doi:10.1002/jmri.1076.
15. Lin X, Tench CR, Morgan PS, Constantinescu CS. Use of combined conventional and quantitative MRI to quantify pathology related to cognitive impairment in multiple sclerosis. Journal of Neurology, Neurosurgery, and Psychiatry. 2008;237:437–441. doi:10.1136/jnnp.2006.112177.
16. McGrory CA, Titterington DM, Reeves R, Pettitt AN. Variational Bayes for Estimating the Parameters of a Hidden Potts Model. Statistics and Computing. 2009;19:329–340.
17. Mori S, Barker P. Diffusion magnetic resonance imaging: its principle and applications. The Anatomical Record. 1999;257:102–109. doi:10.1002/(SICI)1097-0185(19990615)257:3<102::AID-AR7>3.0.CO;2-6.
18. Müller H-G, Stadtmüller U. Generalized functional linear models. Annals of Statistics. 2005;33:774–805.
19. Ormerod J, Wand MP. Explaining Variational Approximations. The American Statistician. 2010;64:140–153.
20. Ozturk A, Smith S, Gordon-Lipkin E, Harrison D, Shiee N, Pham D, Caffo B, Calabresi P, Reich D. MRI of the corpus callosum in multiple sclerosis: association with disability. Multiple Sclerosis. 2010;16:166–177. doi:10.1177/1352458509353649.
21. Ramsay JO, Silverman BW. Functional Data Analysis. New York: Springer; 2005.
22. Reiss P, Ogden R. Functional Principal Component Regression and Functional Partial Least Squares. Journal of the American Statistical Association. 2007;102:984–996.
23. Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge: Cambridge University Press; 2003.
24. Teschendorff AE, Wang Y, Barbosa-Morais NL, Brenton JD, Caldas C. A Variational Bayesian Mixture Modeling Framework for Cluster Analysis of Gene-Expression Data. Bioinformatics. 2005;21:3025–3033. doi:10.1093/bioinformatics/bti466.
25. Titterington DM. Bayesian Methods for Neural Networks and Related Models. Statistical Science. 2004;19:128–139.