Author manuscript; available in PMC: 2024 Mar 11.
Published in final edited form as: Stat Interface. 2024 Feb 1;17(2):199–217. doi: 10.4310/23-sii786

Bayesian tensor-on-tensor regression with efficient computation

Kunbo Wang 1, Yanxun Xu 2,*
PMCID: PMC10927259  NIHMSID: NIHMS1968746  PMID: 38469276

Abstract

We propose a Bayesian tensor-on-tensor regression approach to predict a multidimensional array (tensor) of arbitrary dimensions from another tensor of arbitrary dimensions, building upon the Tucker decomposition of the regression coefficient tensor. Traditional tensor regression methods making use of the Tucker decomposition either assume the dimension of the core tensor to be known or estimate it via cross-validation or some model selection criteria. However, no existing method can simultaneously estimate the model dimension (the dimension of the core tensor) and other model parameters. To fill this gap, we develop an efficient Markov Chain Monte Carlo (MCMC) algorithm to estimate both the model dimension and parameters for posterior inference. Besides the MCMC sampler, we also develop an ultra-fast optimization-based computing algorithm wherein the maximum a posteriori estimators for parameters are computed, and the model dimension is optimized via a simulated annealing algorithm. The proposed Bayesian framework provides a natural way for uncertainty quantification. Through extensive simulation studies, we evaluate the proposed Bayesian tensor-on-tensor regression model and show its superior performance compared to alternative methods. We also demonstrate its practical effectiveness by applying it to two real-world datasets, including facial imaging data and 3D motion data.

Keywords: Fractional Bayes Factor, Markov chain Monte Carlo, Tensor-on-tensor regression, Tucker decomposition

1. INTRODUCTION

Multi-dimensional arrays, also called tensors, are widely used to represent data with complex structures in different fields such as genomics, neuroscience, computer vision, and graph analysis. For example, a multi-tissue experiment (Wang et al., 2019) collects gene expression data in different tissues from different individuals, leading to three-dimensional arrays (Genes × Tissues × Individuals). Other notable examples include magnetic resonance imaging data (MRI, three-dimensional arrays), functional MRI (fMRI) data (four-dimensional arrays), and facial images (four-dimensional arrays) (Vasilescu and Terzopoulos, 2002; Hasan et al., 2011; Guhaniyogi and Spencer, 2021). In this paper, we focus on the task of tensor-on-tensor regression that predicts one multi-dimensional tensor from another multi-dimensional tensor, e.g., predicting gene expression across multiple tissues for multiple individuals from their clinical/omics data with tensor structures.

One simple approach to tensor-on-tensor regression is to flatten tensors into vectors and then apply classic regression methods. However, such a treatment produces high-dimensional unstructured vectors and destroys the correlation structure of the data, resulting in a huge number of parameters to be estimated and a potentially significant loss of information. For example, to predict a response tensor of dimensions $N\times Q_1\times Q_2$ from a predictor tensor of dimensions $N\times P_1\times P_2$, the classic linear regression method requires estimating $P_1\times P_2\times Q_1\times Q_2$ parameters, which may cause overfitting or computational issues, especially when the number of parameters is larger than the sample size $N$.

To reduce the number of free parameters while preserving the correlation structure in modeling tensor data, tensor decomposition techniques have been widely applied (Kolda and Bader, 2009). The two most commonly used tensor decomposition methods are the PARAFAC/CANDECOMP (CP) decomposition (Harshman, 1970) and the Tucker decomposition (Tucker, 1966). The CP decomposition reconstructs a tensor as a linear combination of rank-1 tensors, each of which is represented as the outer product of a number of vectors. The Tucker decomposition, on the other hand, factorizes a tensor into a small core tensor and a set of factor matrices, one along each mode. Both decomposition methods are able to reduce the model dimensionality to a manageable size and make parameter estimation more efficient. Compared to the CP decomposition, the Tucker decomposition allows a more flexible correlation structure, captured by the core tensor, and the freedom to choose different ranks along different modes, making it useful for data with skewed dimensions (Li et al., 2013). In fact, the CP decomposition is a special case of the Tucker decomposition in which the core tensor is superdiagonal.

There is a rich literature on regression methods treating tensors as either predictors or responses in both frequentist and Bayesian statistics. Guo et al. (2012) and Zhou et al. (2013) proposed tensor regression models to predict scalar outcomes from tensor predictors by assuming that the coefficient tensor has a low-rank CP decomposition. Li et al. (2013) later extended the framework by employing the Tucker decomposition for the coefficient tensor, and demonstrated that the Tucker decomposition is better suited to tensor predictors with skewed dimensions and achieves better accuracy in neuroimaging data analysis. Guhaniyogi et al. (2017) proposed a Bayesian approach to regression with a scalar response on tensor predictors by developing a multiway Dirichlet generalized double Pareto prior on the tensor margins after applying the CP decomposition to the coefficient tensor. Miranda et al. (2018) developed a Bayesian tensor partition regression model using a generalized linear model with a sparsity-inducing normal mixture prior to learn the relationship between a matrix response (clinical outcomes) and a tensor predictor (imaging data). Li and Zhang (2017) proposed a parsimonious regression model with a tensor response and vector predictors, adopting a generalized sparsity principle based on the Tucker decomposition. To detect neuronal activation in fMRI experiments, Guhaniyogi and Spencer (2021) developed a Bayesian regression approach with a tensor response on scalar predictors by introducing a novel multiway stick-breaking shrinkage prior distribution on the tensor-structured regression coefficients.

There exist many scientific applications that require methods for predicting a tensor response from another tensor predictor. One typical example in fMRI studies is detecting brain regions activated by an external stimulus or condition (Zhang et al., 2015). Hoff (2015) proposed a tensor-on-tensor bilinear regression framework, making use of the Tucker decomposition, to handle the special case where the tensor predictor has the same dimensions as the tensor response. Billio et al. (2018) introduced a Bayesian tensor autoregressive model to tackle tensor-on-tensor regression, using the CP decomposition to provide a parsimonious parametrization. Lock (2018) proposed to predict a tensor response from another tensor predictor by assuming that the coefficient tensor has a low-rank CP factorization. Gahrooei et al. (2021) extended the work of Lock (2018) to allow multiple tensor inputs under the Tucker decomposition framework.

Despite advances in methods for dealing with tensor data, the aforementioned approaches have some limitations. First, tensor-on-tensor regression methods based on the CP decomposition, e.g., Lock (2018), require the response tensor and the predictor tensor to share the same rank in the CP decomposition, making them restrictive when the response and predictor tensors have different ranks. Second, the rank in the CP decomposition and the dimension of the core tensor in the Tucker decomposition (i.e., the model dimension) are essential for statistical inference in tensor-on-tensor regression models. However, they are either assumed known or estimated via cross-validation (Gahrooei et al., 2021) or some model selection criterion, such as the Bayesian information criterion (Guhaniyogi and Spencer, 2021). To the best of our knowledge, there is no existing method that can simultaneously estimate the model dimension and parameters.

In this paper, we develop a novel Bayesian approach for tensor-on-tensor regression based on Tucker decomposition of the coefficient tensor. The main contributions of this work are threefold. First, our Bayesian framework is built upon the flexible Tucker decomposition so that the response tensor and the predictor tensor can have different dimensions in the core tensor. Second, we propose an efficient Markov chain Monte Carlo (MCMC) algorithm to simultaneously estimate the model dimension (the dimension of the core tensor) and parameters. The resulting posterior inference naturally offers us characterization of uncertainty in parameter estimation and prediction. Third, as an alternative to MCMC, we develop an ultra-fast computing algorithm, in which the maximum a posteriori (MAP) estimators for parameters are computed and meanwhile the dimension of the core tensor is optimized via a simulated annealing (SA) algorithm (Kirkpatrick et al., 1983).

The rest of the article is organized as follows. We begin by introducing some preliminaries in Section 2. Section 3 describes the proposed Bayesian tensor-on-tensor regression model. We develop an efficient MCMC algorithm to simultaneously estimate the model dimension and parameters in Section 4. An optimization-based, ultra-fast computational algorithm for inference is described in Section 5. Section 6 evaluates the proposed approach via simulation studies and comparisons to alternative methods. Section 7 provides real data analyses on facial imaging data and 3D motion data. Section 8 concludes with a discussion.

2. PRELIMINARIES

2.1. Notations

We begin by introducing the notation and operations used throughout the paper. We use uppercase blackboard-bold characters (e.g., $\mathbb{X}$) to denote tensors, bold uppercase characters (e.g., $\mathbf{X}$) to denote matrices, and bold lowercase characters (e.g., $\mathbf{a}$) to denote vectors. The order of a tensor is its number of dimensions. For example, $\mathbb{X}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ denotes an $N$th-order tensor, where $I_n$ denotes the dimension of the $n$th mode, $n=1,\dots,N$. The $i$th entry of a vector $\mathbf{a}$ is denoted by $a_i$; the $(i,j)$ element of a matrix $\mathbf{X}$ is denoted by $X_{ij}$; and the entries of a tensor are indexed by square brackets: $\mathbb{X}[i_1,\dots,i_N]$, where $i_n\in\{1,\dots,I_n\}$ for $n\in\{1,\dots,N\}$. The $n$th element in a sequence of matrices or vectors is denoted by a subscript in parentheses. For example, $\mathbf{X}_{(n)}$ denotes the $n$th matrix in a sequence of matrices, and $\mathbf{x}_{(n)}$ denotes the $n$th vector in a sequence of vectors.

The vectorization of a tensor $\mathbb{X}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ transforms an $N$th-order tensor into a column vector $\mathrm{vec}(\mathbb{X})$ such that the entry $\mathbb{X}[i_1,\dots,i_N]$ maps to the $j$th entry of $\mathrm{vec}(\mathbb{X})$, that is,

$$\mathbb{X}[i_1,\dots,i_N] = \mathrm{vec}(\mathbb{X})[j], \qquad (1)$$

where $j = 1 + \sum_{k=1}^{N}(i_k-1)\prod_{l=1}^{k-1} I_l$. Similarly, $\mathrm{vec}(\mathbf{X})$ denotes the vectorization of a matrix $\mathbf{X}\in\mathbb{R}^{I_1\times I_2}$, corresponding to $N=2$ in (1). Matricization, also known as unfolding, is the process of transforming a tensor into a matrix. The mode-$n$ matricization of a tensor $\mathbb{X}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ is denoted by $\mathbf{X}_{(n)}\in\mathbb{R}^{I_n\times J}$, where $J=\prod_{k\neq n} I_k$. The entry $\mathbb{X}[i_1,\dots,i_N]$ of $\mathbb{X}$ maps to the $(i_n,j)$ element of the resulting matrix $\mathbf{X}_{(n)}$, where

$$j = 1 + \sum_{\substack{k=1\\ k\neq n}}^{N}(i_k-1)J_k \quad\text{with}\quad J_k=\prod_{\substack{l=1\\ l\neq n}}^{k-1} I_l.$$

A more general form of tensor matricization is defined as follows. Let $\mathcal{R}=\{r_1,\dots,r_L\}$ and $\mathcal{C}=\{c_1,\dots,c_M\}$ be two sets of indices such that $\mathcal{R}\cup\mathcal{C}=\{1,\dots,N\}$ and $\mathcal{R}\cap\mathcal{C}=\emptyset$. Then the matricized tensor can be specified by $\mathbf{X}_{(\mathcal{R}\times\mathcal{C})}\in\mathbb{R}^{J\times K}$, where $J=\prod_{n\in\mathcal{R}} I_n$ and $K=\prod_{n\in\mathcal{C}} I_n$. The entry $\mathbb{X}[i_1,\dots,i_N]$ maps to the $(j,k)$ element of the matrix $\mathbf{X}_{(\mathcal{R}\times\mathcal{C})}$, that is,

$$\mathbb{X}[i_1,\dots,i_N] = \mathbf{X}_{(\mathcal{R}\times\mathcal{C})}[j,k], \qquad (2)$$

where

$$j = 1 + \sum_{l=1}^{L}(i_{r_l}-1)\prod_{l'=1}^{l-1} I_{r_{l'}},$$

and

$$k = 1 + \sum_{m=1}^{M}(i_{c_m}-1)\prod_{m'=1}^{m-1} I_{c_{m'}}.$$
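To make the index maps in (1) and (2) concrete, here is a minimal base R sketch (an illustration, not part of the authors' released code). Base R stores arrays in column-major order, which matches the index map $j$ in (1), so vectorization is just `c()` and the general matricization reduces to a mode permutation followed by a reshape.

```r
# Minimal base R sketch of vectorization (1) and general matricization (2).
set.seed(1)
X <- array(rnorm(3 * 4 * 2), dim = c(3, 4, 2))        # a 3 x 4 x 2 tensor

vec_tensor <- function(X) c(X)                         # vec(X), ordering as in (1)

mat_tensor <- function(X, row_modes, col_modes) {
  # General matricization X_(R x C): rows indexed by the modes in row_modes,
  # columns by the modes in col_modes, both in the stated order.
  dims <- dim(X)
  Xp <- aperm(X, c(row_modes, col_modes))              # put the row modes first
  matrix(Xp, nrow = prod(dims[row_modes]), ncol = prod(dims[col_modes]))
}

# Mode-1 matricization as the special case R = {1}, C = {2, 3}
X1 <- mat_tensor(X, row_modes = 1, col_modes = c(2, 3))
# Index map check: X[2, 3, 1] maps to X1[2, j] with j = 1 + (3 - 1) * 1 + (1 - 1) * 4 = 3
all.equal(X[2, 3, 1], X1[2, 3])
```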

The Kronecker product of matrices $\mathbf{U}\in\mathbb{R}^{I\times J}$ and $\mathbf{V}\in\mathbb{R}^{K\times L}$ is denoted by $\mathbf{U}\otimes\mathbf{V}$, with the detailed definition and properties given in Appendix A.1. The product of a tensor and a matrix in mode $n$ is defined as the $n$-mode product. The $n$-mode product of $\mathbb{X}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ with a matrix $\mathbf{U}\in\mathbb{R}^{J\times I_n}$ is denoted by $\mathbb{X}\times_n\mathbf{U}$, resulting in a new tensor $\mathbb{Y}\in\mathbb{R}^{I_1\times\cdots\times I_{n-1}\times J\times I_{n+1}\times\cdots\times I_N}$ whose $(i_1,\dots,i_{n-1},j,i_{n+1},\dots,i_N)$ entry is defined by

$$\mathbb{Y}[i_1,\dots,i_{n-1},j,i_{n+1},\dots,i_N] = \sum_{i_n=1}^{I_n}\mathbb{X}[i_1,\dots,i_N]\,U_{j i_n}.$$

An important fact regarding the $n$-mode product is that, given matrices $\mathbf{U}\in\mathbb{R}^{J_1\times I_n}$, $\mathbf{V}\in\mathbb{R}^{J_2\times I_m}$ with $m\neq n$, and a tensor $\mathbb{X}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$,

$$\mathbb{X}\times_n\mathbf{U}\times_m\mathbf{V} = (\mathbb{X}\times_n\mathbf{U})\times_m\mathbf{V} = (\mathbb{X}\times_m\mathbf{V})\times_n\mathbf{U}.$$
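The $n$-mode product can be computed by unfolding along mode $n$, multiplying, and folding back; below is a small base R sketch under these conventions (an illustration, not the authors' implementation).

```r
# Sketch of the n-mode product X x_n U via unfold -> multiply -> fold (base R).
ttm <- function(X, U, n) {
  dims <- dim(X)
  perm <- c(n, seq_along(dims)[-n])                    # bring mode n to the front
  Xn   <- matrix(aperm(X, perm), nrow = dims[n])       # mode-n unfolding
  Yn   <- U %*% Xn                                     # multiply along mode n
  Y    <- array(Yn, dim = c(nrow(U), dims[-n]))        # fold back ...
  aperm(Y, order(perm))                                # ... and undo the permutation
}

set.seed(1)
X <- array(rnorm(4 * 3 * 2), dim = c(4, 3, 2))
U <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)          # U acts on mode 2 (size 3 -> 5)
Y <- ttm(X, U, n = 2)                                  # Y is 4 x 5 x 2
# Property: the order of n-mode products along distinct modes does not matter
V <- matrix(rnorm(2 * 4), nrow = 2, ncol = 4)
all.equal(ttm(ttm(X, U, 2), V, 1), ttm(ttm(X, V, 1), U, 2))
```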

For two tensors $\mathbb{X}\in\mathbb{R}^{I_1\times\cdots\times I_N\times P_1\times\cdots\times P_L}$ and $\mathbb{Y}\in\mathbb{R}^{P_1\times\cdots\times P_L\times J_1\times\cdots\times J_M}$, the contracted tensor product $\langle\mathbb{X},\mathbb{Y}\rangle_L$ is defined as

$$\mathbb{Z} = \langle\mathbb{X},\mathbb{Y}\rangle_L \in \mathbb{R}^{I_1\times\cdots\times I_N\times J_1\times\cdots\times J_M}$$

with

$$\mathbb{Z}[i_1,\dots,i_N,j_1,\dots,j_M] = \sum_{p_1=1}^{P_1}\cdots\sum_{p_L=1}^{P_L}\mathbb{X}[i_1,\dots,i_N,p_1,\dots,p_L]\,\mathbb{Y}[p_1,\dots,p_L,j_1,\dots,j_M].$$

It can be shown that for two matrices $\mathbf{U}\in\mathbb{R}^{I\times P}$ and $\mathbf{V}\in\mathbb{R}^{P\times J}$, the contracted product $\langle\mathbf{U},\mathbf{V}\rangle_1$ is equivalent to the standard matrix product $\mathbf{U}\mathbf{V}$. Therefore, the contracted product of two tensors can be regarded as an extension of the usual matrix product to higher-order operands.
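Since the contraction runs over the last $L$ modes of $\mathbb{X}$ and the first $L$ modes of $\mathbb{Y}$, the contracted product can be computed as an ordinary matrix product between the corresponding matricizations; a hedged base R sketch:

```r
# Sketch of the contracted tensor product <X, Y>_L: contract the last L modes of X
# against the first L modes of Y by matricizing both and multiplying.
cprod <- function(X, Y, L) {
  dx <- dim(X); dy <- dim(Y)
  nx <- length(dx) - L                                 # modes of X that survive
  Xmat <- matrix(X, nrow = prod(dx[seq_len(nx)]))      # (I1...IN) x (P1...PL)
  Ymat <- matrix(Y, nrow = prod(dy[seq_len(L)]))       # (P1...PL) x (J1...JM)
  array(Xmat %*% Ymat, dim = c(dx[seq_len(nx)], dy[-seq_len(L)]))
}

# For matrices, <U, V>_1 reduces to the usual matrix product UV
U <- matrix(rnorm(6), 2, 3); V <- matrix(rnorm(12), 3, 4)
all.equal(cprod(U, V, 1), U %*% V)
```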

2.2. Tucker decomposition

The proposed Bayesian tensor-on-tensor model is built upon the Tucker decomposition (Tucker, 1966), which decomposes a tensor $\mathbb{B}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ into a core tensor $\mathbb{G}$ and a set of factor matrices $\mathbf{A}_{(n)}$, $n=1,\dots,N$, denoted by

$$\mathbb{B} = [\![\mathbb{G}; \mathbf{A}_{(1)}, \mathbf{A}_{(2)}, \dots, \mathbf{A}_{(N)}]\!],$$

or equivalently,

$$\mathbb{B} = \mathbb{G}\times_1\mathbf{A}_{(1)}\times_2\mathbf{A}_{(2)}\cdots\times_N\mathbf{A}_{(N)},$$

with $\mathbb{G}\in\mathbb{R}^{J_1\times\cdots\times J_N}$ being the core tensor and $\mathbf{A}_{(n)}\in\mathbb{R}^{I_n\times J_n}$ being the factor matrix in mode $n$, for $n=1,\dots,N$. The order of $\mathbb{G}$ can be the same as the order of $\mathbb{B}$, but more often we are interested in compressing the information of $\mathbb{B}$ into a core tensor $\mathbb{G}$ of smaller size. The CP decomposition (Harshman, 1970) is a special case of the Tucker decomposition in which the core tensor is superdiagonal.
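For illustration, a Tucker reconstruction for a third-order tensor can be computed through the matricized identity $\mathbf{B}_{(1)}=\mathbf{A}_{(1)}\mathbf{G}_{(1)}(\mathbf{A}_{(3)}\otimes\mathbf{A}_{(2)})^T$ (equation (22) in Appendix A.1); the base R sketch below uses arbitrary illustrative dimensions.

```r
# Sketch of a Tucker reconstruction B = [[G; A1, A2, A3]] for a third-order tensor,
# using the matricized identity B_(1) = A1 %*% G_(1) %*% t(A3 kron A2).
set.seed(1)
J <- c(2, 3, 2); I <- c(5, 6, 4)                       # core and full dimensions (illustrative)
G  <- array(rnorm(prod(J)), dim = J)                   # core tensor
A1 <- matrix(rnorm(I[1] * J[1]), I[1], J[1])
A2 <- matrix(rnorm(I[2] * J[2]), I[2], J[2])
A3 <- matrix(rnorm(I[3] * J[3]), I[3], J[3])

G1 <- matrix(G, nrow = J[1])                           # mode-1 unfolding of the core
B1 <- A1 %*% G1 %*% t(kronecker(A3, A2))               # mode-1 unfolding of B
B  <- array(B1, dim = I)                               # fold back to a 5 x 6 x 4 tensor
dim(B)
```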

3. A BAYESIAN TENSOR-ON-TENSOR REGRESSION MODEL

Our task is to predict a tensor response $\mathbb{Y}\in\mathbb{R}^{N\times Q_1\times\cdots\times Q_M}$ from a tensor predictor $\mathbb{X}\in\mathbb{R}^{N\times P_1\times\cdots\times P_L}$. We propose a Bayesian tensor-on-tensor regression framework by extending the standard multivariate linear regression model from matrices to tensors:

$$\mathbb{Y} = \langle\mathbb{X},\mathbb{B}\rangle_L + \mathbb{E}, \qquad (3)$$

where $\mathbb{B}\in\mathbb{R}^{P_1\times\cdots\times P_L\times Q_1\times\cdots\times Q_M}$ denotes the coefficient tensor. In this paper, we assume that each element of $\mathbb{E}\in\mathbb{R}^{N\times Q_1\times\cdots\times Q_M}$ independently follows $N(0,\sigma^2)$ for simplicity of illustration. More flexible covariance structures are discussed in Appendix A.6. The first $L$ modes of $\mathbb{B}$ contract the dimensions of $\mathbb{X}$, and the last $M$ modes of $\mathbb{B}$ match the modes of $\mathbb{Y}$. For each of the $N$ observations, there are a total of $\prod_{m=1}^M Q_m$ responses and $\prod_{l=1}^L P_l$ predictors. The model can be reformulated in matrix form as

$$\mathbf{Y}_{(1)} = \mathbf{X}_{(1)}\mathbf{B}_{(\mathcal{P}\times\mathcal{Q})} + \mathbf{E}_{(1)}, \qquad (4)$$

where $\mathcal{P}=\{1,\dots,L\}$, $\mathcal{Q}=\{L+1,\dots,L+M\}$, and each row of $\mathbf{E}_{(1)}$ independently follows $N(\mathbf{0},\sigma^2\mathbf{I}_d)$ with $\mathbf{I}_d$ being an identity matrix of dimension $\prod_{m=1}^M Q_m$. From the equivalence of (3) and (4), it is clear that model (3) specifies linear relations between responses and predictors for each observation.

If $\mathbb{B}$ is unconstrained, the estimation of $\mathbb{B}$ can be obtained in the frequentist framework by conducting separate ordinary least squares (OLS) regressions of each of the $\prod_{m=1}^M Q_m$ responses in $\mathbf{Y}_{(1)}$ on $\mathbf{X}_{(1)}$ using equation (4). However, the solution is not well-defined if the number of observations $N$ is less than the number of predictors $\prod_{l=1}^L P_l$. Even if the solution is well-defined, separate OLS regressions based on equation (4) do not account for the correlation structure within the response $\mathbb{Y}$, within the predictor $\mathbb{X}$, or between them. Moreover, the total number of parameters in this case is $\prod_{l=1}^L P_l\prod_{m=1}^M Q_m$, which can be computationally challenging due to its gigantic size.

In our model, we assume that $\mathbb{B}$ admits the Tucker decomposition

$$\mathbb{B} = [\![\mathbb{G}; \mathbf{U}_{(1)},\dots,\mathbf{U}_{(L)}, \mathbf{V}_{(1)},\dots,\mathbf{V}_{(M)}]\!], \qquad (5)$$

where $\mathbb{G}$ denotes the core tensor of dimensions $R_1\times\cdots\times R_L\times S_1\times\cdots\times S_M$, $\mathbf{U}_{(l)}\in\mathbb{R}^{P_l\times R_l}$ denotes the factor matrix corresponding to mode $l$ of $\mathbb{B}$ for $l=1,\dots,L$, and $\mathbf{V}_{(m)}\in\mathbb{R}^{Q_m\times S_m}$ denotes the factor matrix corresponding to mode $L+m$ of $\mathbb{B}$ for $m=1,\dots,M$.

We complete the proposed Bayesian tensor-on-tensor regression model by assigning priors to the core tensor $\mathbb{G}$, the factor matrices $\{\mathbf{U}_{(l)}\}_{l=1}^L$ and $\{\mathbf{V}_{(m)}\}_{m=1}^M$, and $\sigma^2$. For the core tensor $\mathbb{G}$, we consider a normal prior, $\mathrm{vec}(\mathbb{G})\sim N(\boldsymbol{\mu}_G,\boldsymbol{\Sigma}_G)$, where $\boldsymbol{\Sigma}_G$ is diagonal. For the factor matrices, taking $\mathbf{U}_{(l)}$ and $\mathbf{V}_{(m)}$ as examples, we assign normal priors $\mathrm{vec}(\mathbf{U}_{(l)})\sim N(\boldsymbol{\mu}_{U_l},\boldsymbol{\Sigma}_{U_l})$ and $\mathrm{vec}(\mathbf{V}_{(m)})\sim N(\boldsymbol{\mu}_{V_m},\boldsymbol{\Sigma}_{V_m})$, where $\boldsymbol{\Sigma}_{U_l}$ and $\boldsymbol{\Sigma}_{V_m}$ are diagonal matrices. Usually we choose $\boldsymbol{\mu}_{U_l}=\boldsymbol{\mu}_U$, $\boldsymbol{\mu}_{V_m}=\boldsymbol{\mu}_V$, $\boldsymbol{\Sigma}_{U_l}=\boldsymbol{\Sigma}_U$, and $\boldsymbol{\Sigma}_{V_m}=\boldsymbol{\Sigma}_V$ for all $l=1,\dots,L$ and $m=1,\dots,M$. Lastly, we assign an inverse gamma prior to $\sigma^2$: $\sigma^2\sim IG(\alpha,\beta)$. A discussion of other choices of prior distributions and covariance structures is provided in Appendix A.6.

4. POSTERIOR INFERENCE

We conduct posterior inference using Markov chain Monte Carlo (MCMC) simulations. Given the dimension $\theta=(R_1,\dots,R_L,S_1,\dots,S_M)$ of the core tensor, we update $\mathbf{U}_{(l)}$, $\mathbf{V}_{(m)}$, $\mathbb{G}$, and $\sigma^2$ using Gibbs sampling transition probabilities, the details of which are given in Section 4.1. The posterior update of the core tensor dimension $\theta$ is challenging since the dimensions of $\mathbf{U}_{(l)}$, $\mathbf{V}_{(m)}$, and $\mathbb{G}$ change when $\theta$ varies. A reversible jump (RJ) MCMC algorithm (Green, 1995) is a natural choice for such a trans-dimensional update; however, it is difficult to construct a practicable RJ scheme due to the high dimensionality of the problem. To address this challenge, we develop an efficient Metropolis-Hastings (MH) algorithm to update $\theta$ building upon the idea of the fractional Bayes factor (Lee et al., 2016; O'Hagan, 1995) in Section 4.2. The R code for the proposed algorithms, with demonstrating examples, can be found at GitHub.

4.1. Posterior inference given the dimension of the core tensor

Given the dimension $\theta=(R_1,\dots,R_L,S_1,\dots,S_M)$ of the core tensor $\mathbb{G}$, we derive the full conditional posterior distributions of $\{\mathbf{U}_{(l)}\}_{l=1}^L$, $\{\mathbf{V}_{(m)}\}_{m=1}^M$, $\mathbb{G}$, and $\sigma^2$ in closed form. Without loss of generality, we first derive the full conditional posterior distribution of $\mathbf{U}_{(1)}$. The full conditional posterior distributions of $\mathbf{U}_{(2)},\dots,\mathbf{U}_{(L)}$ can be derived in the same manner.

By the properties of the $n$-mode product and the Tucker decomposition, we have

$$\mathbb{B} = \mathbb{G}\times_2\mathbf{U}_{(2)}\cdots\times_L\mathbf{U}_{(L)}\times_{L+1}\mathbf{V}_{(1)}\cdots\times_{L+M}\mathbf{V}_{(M)}\times_1\mathbf{U}_{(1)}.$$

Let $\mathbb{B}^{(-)}$ denote $\mathbb{G}\times_2\mathbf{U}_{(2)}\cdots\times_L\mathbf{U}_{(L)}\times_{L+1}\mathbf{V}_{(1)}\cdots\times_{L+M}\mathbf{V}_{(M)}$; then $\mathbb{B}^{(-)}\in\mathbb{R}^{R_1\times P_2\times\cdots\times P_L\times Q_1\times\cdots\times Q_M}$, and

$$\mathbb{B} = \mathbb{B}^{(-)}\times_1\mathbf{U}_{(1)}.$$

We denote the contracted product of $\mathbb{B}^{(-)}$ and $\mathbb{X}$ over the modes $P_2,\dots,P_L$ by a new tensor $\mathbb{C}$, so that $\mathbb{C}\in\mathbb{R}^{R_1\times N\times P_1\times Q_1\times\cdots\times Q_M}$. By matricizing $\mathbb{C}$ into $\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}\in\mathbb{R}^{N\prod_{m=1}^M Q_m\times R_1P_1}$, where $\mathcal{R}=\{N,Q_1,\dots,Q_M\}$ and $\mathcal{C}=\{R_1,P_1\}$, we can rewrite model (3) as

$$\mathrm{vec}(\mathbb{Y}) = \mathbf{C}_{(\mathcal{R}\times\mathcal{C})}\,\mathrm{vec}(\mathbf{U}_{(1)}) + \mathrm{vec}(\mathbb{E}). \qquad (6)$$

The proof for equation (6) is given in Appendix A.2.

Following (6), we can easily derive that the full conditional posterior distribution of $\mathrm{vec}(\mathbf{U}_{(1)})$ is normal:

$$p\big(\mathrm{vec}(\mathbf{U}_{(1)})\mid \mathrm{vec}(\mathbb{Y}),\mathbb{X},\sigma^2,\{\mathbf{V}_{(m)}\},\{\mathbf{U}_{(l)}\}_{l\neq 1}\big) \sim N(\boldsymbol{\mu}_U^*,\boldsymbol{\Sigma}_U^*), \qquad (7)$$

where

$$\boldsymbol{\Sigma}_U^* = \left(\frac{\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}^T\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}}{\sigma^2} + \boldsymbol{\Sigma}_U^{-1}\right)^{-1}, \qquad \boldsymbol{\mu}_U^* = \boldsymbol{\Sigma}_U^*\left(\frac{\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}^T\mathrm{vec}(\mathbb{Y})}{\sigma^2} + \boldsymbol{\Sigma}_U^{-1}\boldsymbol{\mu}_U\right).$$
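The draw in (7) is a standard conjugate normal update once $\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}$ has been formed; the base R sketch below illustrates the computation with a generic design matrix C standing in for $\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}$ (the sizes are illustrative assumptions, not values from the paper).

```r
# Sketch of the conjugate normal update (7): draw vec(U_(1)) given a design matrix C
# (standing in for C_(RxC)), the vectorized response y = vec(Y), sigma^2, and a
# N(mu0, Sigma0) prior with diagonal Sigma0.
set.seed(1)
n <- 200; p <- 12                                      # illustrative sizes only
C <- matrix(rnorm(n * p), n, p)
y <- C %*% rnorm(p) + rnorm(n, sd = 0.5)
sigma2 <- 0.25
mu0        <- rep(0, p)
Sigma0_inv <- diag(p)                                  # inverse of the diagonal prior covariance

post_prec <- crossprod(C) / sigma2 + Sigma0_inv        # posterior precision (Sigma_U*)^{-1}
post_cov  <- solve(post_prec)                          # Sigma_U*
post_mean <- post_cov %*% (crossprod(C, y) / sigma2 + Sigma0_inv %*% mu0)   # mu_U*

# One Gibbs draw of vec(U_(1)) via the Cholesky factor of the posterior covariance
u_draw <- as.vector(post_mean + t(chol(post_cov)) %*% rnorm(p))
```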

Figure 2 presents an illustration of updating U(1).

Figure 2. Illustration of updating $\mathbf{U}_{(1)}$.

We then derive the conditional distribution of $\mathbf{V}_{(m)}$ given $\sigma^2$, $\{\mathbf{U}_{(l)}\}_{l=1}^L$, $\mathbf{V}_{(k)}$ for $k\neq m$, and $\mathbb{G}$. Without loss of generality, we derive the full conditional posterior distribution of $\mathbf{V}_{(1)}$ below.

Denote the contracted product of the tensor $\mathbb{G}\times_1\mathbf{U}_{(1)}\cdots\times_L\mathbf{U}_{(L)}\times_{L+2}\mathbf{V}_{(2)}\cdots\times_{L+M}\mathbf{V}_{(M)}$ and the tensor $\mathbb{X}$ by a new tensor $\mathbb{D}$, where $\mathbb{D}\in\mathbb{R}^{N\times S_1\times Q_2\times\cdots\times Q_M}$. We then matricize $\mathbb{D}$ into a matrix $\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}\in\mathbb{R}^{N\prod_{m=2}^M Q_m\times S_1}$ and write

$$\mathbf{Y}_{(2)} = \mathbf{V}_{(1)}\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}^T + \mathbf{E}_{(2)}, \qquad (8)$$

where $\mathbf{Y}_{(2)}\in\mathbb{R}^{Q_1\times N\prod_{m=2}^M Q_m}$ is the mode-2 matricization of the tensor $\mathbb{Y}$. Let $\tilde{\mathbf{Y}}=\mathbf{Y}_{(2)}^T$; we can then rewrite (8) as

$$\mathrm{vec}(\tilde{\mathbf{Y}}) = \big(\mathbf{I}_{Q_1}\otimes\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}\big)\,\mathrm{vec}(\mathbf{V}_{(1)}^T) + \mathrm{vec}(\mathbf{E}_{(2)}^T),$$

where $\mathbf{I}_{Q_1}$ denotes an identity matrix of size $Q_1$. Given that the prior distribution of $\mathrm{vec}(\mathbf{V}_{(1)})$ is a normal $N(\boldsymbol{\mu}_V,\boldsymbol{\Sigma}_V)$ with diagonal $\boldsymbol{\Sigma}_V$, the prior distribution of $\mathrm{vec}(\mathbf{V}_{(1)}^T)$ is also a normal distribution $N(\tilde{\boldsymbol{\mu}}_V,\tilde{\boldsymbol{\Sigma}}_V)$ with a diagonal covariance matrix. Then the full conditional posterior distribution of $\mathrm{vec}(\mathbf{V}_{(1)}^T)$ is normal:

$$p\big(\mathrm{vec}(\mathbf{V}_{(1)}^T)\mid \mathrm{vec}(\tilde{\mathbf{Y}}),\mathbb{X},\sigma^2,\{\mathbf{U}_{(l)}\},\{\mathbf{V}_{(m)}\}_{m\neq 1}\big) \sim N(\tilde{\boldsymbol{\mu}}_V^*,\tilde{\boldsymbol{\Sigma}}_V^*), \qquad (9)$$

where

$$\tilde{\boldsymbol{\Sigma}}_V^* = \left(\frac{\big(\mathbf{I}_{Q_1}\otimes\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}\big)^T\big(\mathbf{I}_{Q_1}\otimes\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}\big)}{\sigma^2} + \tilde{\boldsymbol{\Sigma}}_V^{-1}\right)^{-1}, \qquad \tilde{\boldsymbol{\mu}}_V^* = \tilde{\boldsymbol{\Sigma}}_V^*\left(\frac{\big(\mathbf{I}_{Q_1}\otimes\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}\big)^T\mathrm{vec}(\tilde{\mathbf{Y}})}{\sigma^2} + \tilde{\boldsymbol{\Sigma}}_V^{-1}\tilde{\boldsymbol{\mu}}_V\right). \qquad (10)$$

The posterior distribution of the core tensor $\mathbb{G}$ is more complex than the posteriors of the $\mathbf{U}_{(l)}$'s and $\mathbf{V}_{(m)}$'s. First, we have

$$\mathbf{Y}_{(1)} = \mathbf{X}_{(1)}\mathbf{B}_{(\mathcal{P}\times\mathcal{Q})} + \mathbf{E}_{(1)} = \mathbf{X}_{(1)}\big(\mathbf{U}_{(L)}\otimes\cdots\otimes\mathbf{U}_{(1)}\big)\mathbf{G}_{(\mathcal{R}\times\mathcal{C})}\big(\mathbf{V}_{(M)}\otimes\cdots\otimes\mathbf{V}_{(1)}\big)^T + \mathbf{E}_{(1)}. \qquad (11)$$

Let $\mathbf{U}$ denote $\mathbf{U}_{(L)}\otimes\cdots\otimes\mathbf{U}_{(1)}$ and $\mathbf{V}$ denote $\mathbf{V}_{(M)}\otimes\cdots\otimes\mathbf{V}_{(1)}$; we have

$$\mathbf{Y}_{(1)} = \mathbf{X}_{(1)}\mathbf{U}\,\mathbf{G}_{(\mathcal{R}\times\mathcal{C})}\mathbf{V}^T + \mathbf{E}_{(1)}. \qquad (12)$$

If we further let $\tilde{\tilde{\mathbf{Y}}}$ denote $\mathbf{Y}_{(1)}\mathbf{V}(\mathbf{V}^T\mathbf{V})^{-1}$ and $\tilde{\tilde{\mathbf{E}}}$ denote $\mathbf{E}_{(1)}\mathbf{V}(\mathbf{V}^T\mathbf{V})^{-1}$, we can rewrite (12) as

$$\tilde{\tilde{\mathbf{Y}}} = \mathbf{X}_{(1)}\mathbf{U}\,\mathbf{G}_{(\mathcal{R}\times\mathcal{C})} + \tilde{\tilde{\mathbf{E}}},$$

or

$$\mathrm{vec}(\tilde{\tilde{\mathbf{Y}}}) = \big(\mathbf{I}_S\otimes\mathbf{X}_{(1)}\mathbf{U}\big)\,\mathrm{vec}(\mathbf{G}_{(\mathcal{R}\times\mathcal{C})}) + \mathrm{vec}(\tilde{\tilde{\mathbf{E}}}),$$

where $\mathbf{I}_S$ denotes an $S\times S$ identity matrix with $S=\prod_{m=1}^M S_m$, and $\mathrm{vec}(\tilde{\tilde{\mathbf{E}}})$ is normally distributed with mean $\mathbf{0}$ and block-diagonal covariance matrix $\sigma^2(\mathbf{V}^T\mathbf{V})^{-1}\otimes\mathbf{I}_N$. Then $\mathrm{vec}(\tilde{\tilde{\mathbf{Y}}})\sim N(\boldsymbol{\mu}_{\tilde{\tilde{Y}}},\boldsymbol{\Sigma}_{\tilde{\tilde{Y}}})$ with

$$\boldsymbol{\mu}_{\tilde{\tilde{Y}}} = \big(\mathbf{I}_S\otimes\mathbf{X}_{(1)}\mathbf{U}\big)\,\mathrm{vec}(\mathbf{G}_{(\mathcal{R}\times\mathcal{C})}), \qquad \boldsymbol{\Sigma}_{\tilde{\tilde{Y}}} = \sigma^2(\mathbf{V}^T\mathbf{V})^{-1}\otimes\mathbf{I}_N.$$

Given that the prior distribution of $\mathrm{vec}(\mathbb{G})$ is $N(\boldsymbol{\mu}_G,\boldsymbol{\Sigma}_G)$ with a diagonal covariance $\boldsymbol{\Sigma}_G$, $\mathrm{vec}(\mathbf{G}_{(\mathcal{R}\times\mathcal{C})})$ is also normally distributed, with mean $\tilde{\boldsymbol{\mu}}_G$ and diagonal covariance $\tilde{\boldsymbol{\Sigma}}_G$ obtained by rearranging the elements of $\boldsymbol{\mu}_G$ and $\boldsymbol{\Sigma}_G$. Then the full conditional posterior distribution of $\mathrm{vec}(\mathbf{G}_{(\mathcal{R}\times\mathcal{C})})$ is a normal distribution with

$$\tilde{\boldsymbol{\mu}}_G^* = \tilde{\boldsymbol{\Sigma}}_G^*\Big(\big(\mathbf{I}_S\otimes\mathbf{X}_{(1)}\mathbf{U}\big)^T\boldsymbol{\Sigma}_{\tilde{\tilde{Y}}}^{-1}\mathrm{vec}(\tilde{\tilde{\mathbf{Y}}}) + \tilde{\boldsymbol{\Sigma}}_G^{-1}\tilde{\boldsymbol{\mu}}_G\Big), \qquad \tilde{\boldsymbol{\Sigma}}_G^* = \Big(\big(\mathbf{I}_S\otimes\mathbf{X}_{(1)}\mathbf{U}\big)^T\boldsymbol{\Sigma}_{\tilde{\tilde{Y}}}^{-1}\big(\mathbf{I}_S\otimes\mathbf{X}_{(1)}\mathbf{U}\big) + \tilde{\boldsymbol{\Sigma}}_G^{-1}\Big)^{-1}. \qquad (13)$$

Lastly, deriving the full conditional posterior distribution of $\sigma^2$ is straightforward:

$$p\big(\sigma^2\mid\mathbb{Y},\mathbb{X},\{\mathbf{U}_{(l)}\}_{l=1}^L,\{\mathbf{V}_{(m)}\}_{m=1}^M,\mathbb{G}\big) \sim IG(\alpha^*,\beta^*), \qquad (14)$$

where $\alpha^* = \alpha + \frac{NQ}{2}$ and $\beta^* = \beta + \frac{\|\mathbb{Y}-\langle\mathbb{X},\mathbb{B}\rangle_L\|_F^2}{2}$ with $\mathbb{B}$ defined in (5), and $Q=\prod_{m=1}^M Q_m$.

4.2. Updating the model dimension

In this subsection, we show how to simultaneously update the dimension of the core tensor and estimate the model parameters. Denote $\theta=(R_1,\dots,R_L,S_1,\dots,S_M)$, and assign a prior distribution $\pi(\theta)$ to $\theta$; in this study we use a uniform distribution over all candidate values of $\theta$. Since the conditional posterior distribution of $\theta$ is not available in closed form, we employ a trans-dimensional Metropolis-Hastings (MH) sampler to update $\theta$. The most challenging task is to design a good proposal distribution that yields a reasonable acceptance rate, given that the dimensions of $\{\mathbf{U}_{(l)}\}_{l=1}^L$, $\{\mathbf{V}_{(m)}\}_{m=1}^M$, and $\mathbb{G}$ change when $\theta$ varies.

To address the challenge, we construct our proposal distribution building upon the idea of the fractional Bayes factor (O'Hagan, 1995; Lee et al., 2016). Assuming that at iteration $t-1$ of the MCMC sampler $\theta^{(t-1)}=(R_1^{(t-1)},\dots,R_L^{(t-1)},S_1^{(t-1)},\dots,S_M^{(t-1)})$, at iteration $t$ we generate a candidate $\tilde{\theta}$ from the "neighbor" set of $\theta^{(t-1)}$, defined as $O(\theta^{(t-1)}):=\{\tilde{\theta}\in\Theta:\|\theta^{(t-1)}-\tilde{\theta}\|_{L_1}=1\}$, where $\Theta$ is the parameter space for $\theta$. In this work, we propose to generate $\tilde{\theta}$ uniformly over all candidates in $O(\theta^{(t-1)})$, denoted by $q(\tilde{\theta}\mid\theta^{(t-1)})$. To calculate the acceptance rate of the proposed $\tilde{\theta}$ in the MH step, we denote $\xi=\big(\{\mathbf{U}_{(l)}\}_{l=1}^L,\{\mathbf{V}_{(m)}\}_{m=1}^M,\mathbb{G},\sigma^2\big)$ and write the likelihood function as the product of two parts:

$$p(\mathbb{Y}\mid\xi,\theta) = \underbrace{p(\mathbb{Y}\mid\xi,\theta)^b}_{\text{training}}\times\underbrace{p(\mathbb{Y}\mid\xi,\theta)^{1-b}}_{\text{testing}},$$

where $0<b<1$ is small. The key idea is to utilize a fraction $b$ of the data as training data to propose new parameters $\xi$ associated with $\tilde{\theta}$, so that the new values can be accepted with a reasonable probability. In practice, we usually choose a small value of $b$, e.g., between 0.01 and 0.1, to obtain a reasonable acceptance rate. In particular, we update $\theta^{(t)}$ as follows:

  • Generate $\tilde{\theta}$ from $q(\cdot\mid\theta^{(t-1)})$.

  • Generate $\tilde{\xi}$ from a distribution proportional to
    $$p(\mathbb{Y}\mid\tilde{\xi},\tilde{\theta})^b\times p(\tilde{\xi}\mid\tilde{\theta}),$$
    which is the posterior of $\xi$ based on the training portion, conditional on $\tilde{\theta}$. The detailed conditional posterior distributions are shown in Appendix A.4.

  • Generate $\tilde{\tilde{\xi}}$ from a distribution proportional to
    $$p(\mathbb{Y}\mid\tilde{\tilde{\xi}},\theta^{(t-1)})^b\times p(\tilde{\tilde{\xi}}\mid\theta^{(t-1)}),$$
    which is the posterior of $\xi$ based on the training portion, conditional on $\theta^{(t-1)}$.

  • Accept $\tilde{\theta}$ with probability $\min\{1, A(\tilde{\theta},\theta^{(t-1)})\}$, where
    $$A(\tilde{\theta},\theta^{(t-1)}) = \frac{\pi(\tilde{\theta})\,q(\theta^{(t-1)}\mid\tilde{\theta})}{\pi(\theta^{(t-1)})\,q(\tilde{\theta}\mid\theta^{(t-1)})}\times\underbrace{\frac{p(\mathbb{Y}\mid\tilde{\xi},\tilde{\theta})^{(1-b)}}{p(\mathbb{Y}\mid\tilde{\tilde{\xi}},\theta^{(t-1)})^{(1-b)}}}_{(*)}. \qquad (15)$$

The (*) part in equation (15) coincides with the fractional Bayes factor given in O’Hagan (1995). The detailed proof is given in Appendix A.5.
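On the log scale, the acceptance probability (15) only involves the prior ratio, the proposal ratio, and the two likelihood terms raised to the power $1-b$; a minimal sketch (assuming the log-likelihoods at the freshly drawn $\tilde{\xi}$ and $\tilde{\tilde{\xi}}$ have already been computed) is given below.

```r
# Sketch of the Metropolis-Hastings acceptance probability (15) on the log scale.
# loglik_prop and loglik_cur are log p(Y | xi~, theta~) and log p(Y | xi~~, theta^(t-1)),
# evaluated at parameters freshly drawn from the fraction-b (training) posteriors;
# log_prior_ratio and log_q_ratio are log{pi(theta~)/pi(theta^(t-1))} and
# log{q(theta^(t-1)|theta~)/q(theta~|theta^(t-1))}, both zero for flat/symmetric choices.
accept_prob <- function(loglik_prop, loglik_cur, b,
                        log_prior_ratio = 0, log_q_ratio = 0) {
  logA <- log_prior_ratio + log_q_ratio + (1 - b) * (loglik_prop - loglik_cur)
  min(1, exp(logA))
}

# Example: with b = 0.05, a proposal whose fit to the remaining data is 10 log-units better
accept_prob(loglik_prop = -1490, loglik_cur = -1500, b = 0.05)
```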


We summarize the full MCMC sampler in Algorithm 1. Predictive inference is straightforward based on the posterior samples obtained from Algorithm 1. Given new data $\mathbb{X}_{\text{new}}$ of $\tilde{N}$ samples, we can easily sample predictions $\mathbb{Y}_{\text{new}}$ according to

$$\mathrm{vec}(\hat{\mathbb{Y}}_{\text{new}}) \sim N\Big(\mathrm{vec}\big(\langle\mathbb{X}_{\text{new}},\hat{\mathbb{B}}\rangle_L\big),\ \hat{\sigma}^2\mathbf{I}_{\tilde{N}Q}\Big), \qquad (16)$$

where $Q=\prod_{m=1}^M Q_m$, and $\hat{\mathbb{B}}$ is calculated by (5) using post-burn-in samples of $\mathbb{G}$, $\{\mathbf{U}_{(l)}\}_{l=1}^L$, $\{\mathbf{V}_{(m)}\}_{m=1}^M$, and $\sigma^2$.
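Through the matricized form (4), drawing from (16) amounts to adding normal noise to $\mathbf{X}_{\text{new},(1)}\hat{\mathbf{B}}_{(\mathcal{P}\times\mathcal{Q})}$; a self-contained sketch with illustrative sizes, where Bhat_mat and sigma2_hat stand in for one post-burn-in draw, is:

```r
# Sketch of posterior predictive sampling (16) in matricized form:
# vec(Y_new_hat) ~ N(vec(X_new_(1) %*% Bhat_(PxQ)), sigma2_hat * I).
set.seed(1)
N_new <- 4; P <- 6; Q <- 3                             # illustrative sizes only
X_new_mat  <- matrix(rnorm(N_new * P), N_new, P)       # X_new,(1)
Bhat_mat   <- matrix(rnorm(P * Q), P, Q)               # one draw of B_(PxQ) built via (5)
sigma2_hat <- 0.1                                      # one draw of sigma^2

mean_mat  <- X_new_mat %*% Bhat_mat                    # N_new x Q matrix of predictive means
Y_new_hat <- mean_mat + matrix(rnorm(N_new * Q, sd = sqrt(sigma2_hat)), N_new, Q)
```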

5. FAST COMPUTING ALGORITHM

In practice, the proposed MCMC sampler involves generating samples from high-dimensional conditional posterior distributions at each iteration, which can be time-consuming. In this section, we propose an ultra-fast optimization-based computing algorithm as an alternative for posterior inference using maximum a posteriori (MAP) estimators. Given the dimension of the core tensor, the MAP estimators of $\{\mathbf{U}_{(l)}\}_{l=1}^L$, $\{\mathbf{V}_{(m)}\}_{m=1}^M$, $\mathbb{G}$, and $\sigma^2$ can be computed in closed form. Specifically, the MAP estimator of $\mathbf{U}_{(1)}$ is

$$\mathrm{vec}(\mathbf{U}_{(1)})_{\text{MAP}} = \Big(\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}^T\mathbf{C}_{(\mathcal{R}\times\mathcal{C})} + \sigma^2\boldsymbol{\Sigma}_U^{-1}\Big)^{-1}\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}^T\mathrm{vec}(\mathbb{Y}), \qquad (17)$$

when $\boldsymbol{\mu}_U=\mathbf{0}$. If we further set $\boldsymbol{\Sigma}_U$ to be an identity matrix, the result is exactly the solution to the ridge regression problem

$$\arg\min_{\mathbf{U}_{(1)}}\ \|\mathbb{Y}-\langle\mathbb{X},\mathbb{B}\rangle_L\|_F^2 + \lambda\|\mathbf{U}_{(1)}\|_2^2,$$

with $\lambda=\sigma^2$ and $\mathbb{B}=[\![\mathbb{G};\mathbf{U}_{(1)},\dots,\mathbf{U}_{(L)},\mathbf{V}_{(1)},\dots,\mathbf{V}_{(M)}]\!]$. Similarly, assuming $\boldsymbol{\mu}_V=\mathbf{0}$ and $\boldsymbol{\mu}_G=\mathbf{0}$, the MAP estimators of $\mathbf{V}_{(1)}$ and the core tensor $\mathbb{G}$ are

$$\mathrm{vec}(\mathbf{V}_{(1)}^T)_{\text{MAP}} = \Big(\mathbf{I}_{Q_1}\otimes\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}^T\mathbf{D}_{(\mathcal{R}\times\mathcal{C})} + \sigma^2\tilde{\boldsymbol{\Sigma}}_V^{-1}\Big)^{-1}\big(\mathbf{I}_{Q_1}\otimes\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}\big)^T\mathrm{vec}(\tilde{\mathbf{Y}}), \qquad (18)$$

and

$$\mathrm{vec}(\mathbf{G}_{(\mathcal{R}\times\mathcal{C})})_{\text{MAP}} = \Big(\mathbf{V}^T\mathbf{V}\otimes(\mathbf{X}_{(1)}\mathbf{U})^T(\mathbf{X}_{(1)}\mathbf{U}) + \sigma^2\tilde{\boldsymbol{\Sigma}}_G^{-1}\Big)^{-1}\Big(\mathbf{V}^T\mathbf{V}\otimes(\mathbf{X}_{(1)}\mathbf{U})^T\Big)\mathrm{vec}(\tilde{\tilde{\mathbf{Y}}}), \qquad (19)$$

with the same notation as in Section 4.1. The MAP estimator of $\sigma^2$ is

$$\sigma^2_{\text{MAP}} = \frac{\beta^*}{\alpha^*+1}, \qquad (20)$$

with $\alpha^* = \alpha + \frac{NQ}{2}$, $\beta^* = \beta + \frac{\|\mathbb{Y}-\langle\mathbb{X},\mathbb{B}\rangle_L\|_F^2}{2}$, and $Q=\prod_{m=1}^M Q_m$. We remark that the unregularized least squares results in Lock (2018) are a special case of our MAP results above, obtained when flat priors are assigned to the $\mathbf{U}_{(l)}$'s and $\mathbf{V}_{(m)}$'s and the core tensor is fixed to be superdiagonal.
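The closed form (17) is a ridge-type solve; a generic base R sketch with a design matrix C standing in for $\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}$, $\boldsymbol{\mu}_U=\mathbf{0}$, and $\boldsymbol{\Sigma}_U=\mathbf{I}$ (so that $\lambda=\sigma^2$) is:

```r
# Sketch of the MAP estimator (17): with mu_U = 0 and Sigma_U = I this is exactly
# the ridge solution (C^T C + sigma^2 I)^{-1} C^T y, with C standing in for C_(RxC).
set.seed(1)
n <- 200; p <- 12                                      # illustrative sizes only
C <- matrix(rnorm(n * p), n, p)
y <- C %*% rnorm(p) + rnorm(n, sd = 0.5)
sigma2 <- 0.25

u_map <- solve(crossprod(C) + sigma2 * diag(p), crossprod(C, y))
u_ols <- solve(crossprod(C), crossprod(C, y))          # unregularized limit (sigma^2 -> 0)
```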

By using the MAP estimators of $\{\mathbf{U}_{(l)}\}_{l=1}^L$, $\{\mathbf{V}_{(m)}\}_{m=1}^M$, and $\mathbb{G}$ instead of generating samples from high-dimensional posterior distributions, the problem of choosing the optimal dimension of the core tensor becomes a discrete optimization problem: find the tuple $\theta:=(R_1,\dots,R_L,S_1,\dots,S_M)\in\Theta$ over an $(L+M)$-dimensional grid that minimizes some loss function. In this work, we use the Bayesian information criterion (BIC) as the loss function.

To solve the discrete optimization problem, we adopt the simulated annealing (SA) algorithm (Kirkpatrick et al., 1983), a metaheuristic for approximating the global optimum in a large search space. Starting from the current guess $\theta^{(t)}$ at iteration $t$, we uniformly generate $\tilde{\theta}$ from $O(\theta^{(t)})$. The probability of accepting the new candidate $\tilde{\theta}$ is $A(\tilde{\theta},\theta^{(t)})$, defined as

$$A(\tilde{\theta},\theta^{(t)}) = \begin{cases} 1 & \text{if } \mathrm{BIC}(\tilde{\theta})<\mathrm{BIC}(\theta^{(t)}),\\[4pt] \exp\!\left(\dfrac{\mathrm{BIC}(\theta^{(t)})-\mathrm{BIC}(\tilde{\theta})}{\zeta(t)}\right) & \text{otherwise}, \end{cases} \qquad (21)$$

where $\zeta(t)$, a function of the iteration $t$, is the usual temperature parameter in the standard SA algorithm. Two commonly used choices for $\zeta(t)$ are $\zeta(t)=\gamma^t\zeta_0$ (Dosso and Oldenburg, 1991) and $\zeta(t)=\zeta_0/\log(1+t)$ (Geman and Geman, 1984), where $\zeta_0$ is the initial temperature. The detailed procedure of the fast computing algorithm is shown in Algorithm 2.
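A compact sketch of the SA moves in (21) over a grid of candidate dimensions, using geometric cooling and a toy stand-in for the BIC objective (so the example runs without fitting any model), is:

```r
# Sketch of the simulated annealing search (21) over a grid of core-tensor dimensions.
# bic_fn is a stand-in for the BIC of the MAP fit at dimension theta; here it is a toy
# quadratic minimized at (3, 3, 3, 3), purely for illustration.
set.seed(1)
bic_fn <- function(theta) sum((theta - c(3, 3, 3, 3))^2) * 50 + 1000

theta   <- c(1, 1, 1, 1)                                # initial guess
lower   <- rep(1, 4); upper <- rep(5, 4)                # candidate grid Theta
zeta0   <- 100; gamma <- 0.95                           # geometric cooling zeta(t) = gamma^t * zeta0
bic_cur <- bic_fn(theta)

for (t in 1:200) {
  # propose a neighbor: change one coordinate by +/- 1, staying inside the grid
  repeat {
    cand <- theta
    i <- sample(4, 1)
    cand[i] <- cand[i] + sample(c(-1, 1), 1)
    if (cand[i] >= lower[i] && cand[i] <= upper[i]) break
  }
  bic_cand <- bic_fn(cand)
  zeta <- gamma^t * zeta0
  if (bic_cand < bic_cur || runif(1) < exp((bic_cur - bic_cand) / zeta)) {
    theta <- cand; bic_cur <- bic_cand
  }
}
theta                                                   # should end near (3, 3, 3, 3)
```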


As will be shown in the numerical studies of Section 6, the proposed Algorithm 2 performs well in terms of parameter estimation and saves around 90% of the running time in most simulation cases. For example, the average running time of the BayTensor MCMC algorithm on a problem with $\mathbb{X}\in\mathbb{R}^{100\times 16\times 12}$ and $\mathbb{Y}\in\mathbb{R}^{100\times 10\times 8}$ is around 23 minutes on a personal laptop, whereas the BayTensor Fast algorithm takes less than 3 minutes on the same machine. We also remark that Algorithm 2 only provides a point estimate of the parameters since it is an optimization-based method. If uncertainty quantification of the parameters is desired, we suggest using the estimated optimal dimension of the core tensor from Algorithm 2 and then running the MCMC sampler introduced in Section 4.1, i.e., updating the model parameters given the estimated dimension of the core tensor. Such a treatment not only saves the computing time for updating the dimension of the core tensor in Algorithm 1, but also quantifies parameter uncertainties.

6. SIMULATION STUDIES

We evaluate the proposed Bayesian tensor-on-tensor regression model and the computational algorithms through extensive simulation studies and comparisons to alternative methods.

6.1. Simulation setup

Assume that the response tensor is a three-way array $\mathbb{Y}\in\mathbb{R}^{N\times Q_1\times Q_2}$ and the predictor tensor is another three-way array $\mathbb{X}\in\mathbb{R}^{N\times P_1\times P_2}$. The sample size is set to $N=100$. We set $P_1=16$, $P_2=12$, $Q_1=10$, and $Q_2=8$. We generate simulated data as follows:

  • Generate $\mathbb{X}\in\mathbb{R}^{N\times P_1\times P_2}$.

  • Generate $\mathbf{U}_{(l)}$ and $\mathbf{V}_{(m)}$ for $l=1,\dots,L$ and $m=1,\dots,M$, each with independent $N(0,1)$ entries.

  • Generate $\mathbb{G}$ with dimensions $\theta^*$ and independent $N(0,1)$ entries.

  • Generate $\mathbb{E}\in\mathbb{R}^{N\times Q_1\times Q_2}$.

  • Calculate $\mathbb{Y} = \langle\mathbb{X},\mathbb{B}\rangle_L + c\mathbb{E}$, where
    $$\mathbb{B} = [\![\mathbb{G};\mathbf{U}_{(1)},\dots,\mathbf{U}_{(L)},\mathbf{V}_{(1)},\dots,\mathbf{V}_{(M)}]\!],$$
    and $c$ is a scaling parameter defining the signal-to-noise ratio (SNR) such that
    $$\frac{\|\langle\mathbb{X},\mathbb{B}\rangle_L\|_F^2}{c^2\|\mathbb{E}\|_F^2} = \mathrm{SNR}.$$
    (A small sketch of this generation scheme is given after this list.)
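Working in the matricized form (4) with (11), the generation scheme above reduces to matrix algebra for the three-way case ($L=M=2$); the following self-contained base R sketch draws the core directly in its matricized form (a simplified illustration for the uncorrelated-normal setup described next, not the authors' simulation code).

```r
# Sketch of the simulation scheme for L = M = 2, working directly with the matricized
# coefficient B_(PxQ) = (U2 kron U1) G_mat (V2 kron V1)^T, so that
# Y_(1) = X_(1) B_(PxQ) + c * E_(1)   (equations (4) and (11)).
set.seed(1)
N <- 100; P <- c(16, 12); Q <- c(10, 8)
theta_star <- c(3, 3, 3, 3); SNR <- 2                  # (R1, R2, S1, S2) and target SNR

X1 <- matrix(rnorm(N * prod(P)), N, prod(P))           # X_(1), uncorrelated-normal entries
U1 <- matrix(rnorm(P[1] * theta_star[1]), P[1], theta_star[1])
U2 <- matrix(rnorm(P[2] * theta_star[2]), P[2], theta_star[2])
V1 <- matrix(rnorm(Q[1] * theta_star[3]), Q[1], theta_star[3])
V2 <- matrix(rnorm(Q[2] * theta_star[4]), Q[2], theta_star[4])
Gmat <- matrix(rnorm(prod(theta_star)), theta_star[1] * theta_star[2])  # matricized core

Bmat <- kronecker(U2, U1) %*% Gmat %*% t(kronecker(V2, V1))             # B_(PxQ)
E1   <- matrix(rnorm(N * prod(Q)), N, prod(Q))                          # E_(1)
c_sc <- sqrt(sum((X1 %*% Bmat)^2) / (SNR * sum(E1^2)))                  # matches the SNR definition
Y1   <- X1 %*% Bmat + c_sc * E1                                         # Y_(1); refold with array() if needed
```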

We consider three different setups for generating the predictor tensor $\mathbb{X}$ and the error $\mathbb{E}$. In the first setup, all entries of $\mathbb{X}$ and $\mathbb{E}$ are independently and identically generated from a normal distribution $N(0,1)$; we call this the uncorrelated-normal setup. To test the performance of our algorithms when the data generating model differs from the data fitting model, in the second setup all entries of $\mathbb{X}$ are independently and identically generated from a normal $N(0,1)$, and the entries of $\mathbb{E}$ are independently and identically generated from a Student-$t$ distribution with 3 degrees of freedom. We call this the uncorrelated-$t$ setup. In the third setup, we consider the entries for each observation of the tensor $\mathbb{X}$ to have a correlated structure, as seen in some real-world applications such as spatial and temporal data (Lankao et al., 2008). We call it the correlated setup. Specifically, for each of the $N$ observations of $\mathbb{X}$, denoted by $\mathbf{X}^{(n)}\in\mathbb{R}^{P_1\times P_2}$, $n=1,\dots,N$, the correlation between the $(i,j)$ entry and the $(k,l)$ entry of $\mathbf{X}^{(n)}$ is $e^{-r(|i-k|^2+|j-l|^2)}$, where $i,k=1,\dots,P_1$, $j,l=1,\dots,P_2$, and $r>0$.

In each of the setups, we consider two cases for the simulated true dimension of the core tensor $\mathbb{G}$, $\theta^*=(3,3,3,3)$ and $(4,4,2,2)$, and two cases for the signal-to-noise ratio, SNR = 2 and 5, yielding a total of four cases. We conduct 50 repeated experiments by generating 50 replicated datasets for each case, and apply the proposed Bayesian tensor-on-tensor method using both the MCMC sampler (BayTensor MCMC) in Algorithm 1 and the fast computing Algorithm 2 (BayTensor Fast) to each dataset.

To evaluate the performance of different methods, for each dataset in each case we generate 5 new datasets with Nnew=1000 observations. The new observations are given by

$$\mathbb{Y}_{\text{new}} = \langle\mathbb{X}_{\text{new}},\mathbb{B}\rangle_L + c\mathbb{E}_{\text{new}},$$

with $\mathbb{X}_{\text{new}}$ and $\mathbb{E}_{\text{new}}$ generated in the same way as $\mathbb{X}$ and $\mathbb{E}$. We then calculate the relative prediction error (RPE), defined as the average prediction error over the 5 new datasets:

$$\mathrm{RPE} = \frac{1}{5}\sum_{i=1}^{5}\frac{\|\mathbb{Y}_{\text{new}}^{(i)}-\hat{\mathbb{Y}}_{\text{new}}^{(i)}\|_F^2}{\|\mathbb{Y}_{\text{new}}^{(i)}\|_F^2},$$

where $\hat{\mathbb{Y}}_{\text{new}}^{(i)}$ is the predicted value for dataset $i$, $i=1,\dots,5$. For the BayTensor MCMC and BayTensor Fast algorithms, we also calculate the "Dimension Recovery" rate, defined as (the number of experiments where the recovered dimension equals the true core tensor dimension) / (total number of experiments), for each simulation setup.

For comparison, we consider four alternative methods:

  • The tensor-on-tensor regression method based on the CP decomposition from Lock (2018), denoted as the CP method. As the CP method needs to pre-define or estimate the rank R, we run their algorithm with different R’s and report the result under the optimal R value, defined as the one yielding the smallest RPE.

  • Multivariate linear regression after flattening tensors to vectors, solving for $\mathbf{B}_{(\mathcal{P}\times\mathcal{Q})}$ in equation (4), denoted as the OLS method.

  • Bayesian multivariate linear regression with the rescaled spike-and-slab prior (Ishwaran and Rao, 2005) for inducing sparsity after flattening tensors to vectors, denoted as the SAS (Spike-And-Slab) method.

  • Bayesian multivariate linear regression method with the horseshoe prior (Carvalho et al., 2010) for inducing shrinkage after turning tensors to vectors, denoted as the HS (horseshoe) method.

6.2. Simulation results: uncorrelated setup

We first apply the proposed BayTensor MCMC with b=0.05, BayTensor Fast, and four alternative methods (including the CP method, the OLS method, the SAS method, and the HS method) to datasets under the uncorrelated-normal setup. For the CP method, we run the algorithm with R=1,,6. The reason why we choose 6 as the upper bound is that when R=6, the total number of parameters in the CP method is 276 which is larger than the true number of parameters 212 for the θ*=(4, 4, 2, 2) case and 219 for the θ*=(3, 3, 3, 3) case. The CP method with R=6 yields the smallest RPE under both cases in all repeated simulations. Full results from the CP method with different R values are shown in Appendix A.9.

Table 1 reports the means and standard deviations (SD) of the RPEs averaged over 50 replicated experiments for each of the four cases under the six methods. Whether the information in $\mathbb{X}$ and $\mathbb{Y}$ is balanced (i.e., $\theta^*=(3,3,3,3)$) or skewed toward some modes (i.e., $\theta^*=(4,4,2,2)$), BayTensor MCMC and BayTensor Fast always yield smaller RPEs than the CP method, while the OLS method, the SAS method, and the HS method all yield much larger RPEs. For example, when $\theta^*=(4,4,2,2)$ with SNR = 2, BayTensor MCMC and BayTensor Fast have comparable results with mean RPEs of 0.350 and 0.357 respectively, while the CP method has a slightly larger mean RPE of 0.377, and the OLS method, the SAS method, and the HS method have much larger mean RPEs of 1.016, 0.970, and 1.114 respectively. The superior performance of the Bayesian tensor-on-tensor regression method over the CP method comes from the flexibility of the Tucker decomposition, which permits different ranks along different modes of the core tensor. From Table 1, we can also see that a larger signal-to-noise ratio leads to lower RPEs. In particular, when SNR = 5, all methods except the HS method have lower RPEs compared to the cases where SNR = 2.

Table 1.

Mean RPE with SD on uncorrelated-normal data

RPE (SD) BayTensor MCMC BayTensor Fast CP Method OLS Method SAS Method HS Method
θ* = (4, 4, 2, 2), SNR=2 0.350 (0.013) 0.357 (0.026) 0.377 (0.014) 1.016 (0.034) 0.970 (0.025) 1.114 (0.310)
θ* = (3, 3, 3, 3), SNR=2 0.343 (0.004) 0.351 (0.015) 0.391 (0.016) 1.023 (0.031) 0.963 (0.020) 1.079 (0.153)
θ* = (4, 4, 2, 2), SNR=5 0.173 (0.001) 0.185 (0.034) 0.209 (0.046) 0.751 (0.030) 0.946 (0.038) 1.182 (0.410)
θ* = (3, 3, 3, 3), SNR=5 0.172 (0.001) 0.181 (0.037) 0.222 (0.016) 0.748 (0.023) 0.934 (0.031) 1.136 (0.240)

As the proposed BayTensor MCMC and BayTensor Fast methods can simultaneously estimate the dimension of the core tensor and the other model parameters, we next report the empirical probabilities that they recover the true dimension of the core tensor in the 50 replicated experiments for all 4 cases, as shown in Table 2. BayTensor MCMC has higher recovery rates than BayTensor Fast in all cases, and both methods recover the core tensor dimension more reliably with a higher SNR. Another observation is that both BayTensor MCMC and BayTensor Fast have higher recovery rates when $\theta^*=(3,3,3,3)$ compared with the cases where $\theta^*=(4,4,2,2)$.

Table 2.

Core tensor dimension recovery and number of model parameters on uncorrelated-normal data

BayTensor MCMC BayTensor Fast
Dimension Recovery # Parameters (SD) Dimension Recovery # Parameters (SD)
θ* = (4, 4, 2, 2), SNR=2 80% 207 (15) 46% 189 (14)
θ* = (3, 3, 3, 3), SNR=2 98% 218 (6) 54% 193 (18)
θ* = (4, 4, 2, 2), SNR=5 96% 215 (14) 64% 209 (16)
θ* = (3, 3, 3, 3), SNR=5 100% 219 (0) 76% 215 (16)

We also present the average numbers of parameters required by BayTensor MCMC and BayTensor Fast in 50 replicated experiments in Table 2. BayTensor MCMC and BayTensor Fast usually require smaller numbers of parameters than the CP method. For example, when θ*=(4, 4, 2, 2) with SNR = 2, BayTensor MCMC requires 207 parameters on average, and BayTensor Fast requires 196 parameters on average. In contrast, the CP method with R=6 has 276 parameters. When θ*=(3, 3, 3, 3) with SNR = 2, the average numbers of parameters are 218, 193, and 276 for BayTensor MCMC, BayTensor Fast, and the CP method respectively. We can see that both BayTensor MCMC and BayTensor Fast yield higher dimension recovery rates than the CP method with smaller numbers of parameters.

To quantify the prediction uncertainty, for each test dataset under each simulation setup, we collect 1000 post-burn-in samples from BayTensor MCMC and generate the empirical posterior predictive distribution for each element of $\mathbb{Y}_{\text{new}}$. Symmetric credible intervals are then calculated. For cases with $\theta^*=(4,4,2,2)$, the empirical coverage rates of the 95% credible intervals are 0.944 (0.013) and 0.945 (0.012) for SNR = 2 and SNR = 5 respectively. For cases with $\theta^*=(3,3,3,3)$, the empirical coverage rates of the 95% credible intervals are 0.950 (0.011) and 0.950 (0.011) for SNR = 2 and SNR = 5 respectively. Figure 3(a) plots the posterior predictive estimates with 95% credible intervals for a randomly selected sample of $\mathbb{Y}_{\text{new}}$ in a randomly selected test case with $\theta^*=(3,3,3,3)$ and SNR = 5. Figure 3(b) plots the empirical predictive distributions from BayTensor MCMC for 8 randomly selected elements from the same sample of $\mathbb{Y}_{\text{new}}$ shown in Figure 3(a).

Figure 3. Uncertainty quantification for a randomly selected case where $\theta^*=(3,3,3,3)$ and SNR = 5. (a): An example of prediction with estimation uncertainty. Elements of $\mathbb{Y}_{\text{new}}[1,:,:]$ are sorted for visualization. (b): Empirical predictive distributions from BayTensor MCMC for randomly selected elements of $\mathbb{Y}_{\text{new}}$, with the simulation truths shown as red lines.

We then apply all six methods to datasets under the uncorrelated-t setup. The results are very similar to those under the uncorrelated-normal setup, suggesting that our proposed algorithms have reasonably good performance under model mis-specification when the data fitting model is different from the data generating model. The RPEs and dimension recovery results are presented in Table A.1 and Table A.2 of Appendix A.7, respectively.

6.3. Simulation results: correlated setup

We then apply the proposed BayTensor MCMC method with $b=0.05$, the BayTensor Fast method, the CP method, the OLS method, the SAS method, and the HS method to datasets under the correlated setup. For the CP method, we run the algorithm with $R=1,\dots,6$. The CP method with $R=6$ yields the smallest RPE in all cases except $\theta^*=(4,4,2,2)$ with SNR = 2, where the optimal $R=5$. Full results from the CP method with different $R$ values are shown in Appendix A.9. For each replicated dataset, we compute the RPEs under all six methods. The means and standard deviations (SD) of the RPEs averaged over the 50 replicated datasets for all 4 cases are presented in Table 3.

Table 3.

Mean RPE with SD on correlated data

RPE (SD) BayTensor MCMC BayTensor Fast CP Method OLS Method SAS Method HS Method
θ* = (4, 4, 2, 2), SNR=2 0.344 (0.004) 0.350 (0.006) 0.370(0.011) 0.886 (0.086) 0.542 (0.115) 1.380 (0.540)
θ* = (3, 3, 3, 3), SNR=2 0.344 (0.004) 0.348 (0.006) 0.369(0.007) 0.893 (0.065) 0.532 (0.121) 1.366 (0.783)
θ* = (4, 4, 2, 2), SNR=5 0.173 (0.002) 0.175 (0.003) 0.188(0.006) 0.492 (0.086) 0.397 (0.141) 1.466 (0.648)
θ* = (3, 3, 3, 3), SNR=5 0.173 (0.008) 0.173 (0.003) 0.187(0.006) 0.497 (0.059) 0.384 (0.150) 1.479 (1.003)

The prediction RPE results under the correlated setup are analogous to those under the uncorrelated setup. To summarize, the RPE results from BayTensor MCMC and BayTensor Fast are comparable in all cases, and both of them yield better RPEs than the CP method, followed by the SAS method and the OLS method. The HS method performs the worst, yielding the highest RPEs in all cases. When the signal-to-noise ratio increases from 2 to 5, the prediction accuracy improves under all methods except the HS method.

Table 4 shows the empirical probabilities of dimension recovery and the average numbers of parameters required by BayTensor MCMC and BayTensor Fast in the 50 replicated experiments. In all 4 cases, BayTensor MCMC and BayTensor Fast both require a smaller number of parameters than the CP method. In terms of recovering the dimension of the core tensor, BayTensor MCMC has higher empirical recovery probabilities than BayTensor Fast in all cases, and the recovery probabilities in the cases where $\theta^*=(3,3,3,3)$ are higher than those where $\theta^*=(4,4,2,2)$ for both methods. We also observe that the recovery probabilities under the correlated setup are smaller than those under the uncorrelated setup for both BayTensor MCMC and BayTensor Fast.

Table 4.

Core tensor dimension recovery and number of model parameters on correlated data

BayTensor MCMC BayTensor Fast
Dimension Recovery # Parameters (SD) Dimension Recovery # Parameters (SD)
θ* = (4, 4, 2, 2), SNR=2 72% 207 (17) 36% 196 (16)
θ* = (3, 3, 3, 3), SNR=2 86% 214 (14) 32% 164 (20)
θ* = (4, 4, 2, 2), SNR=5 96% 212 (7) 44% 188 (15)
θ* = (3, 3, 3, 3), SNR=5 96% 218 (6) 56% 196 (15)

7. REAL DATA ANALYSES

We demonstrate the usefulness of the proposed Bayesian tensor-on-tensor regression approach by applying it to two real-world datasets: the Labeled Faces in the Wild database (Huang et al., 2007) and the Utrecht Multi-Person Motion (UMPM) benchmark (Van der Aa et al., 2011).

7.1. Facial image data

We apply the proposed Bayesian tensor-on-tensor regression approach, with $b=0.05$, to predict different attributes of facial images, such as smiling and nose size, using the Labeled Faces in the Wild database (Huang et al., 2007). The database collects more than 13,000 face images, each of which has been labeled with the name of the person pictured, often a celebrity. There can be multiple images of one person.

We use the estimation results from the attribute classifiers developed in Kumar et al. (2011) for response Y, resulting in a total of 73 describable attributes. These attributes can be categorized into characteristics that describe a person, an expression, or an accessory. The attributes are all given as continuous variables, with a higher value denoting a more obvious characteristic. The proposed approach is applied to predict the 73 attributes from a given facial image X. In this work, we use the frontalized version of facial images (Hassner et al., 2015), which show only forward-facing faces obtained by rotating, scaling, and cutting original facial images. The frontalized images are highly aligned, allowing for appearances to be easily compared across faces. Each frontalized image contains 90 × 90 pixels, with each pixel giving color intensities for red, green, and blue, resulting in a 90 × 90 × 3 tensor for each image. We randomly sample 1000 images. Thus the predictor tensor X is of dimensions 1000 × 90 × 90 × 3, and the response tensor Y is of dimensions 1000 × 73. We center the tensor for each image by subtracting the mean of tensors for all images. Another randomly-sampled 1000 images are used as a validation set, in other words, Xnew is of dimensions 1000 × 90 × 90 × 3, and Ynew is of dimensions 1000 × 73.

We apply the BayTensor MCMC in Algorithm 1 and BayTensor Fast in Algorithm 2 to the dataset and conduct inference. For comparison, we also apply the CP method proposed by Lock (2018) to the same dataset. For the CP method, we choose R=15 and λ=105 since these values yielded the best prediction performance, as reported in Lock (2018). BayTensor MCMC estimates the dimension of the core tensor based on the posterior mode to be (5, 2, 3, 5) with a total of 1154 parameters, resulting in an RPE of 0.375. And BayTensor Fast yields an RPE of 0.446. In contrast, the CP method has 3840 parameters and results in a higher RPE, 0.477. To summarize, the proposed Bayesian tensor-on-tensor regression approach is able to reduce predictive errors with a smaller number of parameters due to the flexibility of the Tucker decomposition that permits different orders along different modes.

Next we report the prediction uncertainty, which is a natural byproduct of the proposed Bayesian framework. We collect 500 post-burn-in MCMC samples and compute the posterior predictive distribution along with credible intervals for each of the 73 attributes for each image. The empirical coverage rate for 95% credible interval is 0.930, and for 90% credible interval is 0.889. Figure 4 shows an example of a test image, and plots the corresponding posterior predictive values for four randomly-selected characteristics.

Figure 4. An example of a test image and the corresponding posterior predictive values for four selected characteristics.

7.2. Utrecht multi-person motion (UMPM) data

We then apply the proposed Bayesian tensor-on-tensor regression model, with b=0.05, to the multi-person motion (UMPM) benchmark (Van der Aa et al., 2011) that contains temporally synchronized video sequences from multiple viewpoints and human motion capture data. Each video is of length 30 to 60 seconds, with resolution of 644 × 484 pixels at 50 fps (frames per second). Motion capture (Mo-Cap) data contain 3D positions of 37 markers at 100 fps for each subject of interest.

To evaluate the performance of our model on 3D motion data, we consider two scenarios from the UMPM dataset, 'chair' and 'triangle'. In the 'chair' scenario, we have a sequence of 2570 sample images in which the subject of interest starts walking in a circle, finds the chair, and sits on the chair and stands up multiple times with different postures. In the 'triangle' scenario, we have a sequence of 2471 sample images in which the subject walks along the path of a triangle within a circular area. In our data analysis, the input data is a grayscale image sequence from the front camera with resolution downsized to 32 × 24 pixels, forming a third-order predictor tensor (i.e., frames × width × height). The response data containing the 3D positions of 37 markers are first downsampled to 50 fps to match the video, and are then arranged as a third-order tensor (i.e., samples × 3D position × markers). For each scenario, we run 10 repeated experiments, and in each experiment randomly sample 200 images from the sequence to form the predictor tensor $\mathbb{X}$ of dimensions 200 × 32 × 24 and the response tensor $\mathbb{Y}$ of dimensions 200 × 3 × 37. The remaining images are used as testing data $\mathbb{X}_{\text{new}}$ and $\mathbb{Y}_{\text{new}}$.

We apply BayTensor MCMC, BayTensor Fast, and the CP method to the datasets. For the 'chair' scenario, the mean prediction RPE of BayTensor MCMC is 0.176 (sd: 0.009), while those of BayTensor Fast and the CP method are 0.206 (sd: 0.053) and 0.254 (sd: 0.064) respectively. The mean number of parameters for BayTensor MCMC is 752 (sd: 58); for BayTensor Fast, the number is 798 (sd: 132); and for the CP method, with $R=10$, the number of parameters is 960. For the 'triangle' scenario, BayTensor MCMC yields a mean RPE of 0.243 (sd: 0.049) with a mean number of parameters of 685 (sd: 127). BayTensor Fast, with a mean number of parameters of 798 (sd: 132), has a slightly larger RPE of 0.252 (sd: 0.028). In contrast, the CP method, with $R=10$, has 960 parameters and results in a worse RPE, 0.346 (sd: 0.049). To summarize, with fewer parameters to estimate, the proposed Bayesian tensor-on-tensor regression model yields smaller predictive errors than the CP method, and this improvement comes from the flexibility of the Tucker decomposition.

To demonstrate the predictive uncertainty, we generate 500 post-burn-in posterior samples, produce the empirical posterior predictive distribution of the 3D position for each of the 37 markers, and calculate symmetric credible intervals. The empirical coverage rate of the 95% credible intervals is 0.957 for the 'chair' scenario and 0.931 for the 'triangle' scenario. Figure 5 shows an example of the estimated 3D position of marker #3 from the BayTensor MCMC method with 95% credible intervals, BayTensor Fast, and the CP method.

Figure 5. Predictions of the x-, y-, and z-position of marker 3 by the Bayesian tensor-on-tensor regression method with 95% credible intervals, the fast computing method, and the CP method. The underlying truths are shown in red.

8. DISCUSSION

We developed a Bayesian tensor-on-tensor regression model to predict one tensor response from another tensor predictor, building upon the flexible Tucker decomposition of the coefficient tensor so that the response tensor and predictor tensor can have different dimensions in the core tensor. For posterior inference, we proposed an efficient MCMC algorithm to simultaneously estimate the model dimension and model parameters. In addition, we developed an ultra-fast computing algorithm wherein the MAP estimators of the model parameters are computed given the model dimension, and an SA algorithm is used to find the optimal model dimension. Both simulation studies and real data analyses show that the performance of our Bayesian tensor-on-tensor regression model benefits from the flexibility of the core tensor structure compared to alternative methods. The fast computing algorithm yielded prediction results comparable to the MCMC sampler while saving a significant amount of computing time. As the fast computing algorithm is only guaranteed to converge to a local optimum, in practice it can be repeatedly applied with different initial values to search for the global optimum. To the best of our knowledge, this work represents the first effort in the literature to simultaneously estimate the model dimension and parameters in the tensor-on-tensor regression setting under a fully Bayesian framework.

There are several interesting future directions. First, instead of assigning normal priors to the factor matrices, the proposed framework can easily incorporate other priors, such as sparsity-inducing priors (Guhaniyogi et al., 2017; Miranda et al., 2018), for applications where sparsity is desired. More general covariance structures for $\mathbb{E}$ will also be considered. Second, this work only considers one tensor predictor. Some applications may require predicting one tensor response from multiple tensor predictors or mixed-type predictors including tensors, matrices, and vectors. We will extend the proposed model to handle more flexible predictors. Third, the proposed BayTensor Fast algorithm provides a general algorithmic framework for estimating the model dimension as well as the model parameters. In this work, we consider the SA algorithm and the BIC; as a future direction, we plan to explore alternative optimization methods beyond SA and target functions beyond the BIC. Finally, the computational cost of the proposed MCMC sampler is the price paid for prediction accuracy and uncertainty quantification relative to the proposed fast computing algorithm. To improve the efficiency of posterior computation, we plan to take advantage of advanced computational techniques, e.g., stochastic variational inference (Hoffman et al., 2013) and pseudo-marginal Metropolis-Hastings algorithms (Andrieu and Roberts, 2009).

Figure 1. Illustration of Tucker decomposition. Here, the core tensor is of dimension (4, 3, 3).

ACKNOWLEDGMENTS

This work was supported by NSF 1940107, NSF 1918854, and NIH R01MH128085 to Dr. Xu.

APPENDIX

A.1. A brief review of matrix Kronecker product

The Kronecker product is a matrix operation that is important for deriving the posterior distributions of the parameters in this paper; we briefly review it here.

The Kronecker product of matrices $\mathbf{A}\in\mathbb{R}^{I\times J}$ and $\mathbf{B}\in\mathbb{R}^{K\times L}$ is denoted by $\mathbf{A}\otimes\mathbf{B}$. The result is of size $(IK)\times(JL)$ and is defined by

$$\mathbf{A}\otimes\mathbf{B} = \begin{pmatrix} a_{11}\mathbf{B} & a_{12}\mathbf{B} & \cdots & a_{1J}\mathbf{B} \\ a_{21}\mathbf{B} & a_{22}\mathbf{B} & \cdots & a_{2J}\mathbf{B} \\ \vdots & \vdots & \ddots & \vdots \\ a_{I1}\mathbf{B} & a_{I2}\mathbf{B} & \cdots & a_{IJ}\mathbf{B} \end{pmatrix}.$$

Several properties of the Kronecker product are useful in this paper; detailed proofs can be found in Kolda (2006). In particular,

$$(\mathbf{A}\otimes\mathbf{B})(\mathbf{C}\otimes\mathbf{D}) = \mathbf{A}\mathbf{C}\otimes\mathbf{B}\mathbf{D}.$$

Let $\mathbb{X}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$, and let $\mathcal{N}=\{1,\dots,N\}$. Let $\mathbf{A}_{(n)}\in\mathbb{R}^{I_n\times J_n}$ be a sequence of matrices for all $n\in\mathcal{N}$. Let the ordered sets $\mathcal{R}=\{r_1,\dots,r_L\}$ and $\mathcal{C}=\{c_1,\dots,c_M\}$ be a partition of $\mathcal{N}$. Then if

$$\mathbb{X} = \mathbb{Y}\times_1\mathbf{A}_{(1)}\times_2\mathbf{A}_{(2)}\cdots\times_N\mathbf{A}_{(N)},$$

we have

$$\mathbf{X}_{(\mathcal{R}\times\mathcal{C})} = \big(\mathbf{A}_{(r_L)}\otimes\cdots\otimes\mathbf{A}_{(r_1)}\big)\,\mathbf{Y}_{(\mathcal{R}\times\mathcal{C})}\,\big(\mathbf{A}_{(c_M)}\otimes\cdots\otimes\mathbf{A}_{(c_1)}\big)^T.$$

Consequently, if $\mathbf{A}_{(n)}\in\mathbb{R}^{I_n\times J_n}$ for all $n\in\mathcal{N}$, then for any specific $n\in\mathcal{N}$, if

$$\mathbb{X} = \mathbb{Y}\times_1\mathbf{A}_{(1)}\times_2\mathbf{A}_{(2)}\cdots\times_N\mathbf{A}_{(N)},$$

then

$$\mathbf{X}_{(n)} = \mathbf{A}_{(n)}\,\mathbf{Y}_{(n)}\,\big(\mathbf{A}_{(N)}\otimes\cdots\otimes\mathbf{A}_{(n+1)}\otimes\mathbf{A}_{(n-1)}\otimes\cdots\otimes\mathbf{A}_{(1)}\big)^T. \qquad (22)$$

A.2. Proof for equation (6)

By the properties of the $n$-mode product and the Tucker decomposition, we have

$$\mathbb{B} = \mathbb{G}\times_2\mathbf{U}_{(2)}\cdots\times_L\mathbf{U}_{(L)}\times_{L+1}\mathbf{V}_{(1)}\cdots\times_{L+M}\mathbf{V}_{(M)}\times_1\mathbf{U}_{(1)}.$$

Let $\mathbb{B}^{(-)}$ denote $\mathbb{G}\times_2\mathbf{U}_{(2)}\cdots\times_L\mathbf{U}_{(L)}\times_{L+1}\mathbf{V}_{(1)}\cdots\times_{L+M}\mathbf{V}_{(M)}$; then $\mathbb{B}^{(-)}\in\mathbb{R}^{R_1\times P_2\times\cdots\times P_L\times Q_1\times\cdots\times Q_M}$, and

$$\mathbb{B} = \mathbb{B}^{(-)}\times_1\mathbf{U}_{(1)}.$$

Therefore,

$$\mathbb{B}[p_1,\dots,p_L,q_1,\dots,q_M] = \sum_{r_1=1}^{R_1}\mathbb{B}^{(-)}[r_1,p_2,\dots,p_L,q_1,\dots,q_M]\,\big(\mathbf{U}_{(1)}\big)_{r_1p_1},$$

and

$$\begin{aligned} \hat{\mathbb{Y}}[n,q_1,\dots,q_M] &= \big(\langle\mathbb{X},\mathbb{B}\rangle_L\big)[n,q_1,\dots,q_M] \\ &= \sum_{p_1=1}^{P_1}\cdots\sum_{p_L=1}^{P_L}\mathbb{B}[p_1,\dots,p_L,q_1,\dots,q_M]\,\mathbb{X}[n,p_1,\dots,p_L] \\ &= \sum_{p_1=1}^{P_1}\cdots\sum_{p_L=1}^{P_L}\sum_{r_1=1}^{R_1}\mathbb{B}^{(-)}[r_1,p_2,\dots,p_L,q_1,\dots,q_M]\,\big(\mathbf{U}_{(1)}\big)_{r_1p_1}\mathbb{X}[n,p_1,\dots,p_L] \\ &= \sum_{r_1=1}^{R_1}\sum_{p_1=1}^{P_1}\Big(\sum_{p_2=1}^{P_2}\cdots\sum_{p_L=1}^{P_L}\mathbb{B}^{(-)}[r_1,p_2,\dots,p_L,q_1,\dots,q_M]\,\mathbb{X}[n,p_1,\dots,p_L]\Big)\big(\mathbf{U}_{(1)}\big)_{r_1p_1} \\ &= \sum_{r_1=1}^{R_1}\sum_{p_1=1}^{P_1}\big(\langle\mathbb{B}^{(-)},\mathbb{X}\rangle_{P_2,\dots,P_L}\big)[n,p_1,r_1,q_1,\dots,q_M]\,\big(\mathbf{U}_{(1)}\big)_{r_1p_1}. \end{aligned}$$

Further,

$$\mathrm{vec}(\hat{\mathbb{Y}}) = \mathbf{C}_{(\mathcal{R}\times\mathcal{C})}\,\mathrm{vec}(\mathbf{U}_{(1)}).$$

If we denote the contracted product $\langle\mathbb{B}^{(-)},\mathbb{X}\rangle_{P_2,\dots,P_L}$ by a new tensor $\mathbb{C}$, then $\mathbb{C}\in\mathbb{R}^{R_1\times N\times P_1\times Q_1\times\cdots\times Q_M}$, and we matricize $\mathbb{C}$ into $\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}\in\mathbb{R}^{N\prod_{m=1}^M Q_m\times R_1P_1}$, where $\mathcal{R}=\{N,Q_1,\dots,Q_M\}$ and $\mathcal{C}=\{R_1,P_1\}$. We therefore have

$$\mathrm{vec}(\mathbb{Y}) = \mathbf{C}_{(\mathcal{R}\times\mathcal{C})}\,\mathrm{vec}(\mathbf{U}_{(1)}) + \mathrm{vec}(\mathbb{E}).$$

A.3. Proof for equation (8)

We denote the tensor $\mathbb{D}^{(-)} = \mathbb{G}\times_1\mathbf{U}_{(1)}\cdots\times_L\mathbf{U}_{(L)}\times_{L+2}\mathbf{V}_{(2)}\cdots\times_{L+M}\mathbf{V}_{(M)}$. The contracted product of the tensor $\mathbb{D}^{(-)}$ and $\mathbb{X}$ is a new tensor denoted by $\mathbb{D}$. Then

$$\mathbb{B} = \mathbb{D}^{(-)}\times_{L+1}\mathbf{V}_{(1)}.$$

It follows that

$$\mathbb{B}[p_1,\dots,p_L,q_1,\dots,q_M] = \sum_{s_1=1}^{S_1}\mathbb{D}^{(-)}[p_1,\dots,p_L,s_1,q_2,\dots,q_M]\,\big(\mathbf{V}_{(1)}\big)_{s_1q_1},$$

and

$$\begin{aligned} \hat{\mathbb{Y}}[n,q_1,\dots,q_M] &= \big(\langle\mathbb{X},\mathbb{B}\rangle_L\big)[n,q_1,\dots,q_M] \\ &= \sum_{p_1=1}^{P_1}\cdots\sum_{p_L=1}^{P_L}\mathbb{B}[p_1,\dots,p_L,q_1,\dots,q_M]\,\mathbb{X}[n,p_1,\dots,p_L] \\ &= \sum_{p_1=1}^{P_1}\cdots\sum_{p_L=1}^{P_L}\sum_{s_1=1}^{S_1}\mathbb{D}^{(-)}[p_1,\dots,p_L,s_1,q_2,\dots,q_M]\,\big(\mathbf{V}_{(1)}\big)_{s_1q_1}\mathbb{X}[n,p_1,\dots,p_L] \\ &= \sum_{s_1=1}^{S_1}\Big(\sum_{p_1=1}^{P_1}\cdots\sum_{p_L=1}^{P_L}\mathbb{D}^{(-)}[p_1,\dots,p_L,s_1,q_2,\dots,q_M]\,\mathbb{X}[n,p_1,\dots,p_L]\Big)\big(\mathbf{V}_{(1)}\big)_{s_1q_1}. \end{aligned} \qquad (23)$$

By equation (23), we have

$$\hat{\mathbb{Y}} = \mathbb{D}\times_2\mathbf{V}_{(1)}. \qquad (24)$$

Combining (22) and equation (24) results in

$$\mathbf{Y}_{(2)} = \mathbf{V}_{(1)}\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}^T + \mathbf{E}_{(2)},$$

where $\mathbf{Y}_{(2)}\in\mathbb{R}^{Q_1\times N\prod_{m=2}^M Q_m}$ is the matricization of the tensor $\mathbb{Y}$ and $\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}\in\mathbb{R}^{N\prod_{m=2}^M Q_m\times S_1}$ is the matricization of the tensor $\mathbb{D}$.

A.4. Conditional posterior distributions given training fraction b

Given the dimension $\theta=(R_1,\dots,R_L,S_1,\dots,S_M)$ of the core tensor $\mathbb{G}$ and the training fraction $b$, we first derive the full conditional posterior distributions of $\{\mathbf{U}_{(l)}\}_{l=1}^L$, $\{\mathbf{V}_{(m)}\}_{m=1}^M$, $\mathbb{G}$, and $\sigma^2$ in closed form. Without loss of generality, we first derive the full conditional posterior distribution of $\mathbf{U}_{(1)}$. The full conditional posterior distributions of $\mathbf{U}_{(2)},\dots,\mathbf{U}_{(L)}$ can be derived in the same manner.

By equation (6), we have

$$\mathrm{vec}(\mathbb{Y}) = \mathbf{C}_{(\mathcal{R}\times\mathcal{C})}\,\mathrm{vec}(\mathbf{U}_{(1)}) + \mathrm{vec}(\mathbb{E}),$$

where $\mathcal{R}=\{N,Q_1,\dots,Q_M\}$ and $\mathcal{C}=\{R_1,P_1\}$. Combined with the fact that $\mathbf{U}_{(1)}$ is drawn from a distribution proportional to $p(\mathbb{Y}\mid\{\mathbf{U}_{(l)}\}_{l=1}^L,\{\mathbf{V}_{(m)}\}_{m=1}^M,\mathbb{G},\sigma^2,\theta)^b\times p(\mathbf{U}_{(1)}\mid\theta)$, the conditional posterior distribution of $\mathrm{vec}(\mathbf{U}_{(1)})$ given all other parameters and $b$ is normal:

$$\mathrm{vec}(\mathbf{U}_{(1)}) \sim N(\boldsymbol{\mu}_U^*,\boldsymbol{\Sigma}_U^*), \qquad (25)$$

where

$$\boldsymbol{\Sigma}_U^* = \left(\frac{b\,\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}^T\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}}{\sigma^2} + \boldsymbol{\Sigma}_U^{-1}\right)^{-1}, \qquad \boldsymbol{\mu}_U^* = \boldsymbol{\Sigma}_U^*\left(\frac{b\,\mathbf{C}_{(\mathcal{R}\times\mathcal{C})}^T\mathrm{vec}(\mathbb{Y})}{\sigma^2} + \boldsymbol{\Sigma}_U^{-1}\boldsymbol{\mu}_U\right).$$

We then derive the conditional posterior distribution of $\mathbf{V}_{(m)}$ given $\sigma^2$, $\{\mathbf{U}_{(l)}\}_{l=1}^L$, $\mathbf{V}_{(k)}$ for $k\neq m$, and $\mathbb{G}$. Without loss of generality, we derive the full conditional posterior distribution of $\mathbf{V}_{(1)}$ below.

Denote the contracted product of the tensor $\mathbb{G}\times_1\mathbf{U}_{(1)}\cdots\times_L\mathbf{U}_{(L)}\times_{L+2}\mathbf{V}_{(2)}\cdots\times_{L+M}\mathbf{V}_{(M)}$ and the tensor $\mathbb{X}$ by a new tensor $\mathbb{D}$, where $\mathbb{D}\in\mathbb{R}^{N\times S_1\times Q_2\times\cdots\times Q_M}$. We then matricize $\mathbb{D}$ into a matrix $\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}\in\mathbb{R}^{N\prod_{m=2}^M Q_m\times S_1}$ and write

$$\mathbf{Y}_{(2)} = \mathbf{V}_{(1)}\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}^T + \mathbf{E}_{(2)}, \qquad (26)$$

where $\mathbf{Y}_{(2)}\in\mathbb{R}^{Q_1\times N\prod_{m=2}^M Q_m}$ is the matricization of the tensor $\mathbb{Y}$. Let $\tilde{\mathbf{Y}}=\mathbf{Y}_{(2)}^T$; we can rewrite (26) as

$$\mathrm{vec}(\tilde{\mathbf{Y}}) = \big(\mathbf{I}_{Q_1}\otimes\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}\big)\,\mathrm{vec}(\mathbf{V}_{(1)}^T) + \mathrm{vec}(\mathbf{E}_{(2)}^T),$$

where $\mathbf{I}_{Q_1}$ denotes an identity matrix of size $Q_1$. Given that the prior distribution of $\mathrm{vec}(\mathbf{V}_{(1)})$ is a normal $N(\boldsymbol{\mu}_V,\boldsymbol{\Sigma}_V)$ with diagonal $\boldsymbol{\Sigma}_V$, the prior distribution of $\mathrm{vec}(\mathbf{V}_{(1)}^T)$ is also a normal distribution $N(\tilde{\boldsymbol{\mu}}_V,\tilde{\boldsymbol{\Sigma}}_V)$ with a diagonal covariance matrix. Then the full conditional posterior distribution of $\mathrm{vec}(\mathbf{V}_{(1)}^T)$ is also normal:

$$p\big(\mathrm{vec}(\mathbf{V}_{(1)}^T)\mid \mathrm{vec}(\tilde{\mathbf{Y}}),\mathbb{X},\sigma^2,\{\mathbf{U}_{(l)}\},\{\mathbf{V}_{(m)}\}_{m\neq 1}\big) \sim N(\tilde{\boldsymbol{\mu}}_V^*,\tilde{\boldsymbol{\Sigma}}_V^*), \qquad (27)$$

where

$$\tilde{\boldsymbol{\Sigma}}_V^* = \left(\frac{b\,\big(\mathbf{I}_{Q_1}\otimes\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}\big)^T\big(\mathbf{I}_{Q_1}\otimes\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}\big)}{\sigma^2} + \tilde{\boldsymbol{\Sigma}}_V^{-1}\right)^{-1}, \qquad \tilde{\boldsymbol{\mu}}_V^* = \tilde{\boldsymbol{\Sigma}}_V^*\left(\frac{b\,\big(\mathbf{I}_{Q_1}\otimes\mathbf{D}_{(\mathcal{R}\times\mathcal{C})}\big)^T\mathrm{vec}(\tilde{\mathbf{Y}})}{\sigma^2} + \tilde{\boldsymbol{\Sigma}}_V^{-1}\tilde{\boldsymbol{\mu}}_V\right). \qquad (28)$$

By analogous arguments, the posterior distribution of $\mathrm{vec}(\mathbf{G}_{(\mathcal{R}\times\mathcal{C})})$ is also normal, with

$$\tilde{\boldsymbol{\mu}}_G^* = \tilde{\boldsymbol{\Sigma}}_G^*\Big(\big(\mathbf{I}_S\otimes\mathbf{X}_{(1)}\mathbf{U}\big)^T\boldsymbol{\Sigma}_{\tilde{\tilde{Y}}}^{-1}\big(b\,\mathrm{vec}(\tilde{\tilde{\mathbf{Y}}})\big) + \tilde{\boldsymbol{\Sigma}}_G^{-1}\tilde{\boldsymbol{\mu}}_G\Big), \qquad \tilde{\boldsymbol{\Sigma}}_G^* = \Big(b\,\big(\mathbf{I}_S\otimes\mathbf{X}_{(1)}\mathbf{U}\big)^T\boldsymbol{\Sigma}_{\tilde{\tilde{Y}}}^{-1}\big(\mathbf{I}_S\otimes\mathbf{X}_{(1)}\mathbf{U}\big) + \tilde{\boldsymbol{\Sigma}}_G^{-1}\Big)^{-1}, \qquad (29)$$

where $\mathbf{I}_S$ denotes an $S\times S$ identity matrix with $S=\prod_{m=1}^M S_m$. For $\sigma^2$, the posterior given all other parameters and $b$ also follows an inverse gamma distribution:

$$\sigma^2 \sim IG(\alpha^*,\beta^*), \qquad (30)$$

with $\alpha^* = \alpha + \frac{bNQ}{2}$, $\beta^* = \beta + \frac{b\,\|\mathbb{Y}-\langle\mathbb{X},\mathbb{B}\rangle_L\|_F^2}{2}$, and $Q=\prod_{m=1}^M Q_m$.

A.5. Proof for equation (15)

The fractional Bayes factor, Eq. (11) of O'Hagan (1995), is given by

$$B_b(\mathbb{Y}) = \underbrace{\frac{\int p(\tilde{\xi}\mid\tilde{\theta})\,p(\mathbb{Y}\mid\tilde{\xi},\tilde{\theta})\,d\tilde{\xi}}{\int p(\tilde{\xi}\mid\tilde{\theta})\,p(\mathbb{Y}\mid\tilde{\xi},\tilde{\theta})^b\,d\tilde{\xi}}}_{(**)}\times\frac{\int p(\tilde{\tilde{\xi}}\mid\theta^{(t-1)})\,p(\mathbb{Y}\mid\tilde{\tilde{\xi}},\theta^{(t-1)})^b\,d\tilde{\tilde{\xi}}}{\int p(\tilde{\tilde{\xi}}\mid\theta^{(t-1)})\,p(\mathbb{Y}\mid\tilde{\tilde{\xi}},\theta^{(t-1)})\,d\tilde{\tilde{\xi}}}.$$

We note that

$$(**) = \frac{\int \overbrace{p(\tilde{\xi}\mid\tilde{\theta})\,p(\mathbb{Y}\mid\tilde{\xi},\tilde{\theta})^b}^{(***)}\,p(\mathbb{Y}\mid\tilde{\xi},\tilde{\theta})^{1-b}\,d\tilde{\xi}}{\int p(\tilde{\xi}\mid\tilde{\theta})\,p(\mathbb{Y}\mid\tilde{\xi},\tilde{\theta})^b\,d\tilde{\xi}}.$$

By techniques similar to those in O'Hagan (1995), we can regard $p(\mathbb{Y}\mid\tilde{\xi},\tilde{\theta})^b$ as $p(\mathbb{Y}^{(b)}\mid\tilde{\xi},\tilde{\theta})$, the likelihood based on $\mathbb{Y}^{(b)}$, the training portion comprising a fraction $b$ of the data $\mathbb{Y}$, and rewrite $(***)$ as

$$(***) = p(\tilde{\xi}\mid\tilde{\theta})\,p(\mathbb{Y}^{(b)}\mid\tilde{\xi},\tilde{\theta}) = p(\tilde{\xi}\mid\mathbb{Y}^{(b)},\tilde{\theta})\int p(\tilde{\xi}\mid\tilde{\theta})\,p(\mathbb{Y}^{(b)}\mid\tilde{\xi},\tilde{\theta})\,d\tilde{\xi} = p(\tilde{\xi}\mid\mathbb{Y}^{(b)},\tilde{\theta})\int p(\tilde{\xi}\mid\tilde{\theta})\,p(\mathbb{Y}\mid\tilde{\xi},\tilde{\theta})^b\,d\tilde{\xi}.$$

Then $B_b(\mathbb{Y})$ becomes

$$B_b(\mathbb{Y}) = \frac{\int p(\tilde{\xi}\mid\mathbb{Y}^{(b)},\tilde{\theta})\,p(\mathbb{Y}\mid\tilde{\xi},\tilde{\theta})^{1-b}\,d\tilde{\xi}}{\int p(\tilde{\tilde{\xi}}\mid\mathbb{Y}^{(b)},\theta^{(t-1)})\,p(\mathbb{Y}\mid\tilde{\tilde{\xi}},\theta^{(t-1)})^{1-b}\,d\tilde{\tilde{\xi}}}. \qquad (31)$$

The $(*)$ part in (15) evaluates the fractional Bayes factor $B_b(\mathbb{Y})$ at one sample of $\tilde{\xi}$ and $\tilde{\tilde{\xi}}$ instead of integrating them out as in (31).

A.6. More general covariance structures for prior distributions and noise

In the main manuscript, we assign normal priors with diagonal covariance matrices to $\{\mathbf{U}_{(l)}\}_{l=1}^L$, $\{\mathbf{V}_{(m)}\}_{m=1}^M$, and $\mathbb{G}$, and show that the full conditional posterior distributions can be derived in closed form. When the covariance matrices $\boldsymbol{\Sigma}_U$, $\boldsymbol{\Sigma}_V$, and $\boldsymbol{\Sigma}_G$ have general structures, we can still derive the full conditional posteriors in closed form by rearranging the elements of the covariance matrices according to the way we unfold the matrices $\mathbf{U}$ and $\mathbf{V}$ and the tensor $\mathbb{G}$. For example, given that the prior of $\mathrm{vec}(\mathbf{V}_{(1)})$ is a normal distribution with an arbitrary covariance matrix $\boldsymbol{\Sigma}_V$, to update $\mathrm{vec}(\mathbf{V}_{(1)}^T)$ we need the covariance matrix of $\mathrm{vec}(\mathbf{V}_{(1)}^T)$, obtained by rearranging the elements of $\boldsymbol{\Sigma}_V$; the posterior distribution of $\mathrm{vec}(\mathbf{V}_{(1)}^T)$ then has the same form as in equations (9) and (10). Similarly, to update $\mathbb{G}$ we need the covariance matrix of $\mathrm{vec}(\mathbf{G}_{(\mathcal{R}\times\mathcal{C})})$, denoted by $\tilde{\boldsymbol{\Sigma}}_G$, which can be obtained by rearranging the elements of $\boldsymbol{\Sigma}_G$, the covariance matrix of $\mathrm{vec}(\mathbb{G})$.

For the noise term vec(E), we assume in the main manuscript that vec(E) follows a normal distribution with diagonal covariance matrix σ^2 I_{NQ}. Here we consider an alternative construction: vec(E) ~ N(0, Σ_E) with Σ_E = Σ ⊗ I_N in Kronecker-product form, and an Inverse-Wishart prior on Σ, i.e., Σ ~ IW(Ψ, ν). Under this prior setup, the posterior updates of vec(U^(l)), vec(V^(m)), vec(G), and Σ remain in closed form.
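Before stating these updates, note that the Kronecker structure Σ_E = Σ ⊗ I_N is easy to work with in simulation: under the column-major vec operator it says that the rows of the N × Q matricization E_(1) are i.i.d. N(0, Σ). A small illustrative sketch (all dimensions made up):

```python
import numpy as np

# Noise with Kronecker covariance Sigma_E = Sigma (x) I_N: equivalently,
# the rows of the N x Q matrix E_(1) are i.i.d. N(0, Sigma).
rng = np.random.default_rng(6)
N, Q = 8, 3                                   # hypothetical dimensions
A = rng.normal(size=(Q, Q))
Sigma = A @ A.T                               # Q x Q noise covariance
E1 = rng.multivariate_normal(np.zeros(Q), Sigma, size=N)   # N x Q noise matrix
vec_E = E1.flatten(order="F")                 # vec(E_(1)), Cov = Sigma (x) I_N
```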

Specifically, the full conditional posterior distribution of vec(U^(1)) is normally distributed:

p\left(\mathrm{vec}(U^{(1)}) \mid \mathrm{vec}(Y), X, \Sigma_E, \{V^{(m)}\}, \{U^{(l)}\}_{l\neq 1}\right) \sim N(\mu_U^{*}, \Sigma_U^{*}),

where

\Sigma_U^{*} = \left( C_{(\mathcal{R}\times\mathcal{C})}^{T} \Sigma_E^{-1} C_{(\mathcal{R}\times\mathcal{C})} + \Sigma_U^{-1} \right)^{-1}, \qquad \mu_U^{*} = \Sigma_U^{*} \left( C_{(\mathcal{R}\times\mathcal{C})}^{T} \Sigma_E^{-1} \mathrm{vec}(Y) + \Sigma_U^{-1} \mu_U \right).
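The only change relative to the diagonal-noise case is that σ^{-2} I is replaced by Σ_E^{-1}. A generic sketch of this generalized-least-squares style update, mirroring the display above, is given below; the function name and toy dimensions are ours.

```python
import numpy as np

def gaussian_full_conditional_general(C, y, Sigma_E, prior_mean, prior_cov):
    """Full-conditional mean and covariance of vec(U^(1)) under a general
    noise covariance Sigma_E, mirroring the display above."""
    Einv_C = np.linalg.solve(Sigma_E, C)       # Sigma_E^{-1} C
    Einv_y = np.linalg.solve(Sigma_E, y)       # Sigma_E^{-1} vec(Y)
    prior_prec = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(C.T @ Einv_C + prior_prec)
    post_mean = post_cov @ (C.T @ Einv_y + prior_prec @ prior_mean)
    return post_mean, post_cov

# Toy usage with hypothetical dimensions.
rng = np.random.default_rng(7)
C, y = rng.normal(size=(30, 4)), rng.normal(size=30)
L = rng.normal(size=(30, 30))
Sigma_E = L @ L.T + 0.1 * np.eye(30)           # an arbitrary SPD noise covariance
mu, Sig = gaussian_full_conditional_general(C, y, Sigma_E,
                                            np.zeros(4), np.eye(4))
```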

Similarly, the full conditional posterior distribution of vec(V^(1)T) is normally distributed:

p\left(\mathrm{vec}(V^{(1)T}) \mid \mathrm{vec}(\tilde{Y}), X, \Sigma_E, \{U^{(l)}\}, \{V^{(m)}\}_{m\neq 1}\right) \sim N(\tilde{\mu}_V^{*}, \tilde{\Sigma}_V^{*}),

where

\tilde{\Sigma}_V^{*} = \left( (I_{Q_1} \otimes D_{(\mathcal{R}\times\mathcal{C})})^{T} \tilde{\Sigma}^{-1} (I_{Q_1} \otimes D_{(\mathcal{R}\times\mathcal{C})}) + \tilde{\Sigma}_V^{-1} \right)^{-1}, \qquad \tilde{\mu}_V^{*} = \tilde{\Sigma}_V^{*} \left( (I_{Q_1} \otimes D_{(\mathcal{R}\times\mathcal{C})})^{T} \tilde{\Sigma}^{-1} \mathrm{vec}(\tilde{Y}) + \tilde{\Sigma}_V^{-1} \tilde{\mu}_V \right),

where Σ̃ is the covariance matrix of vec(E_(2)^T), obtained by rearranging the elements of Σ_E.

The derivation of the full conditional posterior of vec(G_(ℛ×𝒞)) is more involved than those of U^(l) and V^(m). Starting from

\mathrm{vec}(\tilde{\tilde{Y}}) = (I_S \otimes X_{(1)} U)\, \mathrm{vec}(G_{(\mathcal{R}\times\mathcal{C})}) + \mathrm{vec}(\tilde{\tilde{E}}),

where Ẽ̃ = E_(1) V (V^T V)^{-1}, we can write vec(Ẽ̃) = ((V^T V)^{-1} V^T ⊗ I_N) vec(E). Then the covariance matrix of vec(Ẽ̃) becomes

\tilde{\tilde{\Sigma}} = \tilde{V} \Sigma_E \tilde{V}^{T},

where Ṽ = (V^T V)^{-1} V^T ⊗ I_N. The full conditional posterior distribution of vec(G_(ℛ×𝒞)) is a normal distribution with

\tilde{\mu}_G^{*} = \tilde{\Sigma}_G^{*} \left( (I_S \otimes X_{(1)} U)^{T} \tilde{\tilde{\Sigma}}^{-1} \mathrm{vec}(\tilde{\tilde{Y}}) + \tilde{\Sigma}_G^{-1} \tilde{\mu}_G \right), \qquad \tilde{\Sigma}_G^{*} = \left( (I_S \otimes X_{(1)} U)^{T} \tilde{\tilde{\Sigma}}^{-1} (I_S \otimes X_{(1)} U) + \tilde{\Sigma}_G^{-1} \right)^{-1}.
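A short sketch of assembling the covariance matrix in the preceding display from Σ_E = Σ ⊗ I_N and Ṽ as defined above; all dimensions are hypothetical, and dense Kronecker products are used purely for illustration.

```python
import numpy as np

# Building the covariance of vec(E~~) from Sigma_E = Sigma (x) I_N
# and V~ = (V^T V)^{-1} V^T (x) I_N; dimensions are hypothetical.
rng = np.random.default_rng(5)
N, Q, S = 10, 4, 2                            # Q = prod_m Q_m, S = prod_m S_m
V = rng.normal(size=(Q, S))                   # stands in for V = V^(M) (x) ... (x) V^(1)
A = rng.normal(size=(Q, Q))
Sigma = A @ A.T                               # Q x Q noise covariance
Sigma_E = np.kron(Sigma, np.eye(N))           # Sigma_E = Sigma (x) I_N, (NQ x NQ)
V_tilde = np.kron(np.linalg.inv(V.T @ V) @ V.T, np.eye(N))   # (NS x NQ)
Sigma_tt = V_tilde @ Sigma_E @ V_tilde.T      # covariance of vec(E~~), (NS x NS)
```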

Table A.1.

Mean RPE with SD on uncorrelated-t data

RPE (SD) BayTensor MCMC BayTensor Fast CP Method OLS Method SAS Method HS Method
θ* = (4, 4, 2, 2), SNR=2 0.353 (0.017) 0.374 (0.097) 0.407 (0.037) 1.019 (0.029) 0.968 (0.024) 1.058 (0.137)
θ* = (3, 3, 3, 3), SNR=2 0.344 (0.004) 0.350 (0.016) 0.415 (0.032) 1.014 (0.032) 0.958 (0.025) 1.109 (0.281)
θ* = (4, 4, 2, 2), SNR=5 0.173 (0.002) 0.208 (0.125) 0.235 (0.035) 0.747 (0.028) 0.949 (0.030) 1.104 (0.227)
θ* = (3, 3, 3, 3), SNR=5 0.172 (0.001) 0.182 (0.038) 0.256 (0.038) 0.747 (0.024) 0.931 (0.035) 1.173 (0.388)

Table A.2.

Core tensor dimension recovery and number of model parameters on uncorrelated-t data

BayTensor MCMC: Dimension Recovery, # Parameters (SD)    BayTensor Fast: Dimension Recovery, # Parameters (SD)
θ* = (4, 4, 2, 2), SNR=2 72% 205 (19) 42% 188 (29)
θ* = (3, 3, 3, 3), SNR=2 96% 218 (6) 58% 205 (26)
θ* = (4, 4, 2, 2), SNR=5 90% 213 (10) 68% 195 (31)
θ* = (3, 3, 3, 3), SNR=5 100% 219 (0) 82% 213 (24)

Lastly, the full conditional posterior of Σ is an Inverse-Wishart distribution with ν̃ = N + ν and Ψ̃ = S^T S + Ψ, where S = Y_(1) − X_(1) B_(𝒫×𝒬).
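A minimal sketch of the Σ draw (the helper name and toy dimensions are ours):

```python
import numpy as np
from scipy.stats import invwishart

def sample_Sigma(Y1, X1, B_mat, Psi, nu, rng):
    """Draw Sigma from its Inverse-Wishart full conditional with
    nu_tilde = N + nu and Psi_tilde = S^T S + Psi, where
    S = Y_(1) - X_(1) B_(P x Q) is the residual matrix."""
    S = Y1 - X1 @ B_mat
    N = Y1.shape[0]
    return invwishart.rvs(df=N + nu, scale=S.T @ S + Psi, random_state=rng)

# Hypothetical dimensions: N = 30 observations, 8 predictors, Q = 5 responses.
rng = np.random.default_rng(3)
Y1, X1 = rng.normal(size=(30, 5)), rng.normal(size=(30, 8))
B_mat = rng.normal(size=(8, 5))
Sigma_draw = sample_Sigma(Y1, X1, B_mat, Psi=np.eye(5), nu=7, rng=rng)
```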

A.7. Simulation results: uncorrelated-t setup

We present the detailed results of applying the proposed BayTensor MCMC, BayTensor Fast, and four alternative methods (the CP method, the OLS method, the SAS method, and the HS method) to datasets generated under the uncorrelated-t setup.

The prediction RPE results are presented in Table A.1. The RPEs under BayTensor MCMC and BayTensor Fast are comparable in all cases, and both outperform the CP method, followed by the SAS method and the OLS method. The HS method performs the worst, yielding the highest RPEs in all cases. When the signal-to-noise ratio increases from 2 to 5, the prediction accuracies improve under all methods except the HS method.

Table A.2 shows the empirical probabilities of dimension recovery and the average numbers of parameters required by BayTensor MCMC and BayTensor Fast across 50 replicated experiments. In all four cases, both BayTensor MCMC and BayTensor Fast require fewer parameters than the CP method. In terms of recovering the dimension of the core tensor, BayTensor MCMC attains higher empirical recovery probabilities than BayTensor Fast in all cases, and for both methods the recovery probabilities are higher when θ* = (3, 3, 3, 3) than when θ* = (4, 4, 2, 2).

We also observe that the RPEs under the uncorrelated-t setups are larger than those under the uncorrelated-normal setups for the BayTensor methods and the CP method. This is expected because the fitting model differs from the data-generating model. However, the increases in RPE under the BayTensor algorithms are small. Moreover, the dimension recovery rates and numbers of parameters are similar to those under the uncorrelated-normal setups. These results indicate that the proposed BayTensor algorithms perform reasonably well when the fitting model is misspecified.

A.8. Traceplots of the estimated number of parameters associated with θ in simulation study

We plot the estimated number of parameters associated with θ(t) versus iteration t for the first 100 iterations of the BayTensor MCMC algorithm for some randomly selected experiments, under the uncorrelated-normal simulation setup in Figure A.1 and under the correlated setup in Figure A.2.

A.9. Full simulation study results of CP method with different ranks

For each case of the simulation study, we ran the CP method with CP-rank R ranging from 1 to 6. Note that when R = 5 and 6, the total numbers of parameters in the CP method are 230 and 276, respectively, which exceed the true numbers of parameters (212 for the θ* = (4, 4, 2, 2) cases and 219 for the θ* = (3, 3, 3, 3) cases). We calculate the RPEs for each R value in all 50 repeated experiments under all simulation setups. The means and standard deviations of the RPEs, averaged over the 50 repeated experiments, are shown in Table A.3 for all 8 simulation setups.

Figure A.1.

The estimated number of parameters associated with θ(t) versus iteration t in the first 100 iterations of BayTensor MCMC algorithm for some randomly-selected experiments under the uncorrelated-normal simulation setup. The true number of parameters is shown by the red line. (a): SNR = 2, θ*=(4, 4, 2, 2). (b): SNR = 2, θ*=(3, 3, 3, 3). (c): SNR = 5, θ*=(4, 4, 2, 2). (d): SNR = 5, θ*=(3, 3, 3, 3).

Table A.3.

Mean RPE with SD for the CP method with different R values

RPE (SD) R = 1 R = 2 R = 3 R = 4 R = 5 R = 6
Uncorrelated Data
θ* = (4, 4, 2, 2), SNR=2 0.771(0.102) 0.627(0.087) 0.529(0.095) 0.463(0.061) 0.409(0.036) 0.377(0.014)
θ* = (3, 3, 3, 3), SNR=2 0.780(0.081) 0.623(0.061) 0.535(0.048) 0.466(0.038) 0.422(0.024) 0.391(0.016)
θ* = (4, 4, 2, 2), SNR=5 0.718(0.137) 0.530(0.109) 0.392(0.072) 0.308(0.052) 0.241(0.039) 0.209(0.046)
θ* = (3, 3, 3, 3), SNR=5 0.718(0.095) 0.527(0.074) 0.414(0.060) 0.320(0.044) 0.266(0.032) 0.222(0.016)
Correlated Data
θ* = (4, 4, 2, 2), SNR=2 0.533(0.152) 0.423(0.095) 0.382(0.043) 0.371(0.021) 0.370(0.011) 0.377(0.014)
θ* = (3, 3, 3, 3), SNR=2 0.514(0.120) 0.429(0.070) 0.391(0.037) 0.374(0.019) 0.370(0.010) 0.369(0.007)
θ* = (4, 4, 2, 2), SNR=5 0.407(0.170) 0.266(0.104) 0.219(0.056) 0.198(0.034) 0.190(0.014) 0.188(0.006)
θ* = (3, 3, 3, 3), SNR=5 0.413(0.187) 0.280(0.085) 0.230(0.047) 0.205(0.026) 0.194(0.015) 0.187(0.006)

Figure A.2.

The estimated number of parameters associated with θ(t) versus iteration t in the first 100 iterations of BayTensor MCMC algorithm for some randomly-selected experiments under the correlated simulation setup. The true number of parameters is shown by the red line. (a): SNR = 2, θ*=(4, 4, 2, 2). (b): SNR = 2, θ*=(3, 3, 3, 3). (c): SNR = 5, θ*=(4, 4, 2, 2). (d): SNR = 5, θ*=(3, 3, 3, 3).

Contributor Information

Kunbo Wang, 3400 N. Charles Street, Baltimore, MD 21218.

Yanxun Xu, 3400 N. Charles Street, Baltimore, MD 21218.

REFERENCES

1. Andrieu C and Roberts GO (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725. MR2502648
2. Billio M, Casarin R, Kaufmann S, and Iacopini M (2018). Bayesian dynamic tensor regression. University Ca' Foscari of Venice, Dept. of Economics Research Paper Series No. 13.
3. Carvalho CM, Polson NG, and Scott JG (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480. MR2650751
4. Dosso S and Oldenburg D (1991). Magnetotelluric appraisal using simulated annealing. Geophysical Journal International, 106(2):379–385.
5. Gahrooei MR, Yan H, Paynabar K, and Shi J (2021). Multiple tensor-on-tensor regression: An approach for modeling processes with heterogeneous sources of data. Technometrics, 63(2):147–159. MR4251490
6. Geman S and Geman D (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741.
7. Green PJ (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732. MR1380810
8. Guhaniyogi R, Qamar S, and Dunson DB (2017). Bayesian tensor regression. The Journal of Machine Learning Research, 18(1):2733–2763. MR3714242
9. Guhaniyogi R and Spencer D (2021). Bayesian tensor response regression with an application to brain activation studies. Bayesian Analysis, 16(4):1221–1249. MR4381133
10. Guo W, Kotsia I, and Patras I (2012). Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816–827. MR2932176
11. Harshman RA (1970). Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis.
12. Hasan KM, Walimuni IS, Abid H, and Hahn KR (2011). A review of diffusion tensor magnetic resonance imaging computational methods and software tools. Computers in Biology and Medicine, 41(12):1062–1072.
13. Hassner T, Harel S, Paz E, and Enbar R (2015). Effective face frontalization in unconstrained images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4295–4304.
14. Hoff PD (2015). Multilinear tensor regression for longitudinal relational data. The Annals of Applied Statistics, 9(3):1169. MR3418719
15. Hoffman MD, Blei DM, Wang C, and Paisley J (2013). Stochastic variational inference. Journal of Machine Learning Research. MR3081926
16. Huang GB, Ramesh M, Berg T, and Learned-Miller E (2007). Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst.
17. Ishwaran H and Rao JS (2005). Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773. MR2163158
18. Kirkpatrick S, Gelatt CD Jr, and Vecchi MP (1983). Optimization by simulated annealing. Science, 220(4598):671–680. MR0702485
19. Kolda TG (2006). Multilinear operators for higher-order decompositions, volume 2. United States Department of Energy.
20. Kolda TG and Bader BW (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500. MR2535056
21. Kumar N, Berg A, Belhumeur PN, and Nayar S (2011). Describable visual attributes for face verification and image search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(10):1962–1977.
22. Lankao PR, Nychka D, and Tribbia JL (2008). Development and greenhouse gas emissions deviate from the 'modernization' theory and 'convergence' hypothesis. Climate Research, 38(1):17–29.
23. Lee J, Müller P, Sengupta S, Gulukota K, and Ji Y (2016). Bayesian feature allocation models for tumor heterogeneity. In Statistical Analysis for High-Dimensional Data, pages 211–232. Springer. MR3616270
24. Li L and Zhang X (2017). Parsimonious tensor response regression. Journal of the American Statistical Association, 112(519):1131–1146. MR3735365
25. Li X, Xu D, Zhou H, and Li L (2013). Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences, pages 1–26.
26. Lock EF (2018). Tensor-on-tensor regression. Journal of Computational and Graphical Statistics, pages 1–10. MR3863764
27. Miranda MF, Zhu H, Ibrahim JG, et al. (2018). TPRM: Tensor partition regression models with applications in imaging biomarker detection. The Annals of Applied Statistics, 12(3):1422–1450. MR3852683
28. O'Hagan A (1995). Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):99–118. MR1325379
29. Tucker LR (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311. MR0205395
30. Van der Aa N, Luo X, Giezeman G-J, Tan RT, and Veltkamp RC (2011). UMPM benchmark: A multi-person dataset with synchronized video and motion capture data for evaluation of articulated human motion and interaction. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 1264–1269. IEEE.
31. Vasilescu MAO and Terzopoulos D (2002). Multilinear analysis of image ensembles: TensorFaces. In European Conference on Computer Vision, pages 447–460. Springer.
32. Wang M, Fischer J, and Song YS (2019). Three-way clustering of multi-tissue multi-individual gene expression data using semi-nonnegative tensor decomposition. The Annals of Applied Statistics, 13(2):1103. MR3963564
33. Zhang L, Guindani M, and Vannucci M (2015). Bayesian models for functional magnetic resonance imaging data analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 7(1):21–41. MR3348719
34. Zhou H, Li L, and Zhu H (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552. MR3174640
