Significance
Many real-world data are inherently multidimensional; however, data are often processed as two-dimensional arrays (matrices), even when they are naturally represented in higher dimensions. The common practice of matricizing high-dimensional data is due to the ubiquity and strong theoretical foundations of matrix algebra. Various tensor-based approximations have been proposed to exploit high-dimensional correlations. While these high-dimensional techniques have been effective in many applications, none has been theoretically proven to outperform matricization generically. In this study, we propose matrix-mimetic, tensor-algebraic formulations to preserve and process data in their native, multidimensional format. For a general family of tensor algebras we prove the superiority of optimal truncated tensor representations over traditional matrix-based representations, with implications for other related tensorial frameworks.
Keywords: tensor, compression, multiway data, SVD, rank
Abstract
With the advent of machine learning and its overarching pervasiveness it is imperative to devise ways to represent large datasets efficiently while distilling intrinsic features necessary for subsequent analysis. The primary workhorse used in data dimensionality reduction and feature extraction has been the matrix singular value decomposition (SVD), which presupposes that data have been arranged in matrix format. A primary goal in this study is to show that high-dimensional datasets are more compressible when treated as tensors (i.e., multiway arrays) and compressed via tensor-SVDs under the tensor-tensor product construct and its generalizations. We begin by proving Eckart–Young optimality results for families of tensor-SVDs under two different truncation strategies. Since such optimality properties can be proven in both matrix and tensor-based algebras, a fundamental question arises: Does the tensor construct subsume the matrix construct in terms of representation efficiency? The answer is positive, as proven by showing that a tensor-tensor representation of an equal-dimensional spanning space can be superior to its matrix counterpart. We then use these optimality results to investigate how the compressed representation provided by the truncated tensor SVD is related, both theoretically and empirically, to its two closest tensor-based analogs, the truncated higher-order SVD and the truncated tensor-train SVD.
1. Introduction
A. Overview.
Following the discovery of the spectral decomposition by Lagrange in 1762, the singular value decomposition (SVD) was discovered independently by Beltrami and Jordan in 1873 and 1874, respectively. Further generalizations of the decomposition came independently by Sylvester in 1889 and by Autonne in 1915 (1). Perhaps the most notable theoretical result associated with the decomposition is due to Eckart and Young, who provided the first optimality proof of the decomposition in 1936 (2).
The use of SVD in data analysis is ubiquitous. From a statistical point of view, the singular vectors of a (mean-subtracted) data matrix represent the principal component directions, i.e., the directions in which there is maximum variance corresponding to the largest singular values. However, the SVD is historically well-motivated by spectral analysis of linear operators. In typical data analysis tasks, matrices are treated as rectangular arrays of data which may not correspond directly to either statistical interpretations or representations of linear transforms. Hence, the prevalence of the SVD in data analysis applications requires further investigation.
The utility of the SVD in the context of data analysis is due to two key factors: the aforementioned Eckart–Young theorem (also known as the Eckart–Young–Mirsky theorem) and the fact that the SVD (or in some cases a partial decomposition or high-fidelity approximation) can be efficiently computed relative to the matrix dimensions and/or desired rank of the partial decomposition (see, e.g., ref. 3 and references therein). Formally, the Eckart–Young theorem offers the solution to the problem of finding the best (in the Frobenius norm or 2-norm) rank-k approximation to a matrix of rank greater than k in terms of the first k terms of the SVD expansion of that matrix. The theorem implies, in some informal sense, that the majority of the informational content is captured by the dominant singular subspaces (i.e., the span of the singular vectors corresponding to the largest singular values), opening the door to compression, efficient representation, denoising, and so on. The SVD motivates modeling data as matrices, even in cases where a more natural model is a high-dimensional array (a tensor)—a process known as matricization.
Intuitively, there is an inherent disadvantage to matricizing data that can naturally be represented as a tensor. For example, a grayscale image is intrinsically represented as a matrix of numbers, but a video is natively represented as a tensor since there is an additional dimension: time. Preservation of the dimensional integrity of the data can be imperative for subsequent analysis, e.g., to account for high-dimensional correlations embedded in the structures in which the data are organized. Even so, in practice, there is often a surprising dichotomy between the data representation and the algebraic constructs employed for its processing. Thus, in the last century there have been efforts to define decompositions of tensorial structures, e.g., CANDECOMP/PARAFAC (CP) (4–6), Tucker (7), higher-order SVD (HOSVD) (8), and Tensor-Train (9). However, none of the aforementioned decompositions offers an Eckart–Young-like optimality result.
In this study, we attempt to close this gap by proving a general Eckart–Young optimality theorem for a truncated tensor representation. An Eckart–Young-like theorem must revolve around some tensor decomposition and a metric. We consider tensor decompositions built around the idea of the t-product presented in ref. 10 and the tensor-tensor product extensions in a similar vein in ref. 11. In refs. 10 and 11 the authors define tensor-tensor products between third-order tensors and a corresponding algebra in which notions of identity, orthogonality, and transpose are all well-defined. For the t-product in ref. 10, the authors show there exists an Eckart–Young type of optimality result in the Frobenius norm. Other popular tensor decompositions [e.g., Tucker, HOSVD, CP, Tensor-Train SVD (TT-SVD) (4, 7–9, 12)] do not lend themselves easily to a similar analysis, either because direct truncation does not lead to an optimal lower-term approximation or because they lack analogous definitions of orthogonality and energy preservation.
In this paper, for the generalization of the t-product, the ⋆M-product (11), we prove two Eckart–Young optimality results. Importantly, we prove the superiority of the corresponding compressed tensor representation to the matrix compressed representation of the same data. This provides a theoretical proof of the superiority of compressing multiway data with a tensor decomposition rather than treating the data as a matrix. Leveraging our algebraic framework, we show that the HOSVD can be interpreted as a special case of the class of tensor-tensor products we consider. As such, we are able to apply our tensor Eckart–Young results to explain why truncation of the HOSVD will, in general, not give an optimal compressed representation when compared to the proposed truncated tensor SVD approach. Moreover, we show how our Eckart–Young results can be applied when comparing the proposed tensor-tensor framework to a truncated TT-SVD approximation.
The ⋆M-product-based algebra is orientation-dependent, meaning that the ability to compress the data in a meaningful way depends on the orientation of the tensor (i.e., how the data are organized in the tensor). For example, a collection of grayscale images can be placed into a tensor as lateral slices, or the images can be rotated first and then placed as lateral slices, and so on. In many applications, there are good reasons to keep the second, lateral, dimension fixed (i.e., representing time, total number of samples, etc.), but in others there may be no obvious merit to preferentially treat one of the other two dimensions. Thus, we also consider variants of the t-product approach that offer optimal approximations to the tensorized data without handling one spatial orientation differently than another.
We primarily limit the discussion to third-order tensors, though in the final section we discuss how the ideas generalize to higher-order tensors as well. Indeed, higher-order representations of the data offer the potential for even greater compression, provided the data have higher-dimensional correlations to be exploited.
B. Paper Organization.
In Section 2, we give background notation and definitions. In Section 3, we define the decompositions, the t-SVDM and its variant the t-SVDMII, and prove an Eckart–Young-like theorem for each. In Section 5, we employ these theorems to prove the superior representation of the t-SVDM and t-SVDMII compared to the matrix SVD. To provide intuition, Sections 4 and 5 discuss when and why some data are more amenable to optimal representation through the use of the truncated t-SVDM than through the matrix SVD. In Section 6, we relate the ⋆M-framework to other related tensorial frameworks, including the HOSVD and the TT-SVD. Section 7 discusses multisided tensor compression. Section 8 contains a numerical study and highlights extensions to higher-order data. A summary and future work are the subjects of Section 9.
2. Background
For the purposes of this paper, a tensor is a multidimensional array, and the order of the tensor is defined as the number of dimensions of this array. As we are concerned with third-order tensors throughout most of the discussion, we limit our notation and definitions to the third-order case here and generalize to higher order in the final section.
A. Notation and Indexing
A third-order tensor A is an object in ℝ^{n1 × n2 × n3}. Its Frobenius norm, ‖A‖_F, is defined analogously to the matrix case: the square root of the sum of the squares of all its entries. We use MATLAB notation for entries: the (i, j, k) entry is the entry in row i and column j of the kth frontal slice, i.e., the kth matrix going "inward." The fibers of a tensor are defined by fixing two indices; of particular note are the tube fibers, obtained by fixing the first two indices. A slice of a third-order tensor is a two-dimensional array defined by fixing one index. Of particular note are the frontal slices (third index fixed), written A^(k) for the kth such slice, and the lateral slices (second index fixed), as depicted in Fig. 1.
Fig. 1.
Fibers and slices of a third-order tensor. Left to right: frontal slices, lateral slices, and tube fibers.
Some other notation that we use for convenience: the vec operator maps a matrix to a vector by stacking (unwrapping) its columns, and the reshape operator performs the inverse mapping.
We can also define invertible mappings between matrices and tensors by twisting and squeezing* (13): twisting turns a matrix into a lateral slice of a third-order tensor (an array whose second dimension equals 1), and squeezing reverses the operation.
The mode-1, mode-2, and mode-3 unfoldings of a tensor are the matrices whose columns are its mode-1 (column), mode-2 (row), and mode-3 (tube) fibers, respectively [1].
These unfoldings are useful in defining modewise tensor-matrix products (see ref. 12). For example, multiplying a tensor in mode 1 by a matrix is equivalent to computing the matrix-matrix product of that matrix with the mode-1 unfolding and then reshaping the result back into a tensor of the appropriate size.
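To make the unfolding and the modewise product concrete, the following is a minimal NumPy sketch (our own illustration, not the authors' code; the dimension names n1, n2, n3 are ours, and the column ordering of the unfolding may differ from the convention of ref. 12, which does not affect the modewise product). These helpers are reused in the sketches that follow.

```python
import numpy as np

def unfold(A, mode):
    """Mode-k unfolding: the mode-k fibers of A become the columns of a matrix."""
    return np.reshape(np.moveaxis(A, mode, 0), (A.shape[mode], -1))

def fold(X, mode, shape):
    """Inverse of unfold, returning a tensor of the given shape."""
    moved = [shape[mode]] + [s for k, s in enumerate(shape) if k != mode]
    return np.moveaxis(np.reshape(X, moved), 0, mode)

def mode_k_product(A, U, mode):
    """Mode-k tensor-matrix product: unfold, multiply, and refold."""
    shape = list(A.shape)
    shape[mode] = U.shape[0]
    return fold(U @ unfold(A, mode), mode, shape)

# Example: multiply a 4 x 5 x 3 tensor along mode 3 (Python axis 2) by a 3 x 3 matrix.
A = np.random.randn(4, 5, 3)
M = np.random.randn(3, 3)
B = mode_k_product(A, M, 2)   # B is again 4 x 5 x 3
```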
The HOSVD (8) can be expressed as
A = G ×1 U1 ×2 U2 ×3 U3,   [2]
where U1, U2, and U3 are the matrices containing the left singular vectors, corresponding to nonzero singular values, of the matrix SVDs of the mode-1, mode-2, and mode-3 unfoldings, respectively. For an n1 × n2 × n3 tensor, U1 is n1 × r1, U2 is n2 × r2, U3 is n3 × r3, and the core G is r1 × r2 × r3, where r1, r2, r3 are the ranks of the three respective unfoldings. Correspondingly, we say the tensor has HOSVD rank (r1, r2, r3). The core tensor is given by G = A ×1 U1^H ×2 U2^H ×3 U3^H. While the columns of the factor matrices are orthonormal, the core need not be diagonal, and its entries need not be nonnegative. In practice, compression is achieved by truncating to a smaller HOSVD rank, but unlike the matrix case such truncation does not lead to an optimal truncated approximation in a norm sense. It does, however, offer a quasi-optimal approximation, in the sense that its error is within a factor of √3 of the error of the best approximation of the same multilinear rank, which means that if there exists an optimal solution with a small error then a truncated HOSVD will yield a bounded approximation to it (14–16).
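For illustration, a truncated HOSVD can be computed from the SVDs of the unfoldings as in the following sketch (our own code under the assumptions above, reusing unfold and mode_k_product; the truncation ranks are supplied by the user):

```python
import numpy as np

def tr_hosvd(A, ranks):
    """Truncated HOSVD: factor matrices from the unfoldings' leading left singular
    vectors; core obtained by multiplying A by their conjugate transposes."""
    Us, core = [], A
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(A, mode), full_matrices=False)
        Us.append(U[:, :r])
        core = mode_k_product(core, Us[mode].conj().T, mode)
    return core, Us

def hosvd_reconstruct(core, Us):
    """Apply the factor matrices back to the core to form the approximation."""
    A_hat = core
    for mode, U in enumerate(Us):
        A_hat = mode_k_product(A_hat, U, mode)
    return A_hat
```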
If a, b, and c are vectors of lengths n1, n2, and n3, respectively, then their outer product (the tensor whose (i, j, k) entry is a_i b_j c_k) is called a rank-1 tensor. A CP (4–6) decomposition of a tensor is an expression of the tensor as a weighted sum of rank-1 outer products,
where the factor matrices have the vectors of the rank-1 terms as their columns, respectively, and the weights form a diagonal matrix. If the number of rank-1 terms is minimal, then that number is said to be the rank of the tensor; that is, it is the tensor rank. Although upper bounds on the tensor rank in terms of the dimensions are known, determining the rank of a tensor is an NP-hard problem (17). Also, the factor matrices need not have orthonormal columns, nor be full rank.†
However, if we are given a set of factor matrices with r columns each such that the weighted sum of the corresponding rank-1 terms reproduces the tensor, this is still a CP decomposition, even if we do not know whether r equals, or exceeds, the tensor rank. In other words, given a CP decomposition whose factor matrices have r columns, all we know is that the tensor it represents has tensor rank at most r.
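As a small illustration (our own sketch; the factor names A, B, C and the weight vector w are hypothetical), reconstructing the tensor represented by a CP decomposition whose factor matrices have r columns amounts to summing r rank-1 terms:

```python
import numpy as np

def cp_reconstruct(A, B, C, w):
    """X[i, j, k] = sum_t w[t] * A[i, t] * B[j, t] * C[k, t] (a sum of r rank-1 terms)."""
    return np.einsum('t,it,jt,kt->ijk', w, A, B, C)
```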
In the remaining subsections of this section, we provide background on a tensor-tensor product representation recently developed in the literature (10, 11, 13). The primary goal of this paper is to derive provably optimal (i.e., minimal error in the Frobenius norm) approximations to tensors under this matrix-mimetic framework. We compare these approximations to those derived by processing the same data in matrix form and show links between our approximations and the direct tensorial factorizations that offer a notion of orthogonality, the HOSVD and the TT-SVD.
B. A Family of Tensor-Tensor Products
The first closed multiplicative operation between a pair of third-order tensors of appropriate dimensions was given in ref. 10. That operation was named the t-product, and the resulting linear algebraic framework is described in refs. 10 and 13. In ref. 11, the authors followed the theme of ref. 10 by describing a new family of tensor-tensor products, called ⋆M-products, and gave the associated algebraic framework. As the presentation in ref. 11 includes the t-product as one example, we will introduce the class of tensor-tensor products of interest and at times throughout the paper highlight the t-product as a special case.
Let M be any invertible matrix whose dimension matches the third dimension of the tensors to which it is applied. We will use hat notation to denote a tensor in the transform domain specified by M: Â = A ×3 M,
where, since M is square, Â has the same dimensions as A. Importantly, A ×3 M corresponds to applying M to all tube fibers of A, although it is implemented by unfolding along the third mode, computing a matrix-matrix product, and reshaping the result. The "hat" notation should be understood in context relative to the M applied.
Algorithm 1:
⋆M-product algorithm (C = A ⋆M B) for invertible M, from ref. 11
| INPUT: A, B of compatible sizes, invertible matrix M |
| 1: Define Â = A ×3 M, B̂ = B ×3 M |
| 2: for each frontal slice i do |
| 3: Ĉ^(i) = Â^(i) B̂^(i) |
| 4: end for |
| 5: Define C = Ĉ ×3 M⁻¹ |
From ref. 11, we define the ⋆M-product between A and B through the steps in Algorithm 1. The inner for-loop (steps 2 through 4) is the facewise product, and it is embarrassingly parallelizable since the matrix-matrix products in the loop are independent. Steps 1 and 5 could in theory also be performed in independent blocks of matrix-matrix products (or matrix solves in step 5, to avoid computing M⁻¹ explicitly).
Choosing M to be the (unnormalized) DFT matrix, the steps of Algorithm 1 reduce to the t-product operation defined in ref. 10. Thus, the t-product is a special instance of the ⋆M-product family.
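The following is a minimal NumPy sketch of Algorithm 1 (our own illustration, reusing mode_k_product from above; the DFT line indicates the t-product special case, whose result is real only up to roundoff for real inputs):

```python
import numpy as np

def m_product(A, B, M):
    """Star-M product of A (n1 x n2 x n3) with B (n2 x m x n3) for invertible n3 x n3 M."""
    A_hat = mode_k_product(A, M, 2)                     # step 1: move to the transform domain
    B_hat = mode_k_product(B, M, 2)
    C_hat = np.einsum('ijk,jlk->ilk', A_hat, B_hat)     # steps 2-4: independent facewise products
    return mode_k_product(C_hat, np.linalg.inv(M), 2)   # step 5: return to the spatial domain

# t-product special case: M is the (unnormalized) DFT matrix.
n3 = 8
M_dft = np.fft.fft(np.eye(n3))
```

In practice, for the DFT or DCT one applies a fast transform along the tube fibers rather than forming M explicitly.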
C. Tensor Algebraic Framework
Now we can introduce the remaining parts of the framework. For the t-product, the linear algebraic framework is given in refs. 10 and 13. However, as the t-product is a special case of the ⋆M-product, as noted above, we will elucidate here the linear algebraic framework as described in ref. 11, with pointers to refs. 10 and 13 so that we cover all cases.
There exists a notion of conjugate transposition and an identity element:
Definition 2.1 (Conjugate Transpose): Given a tensor A, its conjugate transpose A^H under ⋆M is defined facewise in the transform domain: the ith frontal slice of the transform of A^H is the conjugate transpose of Â^(i).
As noted in ref. 11, this definition ensures the multiplication reversal property for the Hermitian transpose under ⋆M: (A ⋆M B)^H = B^H ⋆M A^H. This definition is consistent with the t-product transpose given in ref. 10 when M is the DFT matrix.
Definition 2.2 (Identity Tensor; Unit-Normalized Slices): The identity tensor I under ⋆M satisfies I ⋆M A = A and A ⋆M I = A for tensors A of compatible dimensions. For invertible M, this tensor always exists: each frontal slice of Î is an identity matrix. If Q is a lateral slice satisfying Q^H ⋆M Q = e, where e is the (1, 1, :) tube fiber of the identity tensor under M, we say that Q is a unit-normalized tensor slice.
The concepts of unitary and orthogonal tensors are now straightforward:
Definition 2.3 (Unitary/Orthogonality): Two lateral slices X and Y are called ⋆M-orthogonal if X^H ⋆M Y is the zero tube fiber. A tensor Q is called ⋆M-unitary‡ (⋆M-orthogonal, in the real case) if Q^H ⋆M Q = Q ⋆M Q^H = I,
where the conjugate transpose is replaced by the transpose for real-valued tensors. Note that the identity tensor I must be the one defined under the same M, and that the tube fibers on the diagonal of Q^H ⋆M Q correspond to unit-normalized slices while the off-diagonal tube fibers come from products of mutually ⋆M-orthogonal slices.
3. Tensor ⋆M-SVDs and Optimal Truncated Representation
As noted in the previous section and described in more detail in ref. 11, any invertible matrix M can be used to define a valid tensor-tensor product. However, in this paper we will focus on a specific class of matrices M for which the Frobenius norm is invariant under ⋆M-unitary transformations, as we discuss in the following. We then use this feature to develop Eckart–Young theory for our tensor decompositions later in this section.
A. Unitary Invariance
Unitary invariance of the Frobenius norm of real-valued orthogonal tensors under the t-product was shown in ref. 10. Here, we prove a more general result.
Theorem 3.1. With the choice of M = cW for unitary (orthogonal) W and nonzero c, assume Q is ⋆M-unitary (⋆M-orthogonal). Then ‖Q ⋆M A‖_F = ‖A‖_F for any tensor A of compatible dimensions.
Likewise, if the product is defined, ‖A ⋆M Q‖_F = ‖A‖_F.
Proof.
Suppose M = cW, where W is unitary. Then Â = A ×3 M satisfies ‖Â‖_F = |c| ‖A‖_F. Next,
‖A‖_F² = (1/|c|²) ‖Â‖_F² = (1/|c|²) Σ_i ‖Â^(i)‖_F²,   [3]
where Â^(i) denotes the ith frontal slice of Â. Let B = Q ⋆M A. Using [3],
‖B‖_F² = (1/|c|²) Σ_i ‖Q̂^(i) Â^(i)‖_F² = (1/|c|²) Σ_i ‖Â^(i)‖_F² = ‖A‖_F²,
as each Q̂^(i) is unitary (has orthonormal columns). The other direction is similar.
We now have the framework we need to describe tensor SVDs induced by a fixed ⋆M operator. These were defined, and existence was proven, in ref. 10 for the t-product over real-valued tensors and in ref. 11 for ⋆M more generally.
Definition 3.2 (10, 11): Let A be an n1 × n2 × n3 tensor. The (full) tensor SVD (t-SVDM) of A is
A = U ⋆M S ⋆M V^H = Σ_{i=1}^{min(n1, n2)} U(:, i, :) ⋆M S(i, i, :) ⋆M V(:, i, :)^H,   [4]
where U and V are ⋆M-unitary, S is a tensor whose frontal slices are diagonal (such a tensor is called f-diagonal), and the number of nonzero singular tubes S(i, i, :) determines the t-rank defined below. When M is the DFT matrix, this reduces to the t-product-based t-SVD introduced in ref. 10.
Note that Definition 3.2 implies each frontal slice of Ŝ has rank at most min(n1, n2). Clearly, when n1 ≠ n2, from the second equality in [4] we can obtain a reduced t-SVDM by keeping only min(n1, n2) lateral slices of U and of V and the corresponding min(n1, n2) × min(n1, n2) × n3 portion of S, as opposed to the full representation. An illustration of the decomposition is presented in Fig. 2, Top.
Fig. 2.
(Top) Illustration of the tensor SVD. (Bottom) Example showing different truncations across the different SVDs of the faces, based on Algorithm 3.
Algorithm 2:
Full t-SVDM from ref. 11
| INPUT: A, invertible matrix M |
| 1: Â = A ×3 M |
| 2: for each frontal slice i do |
| 3: [Û^(i), Ŝ^(i), V̂^(i)] = svd(Â^(i)) |
| end for |
| 4: U = Û ×3 M⁻¹, S = Ŝ ×3 M⁻¹, V = V̂ ×3 M⁻¹ |
Independent of the choice of M, the components of the t-SVDM are computed in transform space. We describe the full t-SVDM in Algorithm 2. As noted, the t-SVDM above was proposed already in ref. 11. However, when we restrict the class of M to nonzero multiples of unitary or orthogonal matrices, we can derive an Eckart–Young theorem for tensors in general form. To do so, we first give a new corollary for the restricted class of M considered.
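A NumPy sketch of Algorithm 2 (our own illustration under the same assumptions as the earlier sketches; the factors are returned in the spatial domain, as in the algorithm):

```python
import numpy as np

def t_svdm(A, M):
    """Full (reduced) t-SVDM of A under the star-M product via facewise SVDs."""
    n1, n2, n3 = A.shape
    A_hat = mode_k_product(A, M, 2)
    p = min(n1, n2)
    U_hat = np.zeros((n1, p, n3), dtype=A_hat.dtype)
    S_hat = np.zeros((p,  p, n3), dtype=A_hat.dtype)
    V_hat = np.zeros((n2, p, n3), dtype=A_hat.dtype)
    for i in range(n3):
        u, s, vh = np.linalg.svd(A_hat[:, :, i], full_matrices=False)
        U_hat[:, :, i], S_hat[:, :, i], V_hat[:, :, i] = u, np.diag(s), vh.conj().T
    Minv = np.linalg.inv(M)
    return (mode_k_product(U_hat, Minv, 2),   # U
            mode_k_product(S_hat, Minv, 2),   # S (f-diagonal)
            mode_k_product(V_hat, Minv, 2))   # V
```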
Corollary 3.3. Assume M = cW, where c ≠ 0 and W is unitary. Then, given the t-SVDM of A over ⋆M defined in Definition 3.2, ‖A‖_F² = ‖S‖_F² = Σ_i ‖S(i, i, :)‖_F².
Moreover, ‖S(1, 1, :)‖_F ≥ ‖S(2, 2, :)‖_F ≥ ⋯ ≥ 0.
Proof: The proof of the first equality follows from Theorem 3.1, and the second from the definition of the Frobenius norm. To prove the ordering property, use the shorthand s_i for the singular tube fiber S(i, i, :), and note using [3] that ‖s_i‖_F² = (1/|c|²) Σ_k (σ_i^(k))²,
where σ_i^(k) denotes the ith largest singular value of the kth frontal face of Â. However, since σ_i^(k) ≥ σ_{i+1}^(k) for every face k, the result follows.
This observation gives rise to a new definition.
Definition 3.4: We refer to the number of nonzero singular tubes in the t-SVDM of Definition 3.2 (see the second equality in [4]) as the t-rank.§
We can also extend the idea of multirank in ref. 13 to the general case:
Definition 3.5: The multirank of A under ⋆M is the vector ρ whose ith entry ρ_i denotes the rank of the ith frontal slice of Â; that is, ρ_i = rank(Â^(i)).
Notice that a tensor with multirank ρ must have t-rank equal to max_i ρ_i.
Definition 3.6: The implicit rank of A under ⋆M is the sum of the entries of the multirank, Σ_i ρ_i.
Note that S is uniquely defined in Definition 3.2; thus, the t-rank and multirank of a tensor are also unique.
B. Eckart–Young Theorem for Tensors
The key aspect that has made the t-product-based t-SVD so instrumental in many applications (see, for example, refs. 18–21) is the tensor Eckart–Young theorem proven in ref. 10 for real-valued tensors under the t-product. In loose terms, truncating the t-product-based t-SVDM gives an optimal low-t-rank approximation in the Frobenius norm. An Eckart–Young theorem for the ⋆M operator was not provided in ref. 11. We give a proof below for the special case we have been considering, in which M is a nonzero multiple of a unitary matrix.
Theorem 3.7. Define A_k = Σ_{i=1}^k U(:, i, :) ⋆M S(i, i, :) ⋆M V(:, i, :)^H, where M is a nonzero multiple of a unitary matrix. Then A_k is the best Frobenius-norm approximation¶ over the set of all tensors of t-rank k under ⋆M of the same dimensions as A. The squared error is ‖A − A_k‖_F² = Σ_{i=k+1}^r ‖S(i, i, :)‖_F², where r is the t-rank of A.
Proof: The squared error result follows easily from the results in the previous section. Now let . By definition, is a rank-k outer product . The best rank-k approximation to is , so , and the result follows.
In ref. 18, the authors used the Eckart–Young result for the t-product for compression of facial data and a PCA-like approach to recognition. They also gained additional compression in an algorithm they called the t-SVDII (only for the t-product on real-valued tensors), which, although not described as such in that paper, effectively reduces the multirank for further compression. Here, we provide the theoretical justification for the t-SVDII approach in ref. 18 while simultaneously extending the result to the ⋆M-product family restricted to M being a nonzero multiple of a unitary matrix.
Theorem 3.8. Given the t-SVDM of A under ⋆M, define A_ρ to be the approximation having multirank ρ: that is, in the transform domain, the ith frontal slice of A_ρ is the best rank-ρ_i approximation to Â^(i).
Then A_ρ is the best multirank-ρ approximation to A in the Frobenius norm, and the squared error is (1/|c|²) times the sum, over the frontal faces of Â, of the squares of the singular values discarded from each face,
where the discarded singular values of face i are those with index greater than ρ_i, up to the rank of that face.
Proof: Follows similarly to the above, and is omitted.
To use this in practice, we generalize the idea of the t-SVDII in ref. 18 to the ⋆M-product for M a multiple of a unitary matrix. First, we need a suitable method to choose ρ. By Corollary 3.3 and [3], the squared Frobenius norm of A is an appropriately scaled sum of the squares of all the singular values of all the frontal faces of Â,
where face i contributes a number of singular values equal to its rank. Let us order all of these squared singular values in descending order and place them in a single vector. Given an energy parameter γ, we find the first index J such that the sum of the first J entries is at least γ times the total. Keeping J total terms thus implies an approximation capturing at least a fraction γ of the energy. Let τ be the Jth entry of the sorted vector; this is the squared value of the smallest singular value we should include in the approximation. We then run back through the faces, and for face i we keep only the singular triplets whose squared singular value is at least τ. In other words, the squared relative error (RE) of our approximation is one minus the fraction of energy retained, which by construction is at most 1 − γ.
The pseudo-code is given in Algorithm 3, and a cartoon illustration of the output is given in Fig. 2.
Algorithm 3:
Return the t-SVDMII of A under ⋆M to meet energy constraint γ
| INPUT: A; M, a multiple of a unitary matrix; desired energy γ. |
| 1: Compute the t-SVDM of A under ⋆M. |
| 2: Concatenate the squared singular values of all frontal faces of Â into a vector v. |
| 3: Sort v in descending order. |
| 4: Let c be the vector of cumulative sums of v: i.e., c_j = v_1 + ⋯ + v_j. |
| 5: Find the first index J such that c_J ≥ γ Σ_j v_j. |
| 6: Define τ = v_J. |
| 7: for each frontal slice i do |
| 8: Set ρ_i as the number of squared singular values of Â^(i) greater than or equal to τ. |
| 9: Keep only the first ρ_i singular values of Ŝ^(i) and the corresponding first ρ_i columns of Û^(i) and V̂^(i). |
| end for |
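A NumPy sketch of Algorithm 3 (our own illustration; gamma plays the role of the energy parameter γ, and the factors are kept in the transform domain, as recommended in the text):

```python
import numpy as np

def t_svdmii(A, M, gamma):
    """t-SVDMII: keep, per frontal face of A_hat, the singular triplets whose squared
    singular value meets the global energy threshold determined by gamma."""
    n1, n2, n3 = A.shape
    A_hat = mode_k_product(A, M, 2)
    svds = [np.linalg.svd(A_hat[:, :, i], full_matrices=False) for i in range(n3)]
    v = np.sort(np.concatenate([s**2 for (_, s, _) in svds]))[::-1]   # squared values, descending
    c = np.cumsum(v)
    J = min(int(np.searchsorted(c, gamma * c[-1])), len(v) - 1)       # first index meeting the energy
    tau = v[J]                                                        # smallest squared value to keep
    rho, factors = [], []
    for (u, s, vh) in svds:
        r = int(np.sum(s**2 >= tau))                                  # per-face truncation rho_i
        rho.append(r)
        factors.append((u[:, :r], s[:r], vh[:r, :]))
    return rho, factors
```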
In Section 5, we compare our Eckart–Young results for tensors with the corresponding matrix approximations obtained by using the matrix-based Eckart–Young theorem, but first we need a few results that show what structure these tensor approximations inherit from M.
4. Latent Structure
To understand why the proposed tensor decompositions are efficient at compression and feature extraction, we investigate the latent structure induced by the algebra in which we operate. We shall also capitalize on this structural analysis in the next section's proofs, which show how the proposed t-SVDM and t-SVDMII decompositions can be used to devise superior approximations compared with their matrix counterparts.
If , from Algorithm 1 we have
where indicates pointwise scalar products on each face. Using the definitions, this expression is tantamount to
| [5] |
Note that is mathematically equivalent to forming the tube fiber , and applied to a tube fiber works analogously to applied to that tube fiber’s column vector equivalent. Further, the transpose is a real-valued transpose, stemming from the definition of the mode-3 product.
Let be any element in and consider computing the product . In ref. 11 it was shown that the tube fiber entry in is effectively the product of the tubes . From [5] we have
| [6] |
The matrix in the parentheses on the right is an element of the space of all matrices , where is diagonal. This brings us to a major result.
Theorem 4.1. Given . Then, using
| [7] |
Thus, each lateral slice of the product is a weighted combination of "basis" matrices given by the lateral slices of the left factor, but the weights, instead of being scalars, are matrices from the matrix algebra induced by the choice of M. For M the DFT matrix, the matrix algebra is the algebra of circulants.
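A small numerical check of the circulant statement (our own sketch, reusing m_product from above): under the t-product, multiplying two tubes is circular convolution, i.e., multiplication by the circulant matrix built from one of them.

```python
import numpy as np

n = 5
a, b = np.random.randn(n), np.random.randn(n)
M_dft = np.fft.fft(np.eye(n))                                  # unnormalized DFT matrix

tube = m_product(a.reshape(1, 1, n), b.reshape(1, 1, n), M_dft).ravel().real
C = np.column_stack([np.roll(a, j) for j in range(n)])         # circulant matrix with first column a
print(np.allclose(tube, C @ b))                                # True (up to roundoff)
```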
5. Tensors and Optimal Approximations
In ref. 18, the claim was made that a t-SVD truncated to k terms could be superior to a matrix-SVD-based compression to k terms. Here, we offer a formal proof and then discuss the relative meaning of k. Then, in the next section, we discuss what can be done to obtain further compression.
A. Theory: T-rank vs. Matrix Rank
Let us assume that our data are a collection of p matrices, each of size n1 × n3. For example, each matrix might be a grayscale image, or it might be the values of a function discretized on a two-dimensional uniform grid. Vectorizing a sample gives a column of length n1·n3.
We put the p samples, from left to right, into a matrix X (as vectorized columns) and into a tensor T (as lateral slices).
Thus, X and T represent the same data, just in different formats. It is first instructive to consider in what ways the t-rank of T and the matrix rank of X are related. Then, we will move on to relating the optimal t-rank-k approximation of T to the optimal rank-k approximation of X.
Theorem 5.1. The t-rank, ℓ, of T is less than or equal to the rank, r, of X. Additionally, since ℓ cannot exceed min(n1, p), if r > min(n1, p), then the inequality is strict.
Proof: The problem’s dimensions necessitate and . Let be a rank-r factorization of such that is and is . From the fact , we can show the -sized frontal face of satisfies
| [8] |
Clearly, the rank of this frontal slice is bounded above by since this is the maximal rank of the matrix in parentheses. Then, the singular values of the matrix satisfy , where . As , for a particular value will be a nonzero tube fiber iff for any of the at least one is nonzero. There can be at most nonzero tube fibers, so .
Note that the proof was independent of the choice of invertible M. In particular, it holds for M = I. This means that the act of "folding" the data matrix into a tensor may provide a reduced-rank representation (a rank-r matrix goes to a tensor of t-rank at most r). A nonidentity choice of M, though, may reveal a t-rank strictly smaller than r. To make the idea concrete, let us consider an example.
Example 5.2: Let invertible, and , with be a set of independent vectors. Define such that . The t-rank is 1.
On the other hand, with the circulant downshift matrix,
The rank of the therefore is . If and have orthonormal columns, we can show for any .
B. Theory: Comparison of Optimal Approximations
In this subsection we compare the quality of approximations obtained by truncating the matrix SVD of the data matrix vs. truncating the t-SVDM of the same data arranged as a tensor. We again assume the setup above, with X having rank r.
Let be the matrix SVD of , and denote its best rank-, approximation according to
| [9] |
for where . Finally, we need the following matrix version of [9] which we reference in the proof:
| [10] |
Theorem 5.3. Given X and T as defined above, with X having rank r and T having t-rank ℓ, let X_k denote the best rank-k matrix approximation to X in the Frobenius norm, where k ≤ r. Let T_k denote the best t-rank-k tensor approximation to T in the Frobenius norm under ⋆M, where M is a multiple of a unitary matrix. Then ‖T − T_k‖_F ≤ ‖X − X_k‖_F.
Proof: Consider [10]. The multiplication by the scalar in the sum is equivalent to multiplication from the right by . However, since for unitary , we have , where is the vector of all ones. Define the tube fiber from the matrix-vector product oriented into the third dimension. Then, . Now we observe that [10] can be equivalently expressed as
| [11] |
These can be combined into a tensor equivalent
Since , the t-rank of is . The t-rank of must also not be smaller than , by Theorem 5.1.
Thus, given the definition of as the minimizer over all such -term “outer-products” under , it follows that
Here is an example showing strict inequality is possible. Additional supporting examples are in the numerical results.
Example 5.4: Given the Haar wavelet matrix, and let
It is easily shown that . Setting , we observe
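The inequality in Theorem 5.3 is also easy to check numerically. The following synthetic sketch (our own illustration: random data, M the unnormalized DFT matrix, helpers from the earlier sketches) compares the two truncation errors for the same k.

```python
import numpy as np

def t_svdm_trunc_error(T, M, k):
    """Frobenius error of the truncated (t-rank-k) t-SVDM of T; real data, DFT-based M."""
    T_hat = mode_k_product(T, M, 2)
    Tk_hat = np.zeros_like(T_hat)
    for i in range(T.shape[2]):
        u, s, vh = np.linalg.svd(T_hat[:, :, i], full_matrices=False)
        Tk_hat[:, :, i] = (u[:, :k] * s[:k]) @ vh[:k, :]
    Tk = mode_k_product(Tk_hat, np.linalg.inv(M), 2).real      # real up to roundoff here
    return np.linalg.norm(T - Tk)

n1, p, n3, k = 30, 25, 16, 5
T = np.random.randn(n1, p, n3)                                 # p samples, each n1 x n3, as lateral slices
X = np.column_stack([T[:, j, :].flatten(order='F') for j in range(p)])   # matricized data

sig = np.linalg.svd(X, compute_uv=False)
matrix_err = np.sqrt(np.sum(sig[k:]**2))                       # Eckart-Young: best rank-k error

M_dft = np.fft.fft(np.eye(n3))                                 # a nonzero multiple of a unitary matrix
print(t_svdm_trunc_error(T, M_dft, k) <= matrix_err + 1e-8)    # True, per Theorem 5.3
```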
In the next subsection we discuss the level of approximation provided by the output of Algorithm 3 by relating it back to the truncated t-SVDM and also to truncated matrix SVD. First, we need a way to relate storage costs.
Theorem 5.5. Let T_k be the t-SVDM t-rank-k approximation to T, and suppose its implicit rank is s. There exists a γ such that the t-SVDMII approximation obtained for this γ in Algorithm 3 has implicit rank less than or equal to s and Frobenius-norm error no larger than that of T_k.
Proof: From Theorem 5.3 that . The proof is by construction using as the starting point; that is, assume for each frontal slice. Let be the largest singular value we truncated; that is,
Let be the union of all of the singular values that were kept in the approximation and smaller than . If is empty, then and we are done.
Otherwise, was larger than at least one of the singular values we kept. Let be the index of the frontal slice containing with corresponding rank . Let be the smallest element contained in and, by definition, . Let be the index of the frontal slice containing with corresponding rank . Note that must be the smallest singular value kept from frontal slice . Set and . Then, the error has changed by an amount .
In practice, we can decrease the implicit rank further. For convenience, assume the elements of are labeled in increasing order . Let for less than or equal to the cardinality of . Choose the largest value of such that . Again, set and reduce the values of that correspond to . Then, the error has changed by an amount and the implicit rank has decreased by .
C. Storage Comparisons
Let us suppose that k is the truncation parameter for the tensor approximation and k′ is the truncation parameter for the matrix approximation. Table 1 gives a comparison of storage for the methods we have discussed so far. Note that for the t-SVDMII it is necessary to work only in the transform domain, as moving back to the spatial domain would cause fill and unnecessary storage.
Table 1.
Comparison of storage costs for approximations of a third-order tensor
| Basis storage | Coefficients storage | Total implicit storage | |||
| + st[] | |||||
| + st[] |
Top to bottom: a k′-term truncated matrix SVD expansion, a k-term truncated t-SVDM expansion, and a multirank-ρ t-SVDMII. Recall that for the t-SVDMII we store terms in the transform domain, and the implicit rank is the sum of the entries of ρ. The notation st[M] refers to the implicit storage of M, described below.
In practice, we can omit the storage costs of M when using fast transform techniques, such as the DCT, DFT, or a discrete wavelet transform. In the case where M cannot be applied by a fast transform, the storage for M for the t-SVDM is bounded by the number of nonzeros in M. As we show in Section 6D, if the multirank has many zeros (that is, many faces in the transform domain do not contribute singular values above the threshold), we can reduce this storage further.
Discussion
If we assume we do not need to store the transformation matrix , then if , Theorem 5.3 says the approximation error of the t-SVD is at least as good as the corresponding matrix approximation. In applications where we only need to store the basis terms, e.g., to do projections, the basis for the tensor approximation is better in a relative error sense than the basis for the matrix case, for the same storage. However, unless , if we need to store both the basis and the coefficients, we will need more storage for the tensor case if we need to take . Fortunately, in practice for . Indeed, we already showed an example (see Example 5.2) where the error is zero for , but had to be much larger to achieve exact approximation. If , then the total implicit storage of the tensor approximation of terms is less than the total storage for the matrix case of terms.
Compared to the matrix SVD, the t-SVDMII approach can provide compression with an approximation level at least as good, or better, as indicated by the theorem. Of course, M should be an appropriate one given the latent structure in the data. The t-SVDMII approach allows us to account for the more "important" features (e.g., low frequencies and multidimensional correlations) and therefore retain more terms for the corresponding frontal faces, because those features contribute more to the global approximation. Truncation of the t-SVDM by a single truncation index, on the other hand, effectively treats all features equally and always truncates each face to k terms, which, depending on the choice of M, may not be as good. This is demonstrated in Section 8.
6. Comparison to Other Tensor Decompositions
Now we compare the ⋆M-based decompositions to the other types of tensor representations described in Section 1.
A. Comparison to truncated HOSVD
In this section, we wish to show how the truncated HOSVD (tr-HOSVD) can be expressed using a ⋆M-product. Then, we can compare our truncated results to the tr-HOSVD. Truncation strategies other than the one discussed below could be considered, e.g., ref. 22, but are outside the scope of this study.
The tr-HOSVD is formed by truncating the factor matrices to k1, k2, and k3 columns, respectively, and forming the core tensor accordingly,
where (k1, k2, k3) denotes the truncated HOSVD rank. The tr-HOSVD approximation is then obtained by applying the truncated factor matrices to this truncated core.
We now prove the following theorem, which shows the tr-HOSVD can be represented under ⋆M when M is built from the mode-3 factor matrix.
Theorem 6.1. Define the matrix as (since is unitary, it follows that ), and define and tensors in the transform space according to
Then and it is easy to show that , are unitary tensors. Define as the tensor with identity matrices on faces 1 to and 0 matrices from faces to .
Let . Then
Proof: First consider
| [12] |
using properties of modewise products (see [12]). From the definitions of the modewise product the face of as defined via [12] is . However, this means that we can equivalently represent as .
Now implies . However, since for , we only need to zero-out the last frontal slices of to get to , which we do by taking the -product with on the right.
Note that in our theorem we assume M is square, but the transformation is effectively truncated by the postmultiplication in the statement of the theorem, which allows for the ⋆M-product representation of the tr-HOSVD. Thus, we can now compare the theoretical results for the tr-HOSVD to those for our truncated methods.
Theorem 6.2. Given the tr-HOSVD approximation , for ,
with equality only if .
Proof: Note that the t-ranks of and are and , respectively. Since is , its t-rank cannot exceed . As such, we know can be written as a sum of outer-products of tensors under , and the result follows given the optimality of .
Corollary 6.3. Given the tr-HOSVD approximation , for there exists such that returned by Algorithm 3 has implicit rank less than or equal to and
Note this is independent of the choice of : The best tr-HOSVD approximation will occur when .
While this has theoretical value, it raises the question of whether or not M needs to be stored explicitly. When M is a matrix that can be applied quickly without explicit storage, such as a discrete cosine transform, this is not a consideration. We will say more about this at the end of this section.
B. Comparison to TT-SVD
The TT-SVD was introduced in ref. 9. In recent years, the (truncated) TT-SVD has been used successfully for compressed representations of matrix operators in high-dimensional spaces and for data compression.
For a third-order tensor, the first step of the TT-SVD algorithm is to perform a truncated matrix SVD of one unfolding; thus, the results depend upon the choice of unfolding. In order to compare with our method, we shall choose the mode-2 unfolding.# Since the letters U, S, and V have been used already, we write the SVD of the mode-2 unfolding with its own factors. The TT-SVD algorithm proceeds as follows for third-order tensors:
1. Compute the SVD of the chosen unfolding and truncate it to r1 terms; keep the r1 left singular vectors (typically reshaped as a tensor). Multiply the r1 retained singular values into the corresponding right singular vectors, and note that the result has r1 rows.
2. Reshape that product into a matrix whose row dimension combines r1 with the next tensor dimension. Compute its SVD and truncate it to r2 terms. The left factor is folded into a third-order core tensor, and the remaining product of singular values and right singular vectors is stored as the final factor.
The truncations r1 and r2 are chosen based on a user-defined threshold ε such that the relative Frobenius-norm error of the TT-SVD approximation is at most ε. The storage cost of this implicit tensor approximation is the total number of entries in the three factors.
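A minimal NumPy sketch of the two-step TT-SVD for a third-order tensor (our own illustration with fixed truncation ranks r1 and r2 rather than an error threshold; permute the tensor beforehand if a different first unfolding is desired, as in the footnote):

```python
import numpy as np

def tt_svd3(A, r1, r2):
    """Truncated TT-SVD of a third-order tensor via two sequential truncated matrix SVDs."""
    n1, n2, n3 = A.shape
    U1, s1, V1h = np.linalg.svd(A.reshape(n1, n2 * n3), full_matrices=False)
    G1 = U1[:, :r1]                                          # first factor, n1 x r1
    W = (np.diag(s1[:r1]) @ V1h[:r1, :]).reshape(r1 * n2, n3)
    U2, s2, V2h = np.linalg.svd(W, full_matrices=False)
    G2 = U2[:, :r2].reshape(r1, n2, r2)                      # middle core tensor
    G3 = np.diag(s2[:r2]) @ V2h[:r2, :]                      # last factor, r2 x n3
    return G1, G2, G3                                        # storage: n1*r1 + r1*n2*r2 + r2*n3 numbers

def tt_reconstruct(G1, G2, G3):
    return np.einsum('ia,ajb,bk->ijk', G1, G2, G3)
```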
Theorem 6.4. Let denote the truncated TT-SVD approximation, where correspond to the truncation indices that satisfy the user input threshold . Then,
with strict inequality if is less than the rank of . Furthermore, there exists such that
Proof: Let and let be the rank of and be the rank of . The steps in the algorithm imply that . Thus,
by employing unitary-invariance of the Frobenius norm.
Applying Theorem 5.5,
The inequality is strict if ; it is also strict independent of if which, as noted earlier, is achievable and is almost always the case in practice.
C. Approximation in CP Form
The purpose of the interpretation of our decomposition in CP form is with an eye toward the implicit compressed representation discussed in the next subsection. Consider the formation of the t-SVDMII approximation in Algorithm 3. In the final step of the loop, we keep ρ_i terms of the matrix SVD of the ith frontal slice of Â. So
| [13] |
and is the column of the identity matrix. However, . Thus,
Next, concatenate vectors to form three matrices of columns as follows. Let similarly for . Let
| [14] |
where , so that also has columns, with repeats as dictated by the . Let contain the . Then
Note that if M is an orthogonal (unitary) matrix, then the elements of the rightmost matrix are columns of the transpose (conjugate transpose, for complex M) of M and hence have unit norm. The columns of the other factor matrices also have unit norm.
D. Discussion: Storage of the Transformation Matrix
From [14], only the columns of M that are actually used in the kept terms need to be stored (along with a storage-negligible vector of integer pointers), so the storage for M drops accordingly; see Table 2. In the numerical results on hyperspectral data, this fact, in combination with Corollary 6.3, allows the t-SVDMII to give results superior to the tr-HOSVD in both error and storage.
Table 2.
Summary of st[M], the implicit storage of the transform matrix M
| | t-SVDM | t-SVDMII |
| Fast transform | 0 | 0 |
| Unstructured | nnz() | (nnz, nnz()) |
7. Multisided Tensor Compression
Consider the mapping induced by matrix-transposing (without conjugation) each of the lateral slices.∥ We use a superscript to denote such a permuted tensor.
In this section, we define techniques for compression using both orientations of the lateral slices in order to ensure a more balanced approach to the compression of the data.
A. Optimal Convex Combinations
Given and , compress each and form
Observe that , where as before and denotes a stride permutation matrix. Since is orthogonal, the singular values and right singular vectors of are the same as those of , and the left singular vectors are row permuted by . For a truncation parameter ,
It follows that for
Similarly for the optimal t-SVDMII approximations, , where are the multiindices for each orientation, respectively, which may have been determined with different energy levels. From Theorem 3.8,
where is the rank of under , is the rank of under and are for under as well.
B. Sequential Compression
Sequential compression is also possible, as described in ref. 23. A second tensor SVD is applied to , making each step locally optimal. Due to space constraints, we will not elaborate on this further here.
8. Numerical Examples
In the following discussion, the compression ratio (CR) is defined as the number of floating point numbers needed to store the uncompressed data divided by the number of floating point numbers needed to store the compressed representation (in its implicit form); the larger the ratio, the better the compression. The relative error (RE) is the Frobenius norm of the difference between the original data and the approximation divided by the Frobenius norm of the original data.
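For concreteness, these two quantities can be computed with the following helpers (our own sketch, not the authors' code; the count of stored numbers must be supplied by the respective method):

```python
import numpy as np

def compression_ratio(data_shape, stored_numbers):
    """Uncompressed entry count divided by the number of stored floating point numbers."""
    return np.prod(data_shape) / stored_numbers

def relative_error(A, A_approx):
    """Frobenius norm of the difference divided by the Frobenius norm of the original."""
    return np.linalg.norm(A - A_approx) / np.linalg.norm(A)
```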
A. Compression of Yale B Data
In this section, we show the power of compression for the t-SVDMII approach with an appropriate choice of M, that is, one that exploits structure inherent in the data. We create a third-order tensor from the Extended Yale B face database (24) by putting the training images in as lateral slices of the tensor. Then, we apply Algorithm 3, varying γ, for four different choices of M: a random orthogonal matrix, an orthogonal wavelet matrix, the unnormalized DCT matrix, and a data-driven M formed by taking the transpose of the matrix of left singular vectors of the mode-3 unfolding. We have chosen to use a random orthogonal matrix in this experiment to show that the compression power is tied to structure that is exposed through the choice of M, so we do not expect, nor do we observe, value in choosing M to be random. In Fig. 3, we plot the CR against the RE in the approximation. We observe that for RE on the order of 10 to 15%, the margins in compression achieved by the t-SVDMII for the DCT, the wavelet transform, and the data-driven transform vs. treating the data in matrix form, or choosing a transform that, like the matrix case, does not exploit structure in the data, are quite large. To evaluate only the compressibility associated with the transformation, we do not count the storage of M.
Fig. 3.
Illustration of the compressive power of the t-SVDMII in Algorithm 3 for appropriate choices of M. Far more compression is achieved using the DCT or the wavelet transform to define M, or defining M in a data-driven approach, since these capitalize on structural features in the data.
B. Video Frame Data
For this experiment, we use video data available in MATLAB.** The video consists of 120 grayscale frames. The camera is positioned near one spot on the road, and cars travel along that road (more or less from the top to the bottom of the frame as the video progresses), so the only changes from frame to frame are cars entering and leaving the scene.
We compare the performance of our truncated t-SVDMII, with M the DCT matrix, against the truncated matrix and truncated HOSVD approximations. We orient the frames as transposed lateral slices in the tensor to confine the change from one lateral slice to the next to the rows where there is car movement.†† Thus, the second dimension of the tensor indexes the 120 frames.
With both the truncated t-SVDMII and truncated matrix SVD approaches we can truncate based on the same energy value. Thus, we get RE in our respective approximations with about the same value, and then we can compare the relative compression. Alternatively, we can fix our energy value and compute our truncated t-SVDMII and find its CR. Then, we can compute the truncated matrix SVD approximation with similar CR and compare its RE to the tensor-based approximation. We show some results for each of these two ways of comparison.
There are many ways of choosing the truncation 3-tuple (k1, k2, k3) for the HOSVD. Trying to choose a 3-tuple that has a comparable relative approximation error to our approach would be cumbersome and would still leave ambiguities in the selection process. Thus, we employ two truncation strategies that yield approximations best matching the CRs of our tensor approximation. The indices are chosen as follows: 1) compress only in the second mode (i.e., vary k2 and do not truncate the other two modes) and 2) choose truncation parameters in all dimensions such that the modewise compression-to-dimension ratios are about the same. The second option amounts to looping over candidate truncation levels and scaling each index in proportion to its dimension. Thus, it is possible to compute the CR for the tr-HOSVD in advance to find the closest match to the desired compression levels. The results are given in Table 3.
Table 3.
Video experiment results
| | t-SVDMII | Matrix (matched RE) | Matrix (matched CR) | tr-HOSVD (a) | tr-HOSVD (b) |
| Experiment 1: CR | 4.76 | 1.83 | 4.76 | 4.95 | 4.90 |
| Experiment 1: RE | 0.044 | 0.045 | 0.093 | 0.098 | 0.065 |
| Experiment 2: CR | 10.10 | 2.54 | 10.87 | 10.75 | 10.42 |
| Experiment 2: RE | 0.063 | 0.064 | 0.120 | 0.125 | 0.090 |
Matrix-based compression can be run for a predefined relative energy (which determines the RE) or set to achieve a desired compression, so we performed both. For experiment 1, the k2 and (k1, k2, k3) values that gave the same compression results were 25 and (92, 69, 69), respectively; for the second experiment these were 11 and (70, 53, 53), respectively. Truncations for the matrix case were 65 and 25 for the first experiment and 47 and 11 for the second.
We can also visualize the impact of the compression schemes. In Fig. 4 we give the corresponding reconstructed representations of frames 10 and 54 for the four methods at comparable CR for the second experiment (the second row-block of the table, i.e., the results corresponding to the most compression). Cars disappear altogether and/or artifacts make it appear as though cars may be in the frame when they are not; at these compression levels, the matrix and tr-HOSVD approximations suffer from a ghosting effect.
Fig. 4.
Various reconstructions of frames 10 and 54 (Left and Right). Top to bottom: original, truncated t-SVDMII, truncated matrix SVD at the same CR, and the two tr-HOSVD variants. See Table 3 for numerical results.
C. Hyperspectral Image Compression
We compare the performance of the t-SVDMII and the truncated HOSVD, based on our results in Corollary 6.3, on hyperspectral data. The hyperspectral data come from flyover images of the Washington, DC mall taken with a sensor that measured 210 bands of visible and infrared spectra (25). After removing the opaque atmospheric levels, the data consist of 191 images corresponding to different wavelengths, each of size 307 × 1,280.
To reduce the computational cost of this comparison, and to allow for equal truncation in both spatial dimensions, we resize the images to square images using MATLAB's imresize. We store the data in a tensor whose third dimension indexes the 191 wavelengths. We choose this orientation because we use a data-driven M, the transpose of the matrix whose columns are the left singular vectors of the mode-3 unfolding. We can expect a high correlation, and hence high compressibility (i.e., many ρ_i will be 0), along the third dimension, as the tube fibers correspond to exactly the same spatial location at each wavelength.
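A sketch of this data-driven choice of M (our own illustration, reusing the unfold and t_svdmii helpers from earlier; real data and an orientation with the wavelengths along the third dimension are assumed):

```python
import numpy as np

def data_driven_M(A):
    """Transpose of the matrix of left singular vectors of the mode-3 unfolding (real data)."""
    U3, _, _ = np.linalg.svd(unfold(A, 2), full_matrices=False)
    return U3.T          # orthogonal, so it satisfies the unitary-multiple assumption

# Hypothetical usage:  M = data_driven_M(A);  rho, factors = t_svdmii(A, M, gamma=0.99)
```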
For a fair comparison, we relate the t-SVDMII, the tr-HOSVD, and the TT-SVD via the truncation parameter k of the truncated t-SVDM. For the t-SVDMII, based on Theorem 5.5, we use the corresponding energy parameter γ. For the tr-HOSVD, based on Theorem 6.2, we use a variety of multilinear truncations for various choices of the truncation indices. For the TT-SVD, based on Theorem 6.4, we choose the accuracy threshold such that the truncation of the first unfolding is k. Note we first permute the tensor so that the first TT-SVD unfolding is the mode-2 unfolding. For completeness, we include the matrix SVD and apply Theorem 5.3. We display the RE-vs.-CR results in Fig. 5.
Fig. 5.
Performance of various representations for hyperspectral data compression. Each shape represents a different method; the methods are related by the truncation parameter k, and each color represents a different choice of k. The HOSVD results depicted by the squares correspond to a multilinear rank that maximally compresses the first two dimensions, such that Theorem 6.2 applies, and does not compress the third dimension, so as to minimize the RE. The HOSVD results depicted by the asterisks are many different choices of multilinear rank such that Theorem 6.2 applies: for each k, we depict combinations in which either the spatial truncation equals k and the third-dimension truncation is any larger value, or vice versa. The RE depends most on the spatial truncation and the CR depends most on the third-dimension truncation. This is due to the orientation and structure of the data; the third dimension is highly compressible, and thus truncating it does not significantly worsen the approximation. Comparing shapes, the t-SVDMII produces the best results, clearly outperforming all cases of the SVD and HOSVD and outperforming the TT-SVD, most noticeably for better approximations (lower RE). Comparing colors, for a fixed value of k the t-SVDMII produces the smallest RE, providing empirical support for the theory.
The results use the t-SVDMII storage method described in Section 6D. Even when counting the storage of M, the t-SVDMII outperforms the matrix SVD and the tr-HOSVD, is highly competitive with the TT-SVD, and outperforms the TT-SVD for larger values of k (i.e., less truncation, but better approximation quality). For example, if we compare the t-SVDMII result shown by the dark gray diamond with the TT-SVD result shown by the light gray circle, we see that the t-SVDMII produces a better approximation for roughly the same CR. The color-coded results in Fig. 5 follow directly from writing the tr-HOSVD in our ⋆M-framework and applying the optimality condition in Theorem 6.2, and from relating the TT-SVD to the matrix SVD and using our Eckart–Young result in Theorem 5.3. Specifically, we see that for each choice of k the t-SVDMII approximation has a smaller RE than all of the other methods.
We also note that for the HOSVD in Fig. 5 a large number of combinations of multirank parameters are possible, and the choice can drastically change the approximation–compression relationship. Finding a reasonable tr-HOSVD multirank via trial and error may not be feasible in practice. In comparison, the t-SVDMII and the TT-SVD each have only one parameter, which impacts the approximation–compression relationship more predictably, particularly because the approximation quality is known a priori.
D. Extension to Four Dimensions and Higher
Although the algorithms and optimality results were described for third-order tensors, the algorithmic approach and optimality theory can be extended to higher-order tensors, since the definitions of the tensor-tensor products extend to higher-order tensors in a recursive fashion, as shown in ref. 26. Ideas based on truncation of such products for compression, higher-order tensor completion, and robust PCA can be found in the literature (19, 27). As noted in ref. 11, a similar recursive construct can be used for higher-order tensors under the ⋆M-product, or different combinations of transform-based products can be used along different modes. We give the four-dimensional t-SVDMII algorithm in ref. 23. Due to space constraints, we do not discuss this further here.
9. Conclusions and Ongoing Work
We have proved Eckart–Young theorems for the t-SVDM and t-SVDMII. Importantly, we showed that the truncated t-SVDM and t-SVDMII representations give better approximations to the data than the corresponding matrix-based SVD compression. Although the superiority of tensor-based compression has been observed by many in the literature, our result proves it theoretically. By interpreting the HOSVD in the ⋆M-framework, we developed the relationship between the HOSVD and the t-SVDM and t-SVDMII and then applied our Eckart–Young results to explain how our tensor approximations improve on the Frobenius-norm error. We were also able to apply our Eckart–Young theorems to compare the Frobenius-norm error with that of a TT-SVD approximation.
We briefly considered both multisided compression and extensions of our work to higher-order tensors. In future work, we will investigate means of optimizing the truncation parameters, the energy level, and the transform M such that the upper bound on the error is minimized while the total storage is also minimized.
The choice of M defining the tensor-tensor product should be tailored to the data for the best compression. Therefore, how best to design M to suit a given dataset will be pursued in future work.
Acknowledgments
M.E.K.'s work was partially supported by grants from the IBM Thomas J. Watson Research Center, the Tufts T-Tripods Institute (under NSF Harnessing the Data Revolution grant CCF-1934553), and NSF grant DMS-1821148.
Footnotes
The authors declare no competing interest.
This article is a PNAS Direct Submission.
*If a tensor has second dimension equal to 1, the MATLAB command squeeze returns the corresponding matrix obtained by removing the singleton dimension.
†Since computation of the best rank-k CP decomposition requires an iterative method and the decomposition does not impose orthogonality, CP results are not included for comparison in this study.
‡The reader should regard the elements of the tensor as tube fibers under ⋆M. These form a free module (see ref. 11). The analogy to elemental inner-product-like definitions over the underlying free module induced by ⋆M on the space of tube fibers is given in section 4 of ref. 11 for general ⋆M-products and in section 3 of ref. 13 for the t-product. The notion of orthogonal/unitary tensors is therefore consistent with this generalization of inner products, which is captured in the first part of the definition.
§The term "t-rank" is exclusive to this tensor decomposition and should not be confused with the rank of a tensor (the tensor rank), which was defined in Section 2.
¶Specifically, we mean the minimizer of the Frobenius norm of the discrepancy between the original and the approximated tensor.
#If we want to choose a different mode for the first unfolding we would correspondingly permute the tensor prior to decomposing with our method.
∥In MATLAB, this would be obtained by using the permute command with dimension order [3 2 1].
**The video data are built into MATLAB and can be loaded using trafficVid = VideoReader('traffic.mj2').
††Orienting with transposed frames did not affect performance significantly.
Data Availability
There are no data underlying this work.
References
- 1. Stewart G. W., On the early history of the singular value decomposition. SIAM Rev. 35, 551–566 (1993).
- 2. Eckart C., Young G., The approximation of one matrix by another of lower rank. Psychometrika 1, 211–218 (1936).
- 3. Watkins D. S., Fundamentals of Matrix Computations (Wiley, ed. 3, 2010).
- 4. Hitchcock F. L., The expression of a tensor or a polyadic as a sum of products. J. Math. Phys. 6, 164–189 (1927).
- 5. Carroll J. D., Chang J., Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika 35, 283–319 (1970).
- 6. Harshman R. A., "Foundations of the PARAFAC procedure: Models and conditions for an 'explanatory' multi-modal factor analysis" (UCLA Working Papers in Phonetics, 16, 1970).
- 7. Tucker L. R., "Implications of factor analysis of three-way matrices for measurement of change" in Problems in Measuring Change, Harris C. W., Ed. (University of Wisconsin Press, Madison, WI, 1963), pp. 122–137.
- 8. De Lathauwer L., De Moor B., Vandewalle J., A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl. 21, 1253–1278 (2000).
- 9. Oseledets I. V., Tensor-train decomposition. SIAM J. Sci. Comput. 33, 2295–2317 (2011).
- 10. Kilmer M. E., Martin C. D., Factorization strategies for third-order tensors. Lin. Algebra Appl. 435, 641–658 (2011).
- 11. Kernfeld E., Kilmer M., Aeron S., Tensor-tensor products with invertible linear transforms. Lin. Algebra Appl. 485, 545–570 (2015).
- 12. Kolda T., Bader B., Tensor decompositions and applications. SIAM Rev. 51, 455–500 (2009).
- 13. Kilmer M. E., Braman K., Hao N., Hoover R. C., Third-order tensors as operators on matrices: A theoretical and computational framework with applications in imaging. SIAM J. Matrix Anal. Appl. 34, 148–172 (2013).
- 14. Grasedyck L., Hierarchical singular value decomposition of tensors. SIAM J. Matrix Anal. Appl. 31, 2029–2054 (2010).
- 15. Hackbusch W., Tensor Spaces and Numerical Tensor Calculus (Springer Series in Computational Mathematics, Springer, 2012), vol. 42.
- 16. Vannieuwenhoven N., Vandebril R., Meerbergen K., A new truncation strategy for the higher-order singular value decomposition. SIAM J. Sci. Comput. 34, A1027–A1052 (2012).
- 17. Hillar C. J., Lim L., Most tensor problems are NP-Hard. J. ACM 60, 1–39 (2013).
- 18. Hao N., Kilmer M. E., Braman K., Hoover R. C., Facial recognition using tensor-tensor decompositions. SIAM J. Imag. Sci. 6, 457–463 (2013).
- 19. Zhang Z., Ely G., Aeron S., Hao N., Kilmer M., "Novel methods for multilinear data completion and de-noising based on tensor-SVD" in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2014), pp. 3842–3849.
- 20. Zhang Y., et al., "Multi-view spectral clustering via tensor-SVD decomposition" in 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI, 2017), pp. 493–497.
- 21. Sagheer S. V. M., George S. N., Kurien S. K., Despeckling of 3D ultrasound image using tensor low rank approximation. Biomed. Signal Process. Contr. 54, 101595 (2019).
- 22. Ballester-Ripoll R., Lindstrom P., Pajarola R., TTHRESH: Tensor compression for multidimensional visual data. arXiv [Preprint] (2018). https://arxiv.org/abs/1806.05952 (Accessed 1 December 2020).
- 23. Kilmer M. E., Horesh L., Avron H., Newman E., Tensor-tensor algebra for optimal representation and compression. arXiv [Preprint] (2019). https://arxiv.org/abs/2001.00046 (Accessed 1 December 2020).
- 24. Georghiades A. S., Belhumeur P. N., Kriegman D. J., From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell. 23, 643–660 (2001).
- 25. Landgrebe D., Biehl L., An introduction and reference for MultiSpec (2019). https://engineering.purdue.edu/biehl/MultiSpec/. Accessed 1 December 2020.
- 26. Martin C. D., Shafer R., LaRue B., An order-p tensor factorization with applications in imaging. SIAM J. Sci. Comput. 35, A474–A490 (2013).
- 27. Ely G., Aeron S., Hao N., Kilmer M., 5D seismic data completion and denoising using a novel class of tensor decompositions. Geophysics 80, V83–V95 (2015).