Author manuscript; available in PMC: 2023 Jun 1.
Published in final edited form as: IEEE Trans Inf Theory. 2022 Feb 18;68(6):3991–4019. doi: 10.1109/tit.2022.3152733

Optimal High-order Tensor SVD via Tensor-Train Orthogonal Iteration

Yuchen Zhou 1, Anru R Zhang 2, Lili Zheng 3, Yazhen Wang 4
PMCID: PMC9585995  NIHMSID: NIHMS1809459  PMID: 36274655

Abstract

This paper studies a general framework for high-order tensor SVD. We propose a new computationally efficient algorithm, tensor-train orthogonal iteration (TTOI), that aims to estimate the low tensor-train rank structure from the noisy high-order tensor observation. The proposed TTOI consists of initialization via TT-SVD [1] and new iterative backward/forward updates. We develop the general upper bound on estimation error for TTOI with the support of several new representation lemmas on tensor matricizations. By developing a matching information-theoretic lower bound, we also prove that TTOI achieves the minimax optimality under the spiked tensor model. The merits of the proposed TTOI are illustrated through applications to estimation and dimension reduction of high-order Markov processes, numerical studies, and a real data example on New York City taxi travel records. The software of the proposed algorithm is available online (https://github.com/Lili-Zheng-stat/TTOI).

Index Terms—Tensor SVD, tensor-train, high-order tensors, orthogonal iteration, minimax optimality, high-order Markov chain

I. Introduction

Tensors, or high-order arrays, have attracted increasing attention in modern machine learning, computational mathematics, statistics, and data science. Specific examples include recommender systems [2], [3], neuroimaging analysis [4], [5], latent variable learning [6], multidimensional convolution [7], signal processing [8], neural networks [9], [10], computational imaging [11], [12], and contingency tables [13], [14]. In addition to low-order tensors (i.e., tensors of relatively small order), high-order tensors also commonly arise in applications in statistics and machine learning. For example, in convolutional neural networks, parameters in fully connected layers can be represented as high-order tensors [15], [16]. In an order-d Markov process, where the future state depends jointly on the current and (d − 1) previous states, the transition probabilities form an order-(d + 1) tensor. For an order-d Markov decision process, the transition probabilities can be represented by an order-(2d + 1) tensor, with the additional d modes representing the past d actions. High-order tensors are also used to represent the joint probability in Markov random fields [17].

Compared to low-order tensors, high-order tensors encompass many more parameters and a more sophisticated structure, leading to prohibitive costs in storage, processing, and analysis: an order-$d$ dimension-$p$ tensor contains $p^d$ parameters. To address this issue, some low-dimensional parametrization is usually considered to capture the most informative subspaces in the tensor. In particular, the tensor-train (TT) decomposition [18], [19], [20], [1], [21] introduced a classic low-dimensional parameterization to model the subspaces and latent cores in high-order tensor structures. TT decomposition has been used in a wide range of applications in physics and quantum computation [22], [18], [21], [23], [24], signal processing [8], and supervised learning [25], among many others. For example, the TT decomposition framework is utilized in quantum information science for modeling complex quantum states and handling the quantum mean value problem [22], [18], [21], [23]. The TT-decomposition of a tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$ is defined as follows:

$$\mathcal{X}_{i_1,\ldots,i_d} = G_{1,[i_1,:]}\, \mathcal{G}_{2,[:,i_2,:]} \cdots \mathcal{G}_{d-1,[:,i_{d-1},:]}\, G_{d,[i_d,:]} = \sum_{\alpha_1=1}^{r_1} \cdots \sum_{\alpha_{d-1}=1}^{r_{d-1}} G_{1,[i_1,\alpha_1]}\, \mathcal{G}_{2,[\alpha_1,i_2,\alpha_2]} \cdots \mathcal{G}_{d-1,[\alpha_{d-2},i_{d-1},\alpha_{d-1}]}\, G_{d,[i_d,\alpha_{d-1}]}. \tag{1}$$

Here, the smallest values of $r_1, \ldots, r_{d-1}$ that enable the decomposition (1) are called the TT-rank of $\mathcal{X}$. [1] shows that the TT-rank satisfies $r_k = \mathrm{rank}([\mathcal{X}]_k)$, i.e., the rank of the $k$th sequential unfolding of $\mathcal{X}$ (see the formal definition of sequential unfolding in Section II-A). $G_1 \in \mathbb{R}^{p_1 \times r_1}$, $\mathcal{G}_k \in \mathbb{R}^{r_{k-1} \times p_k \times r_k}$, and $G_d \in \mathbb{R}^{p_d \times r_{d-1}}$ are the TT-cores that multiply sequentially like a "train": $\mathcal{X}_{i_1,\ldots,i_d}$ equals the product of the $i_1$th vector of $G_1$, the $i_2$th matrix of $\mathcal{G}_2$, ..., the $i_{d-1}$th matrix of $\mathcal{G}_{d-1}$, and the $i_d$th vector of $G_d$. For convenience of presentation, we simplify (1) to

$$\mathcal{X} = [\![ G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d ]\!]$$

and denote $r_0 = r_d = 1$ throughout the paper. In particular, the TT-rank and TT decomposition reduce to the regular matrix rank and decomposition when $d = 2$. If all dimensions $p$ and ranks $r$ are the same, the TT parameterization involves $O(2pr + (d-2)pr^2)$ values, which can be significantly smaller than the $O(r^d + dpr)$ values of the Tucker decomposition and the $O(p^d)$ values of the regular parameterization.
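To make the TT format in (1) concrete, the following minimal NumPy sketch (our own illustration, not the authors' released software) contracts a list of TT-cores into the full tensor; the function name tt_to_tensor and the core shapes simply follow the conventions stated above.

```python
import numpy as np

def tt_to_tensor(cores):
    """Contract TT-cores into the full tensor, following (1).

    cores[0] has shape (p1, r1), cores[k] has shape (r_k, p_{k+1}, r_{k+1})
    for the middle cores, and cores[-1] has shape (p_d, r_{d-1}).
    """
    d = len(cores)
    full = cores[0]                                   # order-2: (p1, r1)
    for k in range(1, d - 1):
        # contract the trailing rank index with the first mode of the next core
        full = np.tensordot(full, cores[k], axes=([-1], [0]))
    # last core: contract over r_{d-1} (its second mode)
    return np.tensordot(full, cores[-1], axes=([-1], [1]))

# Example: a random order-4 TT tensor with p = 5 and TT-rank (2, 3, 2).
rng = np.random.default_rng(0)
p, ranks = 5, (2, 3, 2)
cores = [rng.standard_normal((p, ranks[0])),
         rng.standard_normal((ranks[0], p, ranks[1])),
         rng.standard_normal((ranks[1], p, ranks[2])),
         rng.standard_normal((p, ranks[2]))]
X = tt_to_tensor(cores)
print(X.shape)  # (5, 5, 5, 5)
```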

In most of the existing literature, the TT-decomposition was considered under deterministic settings, and the central goal was often to approximate nonrandom high-order tensors by low-dimensional structures [26], [27], [1]. However, in modern data science applications such as Markov processes, Markov decision processes, and Markov random fields, the (transition) probability tensor computed from data is often a random realization of an underlying true tensor. In these cases, estimating the underlying low-dimensional parameters hidden in the noisy observations can be more important: an accurate estimate of the transition tensor renders reliable prediction of future states in high-order Markov chains and better decision-making in high-order Markov decision processes; an accurate estimate of a probability tensor sheds light on the underlying relationship among different variables in a random system [17]. To achieve such a goal, it is crucial to develop dimension reduction methods that incorporate the TT-decomposition into probabilistic models. Since singular value decomposition (SVD) is one of the most important dimension reduction methods involving probabilistic models for matrices, and there is no counterpart for high-order tensors, we aim to fill this void by developing a statistical framework and a computationally feasible method for high-order tensor SVD in this paper.

A. Problem Formulation

This paper focuses on the following high-order tensor SVD model. Suppose we observe an order-$d$ tensor $\mathcal{Y}$ that contains a hidden tensor-train (TT) low-rank structure:

$$\mathcal{Y} = \mathcal{X} + \mathcal{Z}, \qquad \mathcal{Y}, \mathcal{X}, \mathcal{Z} \in \mathbb{R}^{p_1 \times \cdots \times p_d}. \tag{2}$$

Here, $\mathcal{X}$ is TT-decomposable as (1) and $\mathcal{Z}$ is a noise tensor. Our goal is to estimate $\mathcal{X}$ and the TT-cores of $\mathcal{X}$ based on $\mathcal{Y}$. To this end, a straightforward idea is to minimize the approximation error as follows:

$$\hat{\mathcal{X}} = \mathop{\arg\min}_{\mathcal{A} \text{ is decomposable as (1)}} \|\mathcal{Y} - \mathcal{A}\|_F^2. \tag{3}$$

However, the approximation error minimization (3) is highly non-convex, and finding the global optimum, even when the rank is $r_1 = \cdots = r_{d-1} = 1$, is NP-hard in general [28]. Instead, a variety of computationally feasible methods have been proposed in the literature to approximate the best tensor-train low-rank decomposition. TT-SVD, a sequential singular value thresholding scheme, was introduced by [1] and will be discussed in detail later. [1] also proposed TT-rounding via sequential QR decompositions, which reduces the TT-rank while ensuring approximation accuracy. [29] introduced the alternating minimal energy algorithm to approximately reconstruct a TT-low-rank tensor based on only a small proportion of revealed entries of the target tensor. [30, Section L.2] proposed a sketching-based algorithm for fast low-TT-rank approximation of arbitrary tensors. [26] studied the tensor-train decomposition for functional tensors. [31] proposed the FastTT algorithm for fast sparse tensor decomposition based on parallel vector rounding and TT-rounding. [32] studied dynamical approximation in the TT format for time-dependent tensors. [33] proposed alternating least squares for tensor completion in the TT format. [34] studied the completion of low-TT-rank tensors and applications to color image and video recovery. [35] studied Riemannian optimization methods for TT decomposition and completion. Also see [36] for a TT decomposition library in TensorFlow. To the best of our knowledge, the estimation performance of most of these procedures remains unclear. Departing from these existing works, in this paper we make a first attempt to minimize the estimation error of $\mathcal{X}$, in addition to achieving minimal approximation error, under possibly random settings.

B. Our Contributions

Under Model (2), we make the following contributions to high-order tensor SVD in this paper.

First, we propose a new algorithm, Tensor-Train Orthogonal Iteration (TTOI), that provides computationally efficient estimation of the low-rank TT structure from the noisy observation. The proposed algorithm includes two major steps. First, we obtain initial estimates $\hat{G}_1^{(0)}, \hat{\mathcal{G}}_2^{(0)}, \ldots, \hat{\mathcal{G}}_{d-1}^{(0)}, \hat{G}_d^{(0)}$ by performing forward sequential SVD based on matricizations and projections. This step is known as TT-SVD in the literature [1]. Next, we use this initialization and perform the newly developed backward and forward updates alternately and iteratively. The TTOI procedure is discussed in detail in Section II.

To see why the TTOI iterations yield better estimates than the classic TT-SVD method, recall that TT-SVD first performs singular value thresholding on $[\mathcal{Y}]_1$, the first sequential unfolding of $\mathcal{Y}$, without any additional updates (see the detailed TT-SVD procedure and the formal definition of $[\mathcal{Y}]_1$ in Section II-A); this can be inaccurate because $[\mathcal{Y}]_1$, a $p_1$-by-$\prod_{k=2}^{d} p_k$ matrix, has a great number of columns. In contrast, each TTOI iteration utilizes the intermediate outcome of the previous iteration to substantially reduce the dimension of $[\mathcal{Y}]_1$ before performing singular value thresholding. In Figure 1, we provide a simple simulation example showing that even one TTOI iteration can significantly improve the estimation of the left singular subspace of $G_1$ (left panel) and of the overall tensor $\mathcal{X}$ (right panel). Therefore, one-step TTOI, i.e., the initialization followed by one TTOI iteration, can be used in practice when the computational cost is a concern.

Fig. 1.

Average estimation error (dots) and standard deviation (bars) of $\|\sin\Theta(\hat{U}_1, U_1)\|$ and $\|\hat{\mathcal{X}} - \mathcal{X}\|_F$ for TT-SVD and one-step TTOI. Both algorithms are applied to the observation $\mathcal{Y}$ generated from (2), where $\mathcal{Z} \overset{i.i.d.}{\sim} N(0, \sigma^2)$ and $\mathcal{X}$ is a randomly generated order-5 tensor based on (1) with $p = 20$, $r = 1$, and $G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d \overset{i.i.d.}{\sim} N(0, 1)$.

We develop theoretical guarantees for TTOI. In particular, we introduce a series of representation lemmas for tensor matricizations in the TT format. Based on them, we develop a deterministic upper bound on the estimation error for both the forward and backward updates in the TTOI iterations. Under the benchmark setting of the spiked tensor model, we develop matching upper and lower bounds and prove that the proposed TTOI algorithm achieves the minimax optimal rate of estimation error. To the best of our knowledge, this is the first statistical optimality result for high-order tensors in the TT format. We also prove that, for any high-order tensor, the TTOI iterations have monotonically decreasing approximation error with respect to the iteration index.

Moreover, to break the curse of dimensionality in high-order Markov processes, we study state-aggregatable high-order Markov processes and establish a key connection to TT-decomposable tensors. We propose a TTOI estimator for the transition probability tensor of high-order state-aggregatable Markov processes and establish its theoretical guarantee. We conduct simulation experiments to demonstrate the performance of TTOI and validate our theoretical findings. We also apply our method to analyze a New York City taxi dataset. Modeling taxi trips as trajectories realized from a citywide Markov chain, we find that Manhattan traffic exhibits high-order Markovian dependence and that the proposed TTOI reveals latent traffic patterns and a meaningful partition of Manhattan traffic zones. Finally, we discuss several further applications of the proposed algorithm, including transition probability tensor estimation in high-order Markov decision processes and joint probability tensor estimation in Markov random fields.

C. Related Literature

In addition to the aforementioned literature on TT decomposition, our work is also related to a substantial body of work on matrix/tensor decomposition and SVD, the spiked tensor model, and related topics. This literature spans a range of communities, including applied mathematics, information theory, machine learning, scientific computing, signal processing, and statistics. Here we review existing work in these communities without claiming that the survey is exhaustive.

First, matrix singular value thresholding has been commonly used and extensively studied in various problems in data science, including matrix denoising [37], [38], [39], matrix completion [40], [41], [42], [43], principal component analysis (PCA) [44], and Markov chain state aggregation [45]. Similar tasks have also been widely considered for tensors of order 3 or higher. In particular, to perform SVD and decomposition for tensors with Tucker low-rank structures, [46], [47] introduced the higher-order SVD (HOSVD) and higher-order orthogonal iteration (HOOI). [48] established the statistical and computational limits of tensor SVD, compared the theoretical properties of HOSVD and HOOI, and proved that HOOI achieves both statistical and computational optimality. [49] introduced the sequentially truncated higher-order singular value decomposition (ST-HOSVD). [50] introduced a thresholding-and-projection-based algorithm for sparse tensor SVD. A non-exhaustive list of methods for SVD and decomposition of tensors with CP low-rank structures includes alternating least squares [51], [52], an eigendecomposition-based approach [53], enhanced line search [54], power iteration with SVD-based initialization [6], and simultaneous diagonalization and higher-order SVD [55].

In addition, the spiked tensor model and tensor principal component analysis (tensor PCA) are widely discussed in the literature. [56], [57], [58], [59], [60], [61] considered the statistical and computational limits of the rank-1 spiked tensor model. [62] studied the statistical and computational phase transitions and the theoretical properties of the approximate message passing (AMP) algorithm under a Bayesian spiked tensor model. [63], [64] developed regularization-based methods for tensor PCA. [65], [66], [67], [68] studied robust tensor PCA to handle possible outliers in the tensor observation.

Different from the Tucker and CP decompositions, which have been a focal point of the enormous existing literature on tensors, we focus on the TT structure of high-order tensors for the following reasons: (1) the Tucker and CP decompositions do not involve the sequential structure of different modes, i.e., they still hold if the $d$ modes are arbitrarily permuted, whereas in applications such as high-order Markov processes, high-order Markov decision processes, and fully connected layers of deep neural networks, the order of the modes can be crucial; (2) the number of entries involved in the low-Tucker-rank parameterization grows exponentially with the order $d$ (i.e., $r^d$); (3) methods that exploit CP low-rank structure can be numerically unstable for high-order tensors, as pointed out by [27]. In comparison, the TT structure incorporates the order of different modes sequentially and involves far fewer parameters for high-order tensors, which makes it more suitable in many scenarios.

In Section V, we further discuss the application of TTOI to high-order Markov processes and state aggregation. This problem is related to a body of literature on dimension reduction and state aggregation for Markov processes, which we also review in Section V.

D. Organization

The rest of the article is organized as follows. In Section II, after a brief introduction of the notation and preliminaries, we introduce the procedure of the tensor-train orthogonal iteration. The theoretical results, including three representation lemmas, a general estimation error bound, and the minimax optimal upper and lower bounds under the spiked tensor model, are provided in Sections III and IV. The application to high-order Markov chains is discussed in Section V. The simulation and real data analysis are provided in Sections VI-A and VI-B, respectively. Discussions and further applications to Markov random fields and high-order Markov decision processes are briefly discussed in Section VII. All technical proofs are provided in Section A.

II. Procedure of Tensor-Train Orthogonal Iteration

A. Notation and Preliminaries

We first introduce the notation and preliminaries used throughout the paper. We use lowercase letters, e.g., $x, y, z$, to denote scalars or vectors. We use $C, c, C_0, c_0, \ldots$ to denote generic constants, whose actual values may change from line to line. A random variable $z$ is $\sigma$-sub-Gaussian if $\mathbb{E} e^{t(z - \mathbb{E}z)} \leq e^{\sigma^2 t^2/2}$ for any $t$. We say $a \lesssim b$ or $a = O(b)$ if $a \leq Cb$ for some uniform constant $C > 0$. We write $a = \tilde{O}(b)$ if $a = O(b \log^{C'}(b))$ for some constant $C' > 0$. Capital letters, e.g., $X, Y, Z$, are used to denote matrices. Specifically, $\mathbb{O}_{p,r} = \{U \in \mathbb{R}^{p \times r} : U^\top U = I_r\}$ is the set of all $p$-by-$r$ matrices with orthonormal columns. For $U \in \mathbb{O}_{p,r}$, let $U_\perp \in \mathbb{O}_{p, p-r}$ be an orthonormal complement of $U$, and let $P_U = U U^\top$ denote the projection matrix onto the column space of $U$. For any matrix $A \in \mathbb{R}^{p_1 \times p_2}$, let $A = \sum_{i=1}^{p_1 \wedge p_2} s_i u_i v_i^\top$ be its singular value decomposition, where $s_1(A) \geq \cdots \geq s_{p_1 \wedge p_2}(A) \geq 0$ are the singular values of $A$ in non-increasing order. Define $s_{\min}(A) = s_{p_1 \wedge p_2}(A)$, $\mathrm{SVD}_r^L(A) = [u_1 \cdots u_r] \in \mathbb{O}_{p_1, r}$, and $\mathrm{SVD}_r^R(A) = [v_1 \cdots v_r] \in \mathbb{O}_{p_2, r}$ as the smallest non-trivial singular value, the leading $r$ left singular vectors, and the leading $r$ right singular vectors of $A$, respectively. We also write $\mathrm{SVD}^L(A) = \mathrm{SVD}^L_{p_1 \wedge p_2}(A)$ and $\mathrm{SVD}^R(A) = \mathrm{SVD}^R_{p_1 \wedge p_2}(A)$ for the collections of all left and right singular vectors of $A$, respectively. Define the Frobenius and spectral norms of $A$ as $\|A\|_F = \sqrt{\sum_{i=1}^{p_1}\sum_{j=1}^{p_2} A_{ij}^2} = \sqrt{\sum_{i=1}^{p_1 \wedge p_2} s_i^2(A)}$ and $\|A\| = s_1(A) = \max_{x \in \mathbb{R}^{p_2}} \|Ax\|_2/\|x\|_2$, respectively. For any two matrices $U \in \mathbb{R}^{m_1 \times n_1}$ and $V \in \mathbb{R}^{m_2 \times n_2}$, let

$$U \otimes V = \begin{bmatrix} U_{11}V & \cdots & U_{1 n_1}V \\ \vdots & \ddots & \vdots \\ U_{m_1 1}V & \cdots & U_{m_1 n_1}V \end{bmatrix} \in \mathbb{R}^{(m_1 m_2) \times (n_1 n_2)}$$

be their Kronecker product. To quantify the distance between subspaces, we define the principal angles between $U, \hat{U} \in \mathbb{O}_{p,r}$ as the $r$-by-$r$ diagonal matrix $\Theta(U, \hat{U}) = \mathrm{diag}(\arccos(s_1), \ldots, \arccos(s_r))$, where $s_1 \geq \cdots \geq s_r \geq 0$ are the singular values of $U^\top \hat{U}$. Define the $\sin\Theta$ norm as

$$\|\sin\Theta(U, \hat{U})\| = \left\|\mathrm{diag}\left(\sin(\arccos(s_1)), \ldots, \sin(\arccos(s_r))\right)\right\| = \sqrt{1 - s_r^2}.$$
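The $\sin\Theta$ distance can be computed directly from the singular values of $U^\top \hat{U}$; the following short NumPy sketch (our own helper, not part of the paper's software) does exactly that.

```python
import numpy as np

def sin_theta(U, U_hat):
    """Spectral sin-Theta distance between the column spaces of U and U_hat.

    Both inputs are assumed to have orthonormal columns (p x r); the distance
    equals sqrt(1 - s_r^2), where s_r is the smallest singular value of U^T U_hat.
    """
    s = np.linalg.svd(U.T @ U_hat, compute_uv=False)
    s_min = np.clip(s[-1], -1.0, 1.0)
    return np.sqrt(1.0 - s_min ** 2)

# Example: two nearby 2-dimensional subspaces of R^10.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((10, 2)))
U_hat, _ = np.linalg.qr(U + 0.1 * rng.standard_normal((10, 2)))
print(sin_theta(U, U_hat))
```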

Boldface calligraphic letters, e.g., $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$, are used to denote tensors. For an order-$d$ tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$ and $1 \leq k \leq d-1$, we define $[\mathcal{X}]_k \in \mathbb{R}^{(p_1 \cdots p_k) \times (p_{k+1} \cdots p_d)}$ as the sequential unfolding of $\mathcal{X}$, whose rows enumerate all indices in Modes $1, \ldots, k$ and whose columns enumerate all indices in Modes $k+1, \ldots, d$, respectively. That is, for any $1 \leq k \leq d$ and $1 \leq i_k \leq p_k$,

$$\left([\mathcal{X}]_k\right)_{\xi_1(i_1,\ldots,i_d;k),\ \xi_2(i_1,\ldots,i_d;k)} = \mathcal{X}_{i_1 \cdots i_d},$$

where $\xi_1(i_1, \ldots, i_d; k) = (i_k - 1)p_1 \cdots p_{k-1} + (i_{k-1} - 1)p_1 \cdots p_{k-2} + \cdots + i_1$ and $\xi_2(i_1, \ldots, i_d; k) = (i_d - 1)p_{k+1} \cdots p_{d-1} + (i_{d-1} - 1)p_{k+1} \cdots p_{d-2} + \cdots + i_{k+1}$. Following the convention of the reshape function in MATLAB, we define the reshape of any matrix $X$ of dimension $p_1 \cdots p_k \times p_{k+1} \cdots p_d$ as the inverse operation of tensor matricization: $\mathcal{X} = \mathrm{Reshape}(X, p_1, p_2, \ldots, p_d)$ if $X = [\mathcal{X}]_k$. For any two matrices $A \in \mathbb{R}^{q_1 \times q_2 q_3}$ and $\tilde{A} \in \mathbb{R}^{q_1 q_2 \times q_3}$, we write $\tilde{A} = \mathrm{Reshape}(A, q_1 q_2, q_3)$ and $A = \mathrm{Reshape}(\tilde{A}, q_1, q_2 q_3)$ if and only if

$$\tilde{A}_{(i_2 - 1)q_1 + i_1,\, i_3} = A_{i_1,\, (i_3 - 1)q_2 + i_2}, \qquad 1 \leq i_j \leq q_j,\ j = 1, 2, 3.$$

We also define the tensor Frobenius norm of $\mathcal{X}$ as $\|\mathcal{X}\|_F^2 = \sum_{i_1=1}^{p_1} \cdots \sum_{i_d=1}^{p_d} \mathcal{X}_{i_1,\ldots,i_d}^2$. For any matrix $A \in \mathbb{R}^{p_1 \times p_2}$ and any tensor $\mathcal{B} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$, let $\mathrm{vec}(A)$ and $\mathrm{vec}(\mathcal{B})$ be the vectorizations of $A$ and $\mathcal{B}$, respectively. Formally, for any $1 \leq k \leq d$ and $1 \leq i_k \leq p_k$,

$$\left(\mathrm{vec}(\mathcal{B})\right)_{(i_d - 1)p_1 \cdots p_{d-1} + (i_{d-1} - 1)p_1 \cdots p_{d-2} + \cdots + i_1} = \mathcal{B}_{i_1,\ldots,i_d}.$$
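As an illustration of the sequential unfolding and the MATLAB-style (column-major) Reshape just defined, the following NumPy sketch (ours; seq_unfold is a hypothetical helper name) implements $[\mathcal{X}]_k$ and checks that the column-major reshape inverts it.

```python
import numpy as np

def seq_unfold(X, k):
    """k-th sequential unfolding [X]_k: rows enumerate modes 1..k and columns
    enumerate modes k+1..d, with earlier modes varying fastest (column-major),
    matching the index maps xi_1 and xi_2 above."""
    p = X.shape
    return np.reshape(X, (int(np.prod(p[:k])), int(np.prod(p[k:]))), order='F')

# The MATLAB-style Reshape in the text is the inverse operation:
# np.reshape(M, p, order='F') recovers the tensor from [X]_k.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 4, 5))
M = seq_unfold(X, 2)                                    # shape (6, 20)
assert np.allclose(np.reshape(M, X.shape, order='F'), X)
print(M.shape)
```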

B. Procedure of Tensor-Train Orthogonal Iteration

We are now in a position to introduce the procedure of the Tensor-Train Orthogonal Iteration (TTOI). The pseudocode of the overall procedure is given in Algorithm 1. TTOI includes three main parts: we first run the initialization, then perform the backward update and forward update alternately and iteratively.


  • Part 1: Initialization. First, we obtain initial estimates of the TT-cores $G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d$. This step is the tensor-train singular value decomposition (TT-SVD) originally introduced by [1].
    1. Let $R_1^{(0)} = [\mathcal{Y}]_1$ be the unfolding of $\mathcal{Y}$ along Mode 1. We compute the top-$r_1$ SVD of $R_1^{(0)}$. Let $\hat{U}_1^{(0)} \in \mathbb{O}_{p_1, r_1}$ be the first $r_1$ left singular vectors of $R_1^{(0)}$ and calculate $\tilde{R}_1^{(0)} = (\hat{U}_1^{(0)})^\top R_1^{(0)} \in \mathbb{R}^{r_1 \times (p_2 \cdots p_d)}$. Then, $\hat{U}_1^{(0)}$ is an initial estimate of the subspace in which $G_1$ lies, and $\tilde{R}_1^{(0)}$ can be seen as the projection residual.
    2. Next, we realign the entries of $\tilde{R}_1^{(0)} \in \mathbb{R}^{r_1 \times (p_2 \cdots p_d)}$ into $R_2^{(0)} \in \mathbb{R}^{(r_1 p_2) \times (p_3 \cdots p_d)}$, where the rows and columns of $R_2^{(0)}$ correspond to indices of Modes 1, 2 and Modes 3, …, d, respectively. Then, we evaluate the top-$r_2$ SVD of $R_2^{(0)}$. Let $\hat{U}_2^{(0)}$ be the first $r_2$ left singular vectors of $R_2^{(0)}$ and evaluate $\tilde{R}_2^{(0)} = (\hat{U}_2^{(0)})^\top R_2^{(0)} \in \mathbb{R}^{r_2 \times (p_3 \cdots p_d)}$. Again, $\hat{U}_2^{(0)}$ is an estimate of the singular subspace in which $\mathcal{G}_2$ lies, and $\tilde{R}_2^{(0)}$ is the projection residual for the next calculation.
    3. We apply Step 2 to $\tilde{R}_2^{(0)}$ to obtain $\hat{U}_3^{(0)} \in \mathbb{O}_{r_2 p_3, r_3}$ and $\tilde{R}_3^{(0)} \in \mathbb{R}^{r_3 \times (p_4 \cdots p_d)}$; …; and apply Step 2 to $\tilde{R}_{d-2}^{(0)}$ to obtain $\hat{U}_{d-1}^{(0)} \in \mathbb{O}_{r_{d-2} p_{d-1}, r_{d-1}}$ and $\tilde{R}_{d-1}^{(0)} \in \mathbb{R}^{r_{d-1} \times p_d}$. Then we reshape the matrix $\hat{U}_k^{(0)} \in \mathbb{R}^{(p_k r_{k-1}) \times r_k}$ into the tensor $\hat{\mathcal{U}}_k^{(0)} \in \mathbb{R}^{r_{k-1} \times p_k \times r_k}$ for $k = 2, \ldots, d-1$. Now, $(\hat{U}_1^{(0)}, \hat{\mathcal{U}}_2^{(0)}, \ldots, \hat{\mathcal{U}}_{d-1}^{(0)}, \tilde{R}_{d-1}^{(0)})$ yields the initial estimates of the TT-cores of $\mathcal{X}$, and we expect that
      $$\mathcal{X} \approx \mathcal{X}^{(0)} = [\![ \hat{U}_1^{(0)}, \hat{\mathcal{U}}_2^{(0)}, \ldots, \hat{\mathcal{U}}_{d-1}^{(0)}, \tilde{R}_{d-1}^{(0)} ]\!].$$
    The initialization step is summarized in Algorithm 1(a) and illustrated in Figure 2 (a minimal NumPy sketch of this initialization is given after Remark II.2 below). In summary, we perform an SVD on a "residual" $R_k^{(0)}$ sequentially for $k = 1, \ldots, d-1$. As will be shown in Lemma III.3, $R_k^{(0)}$ satisfies
    $$R_k^{(0)} = \left(I_{p_k} \otimes \hat{U}_{k-1}^{(0)}\right)^\top \cdots \left(I_{p_2 \cdots p_k} \otimes \hat{U}_1^{(0)}\right)^\top [\mathcal{Y}]_k,$$
    where $[\mathcal{Y}]_k \in \mathbb{R}^{(p_1 \cdots p_k) \times (p_{k+1} \cdots p_d)}$ is the $k$th sequential unfolding of $\mathcal{Y}$ (see the definition in Section II-A). This quantity plays a key role in the backward update described next.

    The initialization step mainly exploits the left singular spaces of $[\mathcal{X}]_k$ while ignoring the information contained in the right singular spaces. For this reason, we develop the following new backward update, which utilizes both the left and right singular space estimates from the previous step to refine the estimates. Similarly, we can perform a forward update to further improve the outcome of the backward update, and then iteratively alternate between backward and forward updates. The two updates are described in detail below, and a further explanation is given in Remark II.1.

  • Part 2: Backward update. For iterations $t = 1, 3, 5, \ldots$, we perform the backward update, i.e., we sequentially obtain $\hat{V}_d^{(t)}, \ldots, \hat{V}_2^{(t)}$ based on the intermediate results of the $(t-1)$st iteration (the 0th iteration is the initialization). The pseudocode of the backward update is provided in Algorithm 1(b). The calculation in Algorithm 1(b) is equivalent to
    $$\hat{V}_d^{(t)} = \mathrm{SVD}^R\left(\tilde{R}_{d-1}^{(t-1)}\right),$$
    $$\hat{V}_k^{(t)} = \mathrm{SVD}^R\left(\tilde{R}_{k-1}^{(t-1)}\left(\hat{V}_d^{(t)} \otimes I_{p_k \cdots p_{d-1}}\right) \cdots \left(\hat{V}_{k+1}^{(t)} \otimes I_{p_k}\right)\right)$$
    for $k = d-1, \ldots, 2$, and
    $$\hat{V}_1^{(t)} = [\mathcal{Y}]_1\left(\hat{V}_d^{(t)} \otimes I_{p_2 \cdots p_{d-1}}\right) \cdots \left(\hat{V}_3^{(t)} \otimes I_{p_2}\right)\hat{V}_2^{(t)} \in \mathbb{R}^{p_1 \times r_1}.$$
    Here,
    $$\tilde{R}_k^{(t-1)} = \left(\hat{U}_k^{(t-1)}\right)^\top\left(I_{p_k} \otimes \hat{U}_{k-1}^{(t-1)}\right)^\top \cdots \left(I_{p_2 \cdots p_k} \otimes \hat{U}_1^{(t-1)}\right)^\top [\mathcal{Y}]_k$$
    is the projection residual from the intermediate outcome of the $(t-1)$st iteration. Then, we reshape $\hat{V}_k^{(t)} \in \mathbb{R}^{(p_k r_k) \times r_{k-1}}$ into $\hat{\mathcal{V}}_k^{(t)} \in \mathbb{R}^{r_{k-1} \times p_k \times r_k}$. The backward updated estimate is
    $$\hat{\mathcal{X}}^{(t)} = [\![ \hat{V}_1^{(t)}, \hat{\mathcal{V}}_2^{(t)}, \ldots, \hat{\mathcal{V}}_{d-1}^{(t)}, \hat{V}_d^{(t)} ]\!].$$
    Remark II.1 (Interpretation of the backward update). The backward update utilizes and extracts the right singular vectors of the intermediate products of the $(t-1)$st iteration,
    $$\tilde{R}_k^{(t-1)} = \left(\hat{U}_k^{(t-1)}\right)^\top\left(I_{p_k} \otimes \hat{U}_{k-1}^{(t-1)}\right)^\top \cdots \left(I_{p_2 \cdots p_k} \otimes \hat{U}_1^{(t-1)}\right)^\top [\mathcal{Y}]_k,$$
    as opposed to the entire data matrix $[\mathcal{Y}]_k$. Such a dimension reduction scheme is the key to the backward update: it simultaneously reduces the dimension of the matrix of interest, $[\mathcal{Y}]_k$, and the noise therein, while preserving the signal strength. Different from the initialization step, the backward update utilizes the information from both the forward and backward singular subspaces of the tensor-train structure of $\mathcal{X}$. See Section III for more illustration.
  • Part 3: Forward update. For iterations $t = 2, 4, 6, \ldots$, we perform the forward update, i.e., we sequentially obtain $\hat{U}_1^{(t)}, \ldots, \hat{U}_d^{(t)}$ based on the intermediate results of the $(t-1)$st iteration. Essentially, the forward update can be seen as a reversal of the backward update obtained by flipping all modes of the tensor $\mathcal{Y}$. The pseudocode of this procedure is collected in Algorithm 1(c). Recall that $[\mathcal{Y}]_1(\hat{V}_d^{(t-1)} \otimes I_{p_2 \cdots p_{d-1}}) \cdots (\hat{V}_3^{(t-1)} \otimes I_{p_2})\hat{V}_2^{(t-1)}$ is the intermediate product from the $(t-1)$st update. We sequentially compute
    $$\hat{U}_1^{(t)} = \mathrm{SVD}^L\left([\mathcal{Y}]_1\left(\hat{V}_d^{(t-1)} \otimes I_{p_2 \cdots p_{d-1}}\right) \cdots \left(\hat{V}_3^{(t-1)} \otimes I_{p_2}\right)\hat{V}_2^{(t-1)}\right);$$
    $$\hat{U}_k^{(t)} = \mathrm{SVD}^L\left(\left(I_{p_k} \otimes \hat{U}_{k-1}^{(t)}\right)^\top \cdots \left(I_{p_2 \cdots p_k} \otimes \hat{U}_1^{(t)}\right)^\top [\mathcal{Y}]_k\left(\hat{V}_d^{(t-1)} \otimes I_{p_{k+1} \cdots p_{d-1}}\right) \cdots \left(\hat{V}_{k+2}^{(t-1)} \otimes I_{p_{k+1}}\right)\hat{V}_{k+1}^{(t-1)}\right)$$
    for $k = 2, \ldots, d-1$, and
    $$\hat{U}_d^{(t)} = \left[\left(\hat{U}_{d-1}^{(t)}\right)^\top\left(I_{p_{d-1}} \otimes \hat{U}_{d-2}^{(t)}\right)^\top \cdots \left(I_{p_{d-1} \cdots p_2} \otimes \hat{U}_1^{(t)}\right)^\top [\mathcal{Y}]_{d-1}\right]^\top \in \mathbb{R}^{p_d \times r_{d-1}}.$$
    Reshape $\hat{U}_k^{(t)} \in \mathbb{R}^{(p_k r_{k-1}) \times r_k}$ into $\hat{\mathcal{U}}_k^{(t)} \in \mathbb{R}^{r_{k-1} \times p_k \times r_k}$ for $k = 2, \ldots, d-1$. Then, compute
    $$\hat{\mathcal{X}}^{(t)} = [\![ \hat{U}_1^{(t)}, \hat{\mathcal{U}}_2^{(t)}, \ldots, \hat{\mathcal{U}}_{d-1}^{(t)}, \hat{U}_d^{(t)} ]\!].$$

    We explain the algebraic schemes in the TTOI procedure through several representation lemmas in Section III-A. We also show in Theorem III.2 that the objective value $\|\mathcal{Y} - \hat{\mathcal{X}}^{(t)}\|_F^2$ is monotonically decreasing in the iteration index $t$. In large-scale scenarios where performing many iterations is beyond the available computing capacity, we can reduce the number of iterations, even to $t_{\max} = 1$, i.e., the one-step iteration, which often yields sufficiently accurate estimates, as we illustrate in both theory and simulation studies. A similar phenomenon has recently been discovered for HOOI in Tucker low-rank tensor decomposition [69].

    Remark II.2 (Computational and storage costs of TTOI). We consider the computational and storage costs of TTOI for a $p$-dimensional, rank-$r$, order-$d$ dense tensor. Since computing the first $r$ singular vectors of an $m \times n$ matrix via the block power method requires $\tilde{O}(mnr)$ operations, the initialization costs $\tilde{O}(p^d r)$ operations, and each iteration of TTOI, including the forward and backward updates, costs $O(p^d r)$ operations. Therefore, the total number of operations of TTOI with $T$ iterations is $\tilde{O}(p^d r) + O(T p^d r)$, which is not significantly more than the number of elements of the target tensor. Moreover, TTOI requires $O(p^d)$ storage, which is not significantly more than the storage cost of the original tensor.
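As referenced in Part 1 above, the following NumPy sketch illustrates the TT-SVD initialization (the forward sweep of Algorithm 1(a)). It is our own minimal implementation for illustration only, not the authors' released code; tt_svd_init is a hypothetical name, and the column-major reshapes follow the conventions of Section II-A.

```python
import numpy as np

def tt_svd_init(Y, ranks):
    """TT-SVD initialization (a sketch of Algorithm 1(a)).

    Y is an order-d NumPy array and ranks = (r_1, ..., r_{d-1}).
    Returns the initial cores (U_1, U_2, ..., U_{d-1}, R_tilde_{d-1}^T).
    """
    p = Y.shape
    d = len(p)
    cores = []
    R = np.reshape(Y, (p[0], -1), order='F')         # R_1^(0): mode-1 unfolding
    r_prev = 1
    for k in range(d - 1):
        U, _, _ = np.linalg.svd(R, full_matrices=False)
        U = U[:, :ranks[k]]                          # top-r_k left singular vectors
        cores.append(U if k == 0 else
                     np.reshape(U, (r_prev, p[k], ranks[k]), order='F'))
        R = U.T @ R                                  # projection residual R_tilde_k
        if k < d - 2:
            # realign: rows now index (r_k, p_{k+1}), columns the remaining modes
            R = np.reshape(R, (ranks[k] * p[k + 1], -1), order='F')
        r_prev = ranks[k]
    cores.append(R.T)                                # last core: p_d x r_{d-1}
    return cores

# Demo: recover the cores of a noisy rank-(1,1) order-3 tensor.
rng = np.random.default_rng(0)
p, d, r = 20, 3, 1
a, b, c = rng.standard_normal((3, p))
X = np.einsum('i,j,k->ijk', a, b, c)                 # TT-rank (1, 1)
Y = X + 0.1 * rng.standard_normal(X.shape)
cores = tt_svd_init(Y, (r,) * (d - 1))
print([core.shape for core in cores])                # [(20, 1), (1, 20, 1), (20, 1)]
```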

Fig. 2.

A Pictorial Illustration of Initialization (Algorithm 1(a), d = 3)

III. Theoretical Analysis

This section is devoted to the theoretical analysis of the proposed procedure. For convenience, we introduce the following two abbreviations for sequential matrix products: for $M_i \in \mathbb{R}^{(p_i r_{i-1}) \times r_i}$, $1 \leq i \leq d-1$, and $B_j \in \mathbb{R}^{(r_j p_j) \times r_{j-1}}$, $2 \leq j \leq d$, we denote

$$M^{(L)}_{\mathrm{prod},k} = \left(I_{p_2 \cdots p_k} \otimes M_1\right) \cdots \left(I_{p_k} \otimes M_{k-1}\right) M_k \in \mathbb{R}^{(p_1 \cdots p_k) \times r_k}, \qquad 1 \leq k \leq d-1,$$
$$B^{(R)}_{\mathrm{prod},k} = \left(B_d \otimes I_{p_k \cdots p_{d-1}}\right) \cdots \left(B_{k+1} \otimes I_{p_k}\right) B_k \in \mathbb{R}^{(p_k \cdots p_d) \times r_{k-1}}, \qquad 2 \leq k \leq d.$$

Equivalently, $M^{(L)}_{\mathrm{prod},k}$ and $B^{(R)}_{\mathrm{prod},k}$ can be defined sequentially as

$$M^{(L)}_{\mathrm{prod},1} = M_1, \qquad M^{(L)}_{\mathrm{prod},k+1} = \left(I_{p_{k+1}} \otimes M^{(L)}_{\mathrm{prod},k}\right) M_{k+1}, \qquad 1 \leq k \leq d-2,$$
$$B^{(R)}_{\mathrm{prod},d} = B_d, \qquad B^{(R)}_{\mathrm{prod},k} = \left(B^{(R)}_{\mathrm{prod},k+1} \otimes I_{p_k}\right) B_k, \qquad 2 \leq k \leq d-1.$$
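These recursions translate directly into code. The sketch below (our own illustration, with assumed helper names forward_prods and backward_prods) builds all $M^{(L)}_{\mathrm{prod},k}$ and $B^{(R)}_{\mathrm{prod},k}$ via np.kron and checks that the dimensions agree with the definitions above.

```python
import numpy as np

def forward_prods(Ms, p):
    """All M^(L)_prod,k, k = 1, ..., d-1, from M_k of shape (p_k r_{k-1}, r_k),
    via M_prod,1 = M_1 and M_prod,k+1 = (I_{p_{k+1}} kron M_prod,k) M_{k+1}."""
    prods = [Ms[0]]
    for k in range(1, len(Ms)):                       # Ms[k] plays the role of M_{k+1}
        prods.append(np.kron(np.eye(p[k]), prods[-1]) @ Ms[k])
    return prods

def backward_prods(Bs, p):
    """All B^(R)_prod,k, k = 2, ..., d, from B_k of shape (r_k p_k, r_{k-1}),
    via B_prod,d = B_d and B_prod,k = (B_prod,k+1 kron I_{p_k}) B_k."""
    prods = [Bs[-1]]                                  # Bs[-1] plays the role of B_d
    for j in range(len(Bs) - 2, -1, -1):              # Bs[j] plays the role of B_{j+2}
        prods.insert(0, np.kron(prods[0], np.eye(p[j + 1])) @ Bs[j])
    return prods

# Dimension check with d = 4, p_k = 3, and TT-ranks (2, 2, 2).
rng = np.random.default_rng(0)
p, r = [3, 3, 3, 3], [1, 2, 2, 2, 1]                  # r[0] = r_0 = 1, r[4] = r_4 = 1
Ms = [rng.standard_normal((p[i - 1] * r[i - 1], r[i])) for i in range(1, 4)]
Bs = [rng.standard_normal((r[k] * p[k - 1], r[k - 1])) for k in range(2, 5)]
print([M.shape for M in forward_prods(Ms, p)])        # [(3, 2), (9, 2), (27, 2)]
print([B.shape for B in backward_prods(Bs, p)])       # [(27, 2), (9, 2), (3, 2)]
```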

A. Representation Lemmas for High-order Tensors

Since the computation of high-order tensors with tensor-train structures involves extensive tensor algebra, we introduce the following three lemmas on the matrix representation of high-order tensors. These lemmas play a fundamental role in the later theoretical analysis.

Lemma III.1 (Representation of the sequential matricization of a TT-decomposable tensor). Suppose $\mathcal{X} = [\![ G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d ]\!]$. Then the sequential matricizations of $\mathcal{X}$ can be written as

$$[\mathcal{X}]_k = \left(I_{p_2 \cdots p_k} \otimes G_1\right)\left(I_{p_3 \cdots p_k} \otimes [\mathcal{G}_2]_2\right) \cdots \left(I_{p_k} \otimes [\mathcal{G}_{k-1}]_2\right)[\mathcal{G}_k]_2\, [\mathcal{G}_{k+1}]_1\left([\mathcal{G}_{k+2}]_1 \otimes I_{p_{k+1}}\right) \cdots \left([\mathcal{G}_{d-1}]_1 \otimes I_{p_{k+1} \cdots p_{d-2}}\right)\left(G_d^\top \otimes I_{p_{k+1} \cdots p_{d-1}}\right). \tag{4}$$

Lemma III.2 (Representation of tensor reshaping). For any tensor $\mathcal{T} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$ and $1 \leq i < j \leq d-1$, we have

$$[\mathcal{T}]_j = \left(I_{p_{i+1} \cdots p_j} \otimes [\mathcal{T}]_i\right) A_{(p_{i+1} \cdots p_j,\ p_{j+1} \cdots p_d)},$$
$$[\mathcal{T}]_i = A_{(p_{i+1} \cdots p_j,\ p_1 \cdots p_i)}^\top \left([\mathcal{T}]_j \otimes I_{p_{i+1} \cdots p_j}\right).$$

Here, we let $e_k^{(ij)}$ denote the $k$th canonical basis vector of $\mathbb{R}^{ij}$ and define

$$A_{(i,j)} = \begin{bmatrix} e_1^{(ij)} & e_{i+1}^{(ij)} & \cdots & e_{i(j-1)+1}^{(ij)} \\ e_2^{(ij)} & e_{i+2}^{(ij)} & \cdots & e_{i(j-1)+2}^{(ij)} \\ \vdots & \vdots & & \vdots \\ e_i^{(ij)} & e_{2i}^{(ij)} & \cdots & e_{ij}^{(ij)} \end{bmatrix} \in \mathbb{R}^{(i^2 j) \times j}. \tag{5}$$
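For concreteness, the sketch below constructs the selection matrix $A_{(i,j)}$ as we read it from (5) and numerically checks the first identity of Lemma III.2 on a random order-4 tensor. This is our own verification code (A_mat is a hypothetical name), not part of the paper's software, and it assumes the column-major sequential unfolding of Section II-A.

```python
import numpy as np

def A_mat(i, j):
    """Selection matrix A_(i,j) in (5): an i x j block layout whose (m, n)
    block is the canonical basis vector e_{(n-1)i+m} of R^{ij}, giving overall
    shape (i^2 * j, j)."""
    E = np.eye(i * j)
    blocks = [[E[:, n * i + m].reshape(-1, 1) for n in range(j)]
              for m in range(i)]
    return np.block(blocks)

# Check [T]_3 = (I_{p2 p3} kron [T]_1) A_(p2 p3, p4) for a random order-4
# tensor (i = 1, j = 3 in the notation of Lemma III.2).
rng = np.random.default_rng(0)
p = (2, 3, 2, 4)
T = rng.standard_normal(p)
unfold = lambda X, k: np.reshape(X, (int(np.prod(p[:k])), -1), order='F')
lhs = unfold(T, 3)
rhs = np.kron(np.eye(p[1] * p[2]), unfold(T, 1)) @ A_mat(p[1] * p[2], p[3])
print(np.allclose(lhs, rhs))   # should print True
```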

Lemmas III.1 and III.2 can be proved by checking each entry of the corresponding matricizations. In addition, the following lemma provides a representation of sequentially reshaped tensors, in particular of $R_k^{(t)}$ and $\tilde{R}_k^{(t)}$, the key intermediate outcomes of the TTOI procedure.

Lemma III.3 (Representation of sequentially reshaped tensors). Suppose $\mathcal{T} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$, $M_i \in \mathbb{R}^{(r_{i-1} p_i) \times r_i}$ for $1 \leq i \leq d-1$, and $B_i \in \mathbb{R}^{(p_i r_i) \times r_{i-1}}$ for $2 \leq i \leq d$, where $r_0 = r_d = 1$. Consider the following sequential multiplications.

Forward sequential multiplication: Let $S_1 = [\mathcal{T}]_1$. For $k = 1, \ldots, d-1$, calculate

$$\tilde{S}_k = M_k^\top S_k \in \mathbb{R}^{r_k \times (p_{k+1} \cdots p_d)},$$
$$S_{k+1} = \mathrm{Reshape}\left(\tilde{S}_k, r_k p_{k+1}, p_{k+2} \cdots p_d\right) \quad \text{if } k < d-1.$$

Then for any $1 \leq k \leq d-1$,

$$S_k = \left(I_{p_k} \otimes M^{(L)}_{\mathrm{prod},k-1}\right)^\top [\mathcal{T}]_k, \qquad \tilde{S}_k = \left(M^{(L)}_{\mathrm{prod},k}\right)^\top [\mathcal{T}]_k. \tag{6}$$

Here, $I_{p_k} \otimes M^{(L)}_{\mathrm{prod},k-1} = I_{p_1}$ if $k = 1$.

Backward sequential multiplication: Let $W_{d-1} = [\mathcal{T}]_{d-1}$. For $k = d-1, \ldots, 1$, calculate

$$\tilde{W}_k = W_k B_{k+1} \in \mathbb{R}^{(p_1 \cdots p_k) \times r_k},$$
$$W_{k-1} = \mathrm{Reshape}\left(\tilde{W}_k, p_1 \cdots p_{k-1}, p_k r_k\right) \quad \text{if } k > 1.$$

Then for any $1 \leq k \leq d-1$,

$$W_k = [\mathcal{T}]_k\left(B^{(R)}_{\mathrm{prod},k+2} \otimes I_{p_{k+1}}\right), \qquad \tilde{W}_k = [\mathcal{T}]_k B^{(R)}_{\mathrm{prod},k+1}.$$

Here, $B^{(R)}_{\mathrm{prod},k+2} \otimes I_{p_{k+1}} = I_{p_d}$ if $k = d-1$.

In particular, $R_k^{(0)}, \tilde{R}_k^{(0)}$ in Algorithm 1(a) and $R_k^{(t)}, \tilde{R}_k^{(t)}$ ($t \in \{2, 4, 6, \ldots\}$) in Algorithm 1(c) satisfy

$$R_k^{(t)} = \left(I_{p_k} \otimes (\hat{U}^{(t)})^{(L)}_{\mathrm{prod},k-1}\right)^\top [\mathcal{Y}]_k, \qquad \tilde{R}_k^{(t)} = \left((\hat{U}^{(t)})^{(L)}_{\mathrm{prod},k}\right)^\top [\mathcal{Y}]_k, \qquad 1 \leq k \leq d-1. \tag{7}$$

The proof of Lemma III.3 is provided in Section A-H.

B. Deterministic Upper Bounds for Estimation Error of TTOI

Now we are in a position to analyze the performance of TTOI. The following Theorem III.1 provides upper bounds on the estimation error of $\hat{\mathcal{X}}^{(2t+1)}$ (backward update) and $\hat{\mathcal{X}}^{(2t+2)}$ (forward update).

Theorem III.1. Suppose we observe $\mathcal{Y} = \mathcal{X} + \mathcal{Z}$, where $\mathcal{X}$ admits a TT decomposition as in (1).

(A deterministic estimation error bound for backward updates) Let $\tilde{U}_1^{(2t)} = U_1 \in \mathbb{R}^{p_1 \times r_1}$ be the left singular space of $[\mathcal{X}]_1$. For $2 \leq k \leq d-1$, define $\tilde{U}_k^{(2t)} \in \mathbb{R}^{p_k r_{k-1} \times r_k}$ as the left singular subspace of $\left(I_{p_k} \otimes (\hat{U}^{(2t)})^{(L)}_{\mathrm{prod},k-1}\right)^\top [\mathcal{X}]_k$. If for some constant $c_0 \in (0, 1)$,

$$\left\|\sin\Theta\left(\hat{U}_k^{(2t)}, \tilde{U}_k^{(2t)}\right)\right\| \leq c_0, \qquad 1 \leq k \leq d-1, \tag{8}$$

then there exists a constant $C_d > 0$ depending only on $d$ such that the outcome of Algorithm 1(b) satisfies

$$\left\|\hat{\mathcal{X}}^{(2t+1)} - \mathcal{X}\right\|_F^2 \leq C_d\left(\sum_{k=1}^{d-1} A_k^{(2t+1)} + B^{(2t+1)}\right), \tag{9}$$

where

$$A_k^{(2t+1)} = \left\|\left((\hat{U}^{(2t)})^{(L)}_{\mathrm{prod},k}\right)^\top [\mathcal{Z}]_k\left((\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},k+2} \otimes I_{p_{k+1}}\right)\right\|_F^2,$$
$$B^{(2t+1)} = \left\|[\mathcal{Z}]_1 (\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},2}\right\|_F^2.$$

Here, $(\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},k+2} \otimes I_{p_{k+1}} = I_{p_d}$ if $k = d-1$.

(A deterministic estimation error bound for forward updates) For $2 \leq k \leq d-1$, let $\tilde{V}_k^{(2t+1)} \in \mathbb{R}^{(p_k r_k) \times r_{k-1}}$ be the right singular space of $[\mathcal{X}]_{k-1}\left((\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},k+1} \otimes I_{p_k}\right)$, and let $\tilde{V}_d^{(2t+1)} = V_d \in \mathbb{R}^{p_d \times r_{d-1}}$ be the right singular space of $[\mathcal{X}]_{d-1}$. If for some constant $c_0 \in (0, 1)$,

$$\left\|\sin\Theta\left(\hat{V}_k^{(2t+1)}, \tilde{V}_k^{(2t+1)}\right)\right\| \leq c_0, \qquad 2 \leq k \leq d,$$

then there exists a constant $C_d > 0$ depending only on $d$ such that the outcome of Algorithm 1(c) satisfies

$$\left\|\hat{\mathcal{X}}^{(2t+2)} - \mathcal{X}\right\|_F^2 \leq C_d\left(\sum_{k=1}^{d-1} A_k^{(2t+2)} + B^{(2t+2)}\right), \tag{10}$$

where

$$A_k^{(2t+2)} = \left\|\left(I_{p_k} \otimes (\hat{U}^{(2t+2)})^{(L)}_{\mathrm{prod},k-1}\right)^\top [\mathcal{Z}]_k (\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},k+1}\right\|_F^2,$$
$$B^{(2t+2)} = \left\|\left((\hat{U}^{(2t+2)})^{(L)}_{\mathrm{prod},d-1}\right)^\top [\mathcal{Z}]_{d-1}\right\|_F^2.$$

Here, $I_{p_k} \otimes (\hat{U}^{(2t+2)})^{(L)}_{\mathrm{prod},k-1} = I_{p_1}$ if $k = 1$.

The proof of Theorem III.1 is provided in Section A-A. Theorem III.1 shows that the estimation error $\|\hat{\mathcal{X}}^{(t+1)} - \mathcal{X}\|_F^2$ can be bounded by the projected noise terms $A_k^{(t+1)}$ and $B^{(t+1)}$, provided that the estimates from the initialization ($t = 0$) or from the previous iteration ($t \geq 1$), $\{\hat{U}_k^{(t)}\}_{k=1}^{d-1}$ or $\{\hat{V}_k^{(t)}\}_{k=2}^{d}$, are within a constant distance of the true underlying subspaces. The developed upper bound can be significantly smaller than $C\|\mathcal{Z}\|_F^2$, the classic upper bound induced from the approximation error (e.g., Theorem 2.2 in [1]), especially in the high-dimensional setting ($p \gg r$).

Remark III.1 (Interpretation of the error bounds in Theorem III.1). Here, we provide some explanation for $A_k^{(2t+1)}$ and $B^{(2t+1)}$ in the error bound (9). By algebraic calculation, the TT-core estimates in the backward update can be written as

$$\hat{V}_{k+1}^{(2t+1)} = \mathrm{SVD}^R\left\{\left((\hat{U}^{(2t)})^{(L)}_{\mathrm{prod},k}\right)^\top\left([\mathcal{X}]_k + [\mathcal{Z}]_k\right)\left((\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},k+2} \otimes I_{p_{k+1}}\right)\right\}$$

for any $1 \leq k \leq d-1$, and

$$\hat{V}_1^{(2t+1)} = \left([\mathcal{X}]_1 + [\mathcal{Z}]_1\right)(\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},2}.$$

From the definition of $A_k^{(2t+1)}$, we can see that $A_k^{(2t+1)}$ quantifies the error of the singular subspace estimate $\hat{V}_{k+1}^{(2t+1)}$, while $B^{(2t+1)}$ quantifies the error of the projected residual $\hat{V}_1^{(2t+1)}$. By symmetry, a similar interpretation applies to $A_k^{(2t+2)}$ and $B^{(2t+2)}$ in the error bound (10) for the forward update.

Remark III.2 (Proof sketch of Theorem III.1). While the complete proof of Theorem III.1 is provided in Section A-A, we give a brief sketch here.

Without loss of generality, we focus on (9) with $t = 0$; the other cases follow similarly. For convenience, we write $\hat{U}_i, \hat{V}_i$ for $\hat{U}_i^{(0)}, \hat{V}_i^{(1)}$, respectively. First, by Lemma III.1, we can write $[\hat{\mathcal{X}}^{(1)}]_1$, the outcome of the backward update, as

$$[\hat{\mathcal{X}}^{(1)}]_1 = [\mathcal{Y}]_1 P_{(\hat{V}_d \otimes I_{p_2 \cdots p_{d-1}}) \cdots (\hat{V}_3 \otimes I_{p_2})\hat{V}_2}.$$

Then we can further bound the estimation error of $\hat{\mathcal{X}}^{(1)}$ as

$$\left\|\hat{\mathcal{X}}^{(1)} - \mathcal{X}\right\|_F^2 \leq C\left\|[\mathcal{Z}]_1(\hat{V}_d \otimes I_{p_2 \cdots p_{d-1}}) \cdots (\hat{V}_3 \otimes I_{p_2})\hat{V}_2\right\|_F^2 + C_d\sum_{k=2}^{d}\left\|[\mathcal{X}]_1(\hat{V}_d \otimes I_{p_2 \cdots p_{d-1}}) \cdots (\hat{V}_{k+1} \otimes I_{p_2 \cdots p_k})(\hat{V}_{k,\perp} \otimes I_{p_2 \cdots p_{k-1}})\right\|_F^2.$$

Next, based on Lemma III.2 and (8), we can prove

$$\left\|[\mathcal{X}]_1(\hat{V}_d \otimes I_{p_2 \cdots p_{d-1}}) \cdots (\hat{V}_{k+1} \otimes I_{p_2 \cdots p_k})(\hat{V}_{k,\perp} \otimes I_{p_2 \cdots p_{k-1}})\right\|_F = \left\|[\mathcal{X}]_{k-1}(\hat{V}_d \otimes I_{p_k \cdots p_{d-1}}) \cdots (\hat{V}_{k+1} \otimes I_{p_k})\hat{V}_{k,\perp}\right\|_F \leq C_d\left\|\hat{U}_{k-1}^\top\left(I_{p_{k-1}} \otimes \hat{U}_{k-2}\right)^\top \cdots \left(I_{p_2 \cdots p_{k-1}} \otimes \hat{U}_1\right)^\top[\mathcal{X}]_{k-1}(\hat{V}_d \otimes I_{p_k \cdots p_{d-1}}) \cdots (\hat{V}_{k+1} \otimes I_{p_k})\hat{V}_{k,\perp}\right\|_F.$$

Finally, we apply the perturbation projection error bound (Lemma A.3) to prove that

$$C_d\left\|\hat{U}_{k-1}^\top\left(I_{p_{k-1}} \otimes \hat{U}_{k-2}\right)^\top \cdots \left(I_{p_2 \cdots p_{k-1}} \otimes \hat{U}_1\right)^\top[\mathcal{X}]_{k-1}(\hat{V}_d \otimes I_{p_k \cdots p_{d-1}}) \cdots (\hat{V}_{k+1} \otimes I_{p_k})\hat{V}_{k,\perp}\right\|_F \leq C_d\left\|\hat{U}_{k-1}^\top\left(I_{p_{k-1}} \otimes \hat{U}_{k-2}\right)^\top \cdots \left(I_{p_2 \cdots p_{k-1}} \otimes \hat{U}_1\right)^\top[\mathcal{Z}]_{k-1}(\hat{V}_d \otimes I_{p_k \cdots p_{d-1}}) \cdots (\hat{V}_{k+1} \otimes I_{p_k})\right\|_F.$$

Theorem III.1 then follows by combining the inequalities above.

Next, we establish a decomposition formula for the approximation error, i.e., the objective function in (3), $\|\mathcal{Y} - \hat{\mathcal{X}}^{(t)}\|_F^2$, and show that the approximation error is monotonically decreasing through the TTOI iterations.

Theorem III.2 (Approximation error decays through iterations). Apply TTOI to $\mathcal{Y}$ and let $\hat{\mathcal{X}}^{(t)}$ be the outcome after the $t$th iteration. For any $t \geq 1$, we have

$$\text{(Approximation error decay)} \qquad \|\mathcal{Y}\|_F^2 - \|\hat{\mathcal{X}}^{(t+1)}\|_F^2 \leq \|\mathcal{Y}\|_F^2 - \|\hat{\mathcal{X}}^{(t)}\|_F^2, \tag{11}$$
$$\text{(Approximation error decomposition)} \qquad \|\mathcal{Y} - \hat{\mathcal{X}}^{(t+1)}\|_F^2 = \|\mathcal{Y}\|_F^2 - \|\hat{\mathcal{X}}^{(t+1)}\|_F^2. \tag{12}$$

See Section A-B for the proof of Theorem III.2.

IV. TTOI for Tensor-Train Spiked Tensor Model

In this section, we further focus on a probabilistic setting, spiked tensor model, where the noise tensor Z has independent, mean zero, and σ-sub-Gaussian entries (see definition in Section II-A). The spiked tensor model has been widely studied as a benchmark setting for tensor PCA/SVD and dimension reduction in recent literature in machine learning, information theory, statistics, and data science [62], [61], [60], [70], [48]. The central goal therein is to discover the underlying low-rank tensor X. Most of the existing works focused on tensors with Tucker or CP decomposition.

Under the spiked tensor model, we can verify that the initialization step of TTOI gives sufficiently good initial estimates with high probability, matching the condition required in Theorem III.1.

Theorem IV.1 (Probabilistic bounds for initial estimates and projected noise). Suppose $\mathcal{X}$ is TT-decomposable as in (1) and the entries of $\mathcal{Z}$ are independent, zero-mean, and $\sigma$-sub-Gaussian random variables. Denote $p = \min\{p_1, \ldots, p_d\}$. If there exists a constant $C_{gap}$ such that $\lambda_k = s_{r_k}([\mathcal{X}]_k) \geq C_{gap}\left(\left(\sum_{i=1}^{d} p_i r_{i-1} r_i\right)^{1/2} + (p_{k+1} \cdots p_d)^{1/2}\right)\sigma$ for $1 \leq k \leq d-1$, then there exist constants $C, c > 0$ and $C_d > 0$ depending only on $d$ such that, with probability at least $1 - C\exp(-cp)$,

$$\max_{k=1,\ldots,d-1}\left\|\sin\Theta\left(\hat{U}_k^{(0)}, \tilde{U}_k^{(0)}\right)\right\| \leq \frac{1}{2}, \tag{13}$$
$$\max_{\substack{k=1,\ldots,d-1 \\ t=2,4,6,\ldots}}\left\|\sin\Theta\left(\hat{U}_k^{(t)}, \tilde{U}_k^{(t)}\right)\right\| \leq \frac{1}{2}, \qquad \max_{\substack{k=2,\ldots,d \\ t=1,3,5,\ldots}}\left\|\sin\Theta\left(\hat{V}_k^{(t)}, \tilde{V}_k^{(t)}\right)\right\| \leq \frac{1}{2}, \tag{14}$$

and for all $t \geq 1$,

$$\max\left\{A_k^{(t)}, B^{(t)}\right\} \leq C_d\sigma^2\sum_{i=1}^{d} p_i r_i r_{i-1}. \tag{15}$$

Here, $\tilde{U}_k^{(t)}$, $\tilde{V}_k^{(t)}$, $A_k^{(t)}$, and $B^{(t)}$ are defined as in Theorem III.1.

The proof of Theorem IV.1 is provided in Section A-C. Based on Theorems III.1 and IV.1, we can further prove:

Corollary IV.1 (Upper bound on estimation error). Suppose $\mathcal{X}$ can be decomposed as in (1), the entries $\mathcal{Z}_{i_1,\ldots,i_d}$ are independent, zero-mean, and $\sigma$-sub-Gaussian random variables, and $p = \min\{p_1, \ldots, p_d\}$. Suppose there exists a constant $C_{gap}$ such that $\lambda_k = s_{r_k}([\mathcal{X}]_k) \geq C_{gap}\left(\left(\sum_{i=1}^{d} p_i r_{i-1} r_i\right)^{1/2} + (p_{k+1} \cdots p_d)^{1/2}\right)\sigma$ for $1 \leq k \leq d-1$. Then with probability at least $1 - Ce^{-cp}$, for all $t \geq 1$,

$$\left\|\hat{\mathcal{X}}^{(t)} - \mathcal{X}\right\|_F^2 \leq C_d\sigma^2\sum_{i=1}^{d} p_i r_i r_{i-1}. \tag{16}$$

The proof of Corollary IV.1 is provided in Section A-D.

Remark IV.1 (Interpretation of Corollary IV.1). Since the TT-cores $G_1$, $\mathcal{G}_i$ ($2 \leq i \leq d-1$), and $G_d$ have $p_1 r_1$, $p_i r_i r_{i-1}$, and $p_d r_{d-1}$ free parameters, respectively, the upper bound (16) can be seen as the noise level $\sigma^2$ times the degrees of freedom of the low-TT-rank tensor.
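For instance, in simulation setting (4) of Section VI-A with $p_1 = \cdots = p_5 = 20$ and $r_1 = \cdots = r_4 = 2$, the degrees of freedom amount to $20\cdot 2 + 3\cdot(20\cdot 2\cdot 2) + 20\cdot 2 = 320$, whereas the full tensor has $20^5 = 3{,}200{,}000$ entries; the rate in (16) is thus smaller than the trivial bound $\sigma^2 p^d$ obtained from $\hat{\mathcal{X}} = \mathcal{Y}$ by four orders of magnitude.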

Next, we develop a minimax lower bound for the estimation of the low-TT-rank structure. Consider the following general class of tensors with dimension $\boldsymbol{p} = (p_1, \ldots, p_d)$ and TT-rank $\boldsymbol{r} = (r_1, \ldots, r_{d-1})$,

$$\mathcal{F}_{\boldsymbol{p},\boldsymbol{r}}(\lambda) = \left\{\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_d} : \mathcal{X} \text{ can be decomposed as (1)},\ s_{r_k}([\mathcal{X}]_k) \geq \lambda_k,\ 1 \leq k \leq d-1\right\}, \tag{17}$$

and a class of distributions of $\sigma$-sub-Gaussian noise tensors

$$\mathcal{D} = \left\{D : \text{if } \mathcal{Z} \sim D, \text{ then } \mathcal{Z}_{i_1,\ldots,i_d} \text{ are independent, zero-mean, and } \sigma\text{-sub-Gaussian random variables}\right\}. \tag{18}$$

Here, the constraint on the smallest singular value of $[\mathcal{X}]_k$ and the $\sigma$-sub-Gaussian assumption correspond to the conditions required for the upper bound in Theorem IV.1.

Theorem IV.2 (Lower bound). Consider the order-$d$ TT spiked tensor model (2) and the distribution class $\mathcal{D}$ in (18). Assume $p = \min\{p_1, \ldots, p_d\} \geq C_0$ for some large constant $C_0$, $r_1 \leq p_1/2$, $r_i \leq p_i r_{i-1}/2$ and $r_{i-1} \leq p_i r_i/2$ for $2 \leq i \leq d-1$, $r_{d-1} \leq p_d$, and $\lambda_i > 0$. Also assume $r_1 r_2 \leq p_1$ if $d = 3$. Then there exists a constant $c_d > 0$ depending only on $d$ such that

$$\inf_{\hat{\mathcal{X}}}\ \sup_{\mathcal{X} \in \mathcal{F}_{\boldsymbol{p},\boldsymbol{r}}(\lambda),\, D \in \mathcal{D}} \mathbb{E}_{\mathcal{Z} \sim D}\left\|\hat{\mathcal{X}} - \mathcal{X}\right\|_F^2 \geq c_d\sigma^2\sum_{i=1}^{d} p_i r_i r_{i-1}. \tag{19}$$

See Section A-E for the proof of Theorem IV.2.

V. TTOI for Dimension Reduction and State Aggregation in High-order Markov Chain

Since its introduction at the beginning of the 20th century, the Markov process has been ubiquitous in a variety of disciplines. In the literature, the first-order Markov process, in which the future observation at time (t + 1) is conditionally independent of those at times 1, …, (t − 1) given the immediate past observation at time t, has been commonly used and extensively studied. However, high-order Markov processes often appear in scenarios where the future observation is affected by a longer history. For example, in a taxi travel trajectory, the future stop of a taxi depends not only on the current location but also on the past path, which reveals the direction the taxi is heading [71]. High-order Markov processes have also been applied to interpersonal relationships [72], financial econometrics [73], and traffic flow [74], among many other applications.

We specifically consider an ergodic, time-invariant, $(d-1)$st order Markov process on a finite state space $\{1, \ldots, p\}$. That is, the future state $X_{t+d}$ depends jointly on the current state $X_{t+d-1}$ and the previous $(d-2)$ states $(X_{t+d-2}, \ldots, X_{t+1})$:

$$\mathbb{P}\left(X_{t+d} \mid X_1, \ldots, X_{t+d-1}\right) = \mathbb{P}\left(X_{t+d} \mid X_{t+1}, \ldots, X_{t+d-1}\right) = \mathcal{P}_{[X_{t+1}, \ldots, X_{t+d}]}. \tag{20}$$

Our goal is to obtain a reliable estimate of the transition tensor $\mathcal{P}$ and to predict the future state $X_{t+d}$ based on an observable trajectory. Since the total number of free parameters in a $(d-1)$st order Markov transition tensor $\mathcal{P}$ is $O(p^d)$ without further assumptions, inferring $\mathcal{P}$ may be prohibitively difficult, both statistically and computationally, even when $p$ and $d$ are only of moderate size. Instead, a sufficient dimension reduction for high-order Markov processes is in demand.

To enable statistical inference and dimension reduction for high-order Markov processes, a powerful tool, the mixture transition distribution (MTD) model, was introduced [72]. The MTD model assumes that the distribution of the future state is a linear combination of the distributions associated with the $(d-1)$ immediate past states. The reader is also referred to [75] for a survey of the MTD model. The linearity assumption, however, does not take into account potential interactions among past states, which commonly appear in practice. For example, in the New York taxi trip data, the interaction among past locations of a taxi indicates its likely future direction.

On the other hand, there has been a recent surge of developments in dimension reduction and state aggregation for first-order Markov chains. For example, [76] considered Markov chain aggregation and its application to biology; [77] considered the rank-reduced Markov model and mode clustering; [45] considered the Markov rank, aggregability, and lumpability of Markov processes and proposed dimension reduction and state aggregation methods through spectral decomposition with theoretical guarantees; [78] proposed a clustering block model and an efficient algorithm to solve it; [79] introduced convex and non-convex methods to estimate a rank-reduced low-rank Markov transition matrix.

Inspired by these works, we propose and study the following state aggregation model for discrete-time high-order Markov processes.

Definition V.1 ($(d-1)$st order state aggregatable Markov process). Suppose there exist maps $G_1 : [p] \to \mathbb{R}^{r_1}$, $G_k : [p] \times \mathbb{R}^{r_{k-1}} \to \mathbb{R}^{r_k}$ for $k = 2, \ldots, d-1$, and $G_d : [p] \times \mathbb{R}^{r_{d-1}} \to \mathbb{R}$ such that $G_2, \ldots, G_d$ are linear in their second argument: $G_k(X, \lambda_1 u + \lambda_2 v) = \lambda_1 G_k(X, u) + \lambda_2 G_k(X, v)$ for any vectors $u, v$ and scalars $\lambda_1, \lambda_2$. We say a Markov process $\{X_1, X_2, \ldots\}$ is $(d-1)$st order state aggregatable if for all $t \geq 0$ the transition can be generated sequentially as follows:

$$\tilde{P}_1(X_{t+1}) = G_1(X_{t+1}) \in \mathbb{R}^{r_1},$$
$$\tilde{P}_k(X_{t+1}, \ldots, X_{t+k}) = G_k\left(X_{t+k}, \tilde{P}_{k-1}(X_{t+1}, \ldots, X_{t+k-1})\right) \in \mathbb{R}^{r_k}, \qquad k = 2, \ldots, d-1,$$
$$\mathbb{P}\left(X_{t+d} \mid X_1, \ldots, X_{t+d-1}\right) = \mathbb{P}\left(X_{t+d} \mid X_{t+1}, \ldots, X_{t+d-1}\right) = G_d\left(X_{t+d}, \tilde{P}_{d-1}(X_{t+1}, \ldots, X_{t+d-1})\right).$$

In a $(d-1)$st order state aggregatable Markov process, the future state $X_{t+d}$ relies on a sequential aggregation of the previous $(d-1)$ states $X_{t+1}, \ldots, X_{t+d-1}$: we first map $X_{t+1}$ to an $r_1$-dimensional vector $\tilde{P}_1(X_{t+1})$ via $G_1$, then map $\tilde{P}_1(X_{t+1})$ jointly with $X_{t+2}$ to an $r_2$-dimensional vector $\tilde{P}_2(X_{t+1}, X_{t+2})$ via $G_2$. We repeat this projection sequentially for $X_{t+3}, \ldots, X_{t+d}$ and obtain the transition probability $\mathbb{P}(X_{t+d} \mid X_{t+1}, \ldots, X_{t+d-1})$. See also Figure 4 for a pictorial illustration.

Fig. 4.

A pictorial illustration of a (d − 1)st order state aggregatable Markov chain

Based on the definition of a state aggregatable Markov chain, we can show that the corresponding transition probability tensor $\mathcal{P}$ has low TT-rank.

Proposition V.1. The transition tensor $\mathcal{P}$ of the rank-reduced high-order Markov model in Definition V.1 has TT-rank no more than $(r_1, \ldots, r_{d-1})$. In other words, $\mathcal{P}$ satisfies $\mathrm{rank}([\mathcal{P}]_k) \leq r_k$.

The proof of Proposition V.1 is provided in Section A-F.

Next, we focus on a synchronous or generative setting for the high-order Markov process, which can be seen as a high-order generalization of the classic observation model for the analysis of Markov (decision/reward) processes (see [80] for an introduction). Specifically, for each sample index $k = 1, \ldots, n$ and each tuple of previous states $(i_1, \ldots, i_{d-1}) \in [p]^{d-1}$, suppose we observe the next state $X(i_1, \ldots, i_{d-1}; k)$ drawn from the Markov transition tensor $\mathcal{P}$. It is natural to estimate $\mathcal{P}$ via the empirical transition tensor: for $(i_1, \ldots, i_d) \in [p]^d$,

$$\hat{\mathcal{P}}^{emp}_{i_1,\ldots,i_d} = \frac{1}{n}\sum_{k=1}^{n} 1\left\{X(i_1, \ldots, i_{d-1}; k) = i_d\right\}.$$

Then, $\hat{\mathcal{P}}^{emp}$ is an unbiased estimator of $\mathcal{P}$. However, if the entries of $\mathcal{P}$ are approximately balanced, the mean squared error of $\hat{\mathcal{P}}^{emp}$ satisfies

$$\mathbb{E}\left\|\hat{\mathcal{P}}^{emp} - \mathcal{P}\right\|_F^2 = \sum_{i_1,\ldots,i_d}\mathrm{Var}\left(\hat{\mathcal{P}}^{emp}_{i_1,\ldots,i_d}\right) = \sum_{i_1,\ldots,i_{d-1}}\sum_{i_d}\frac{\mathbb{P}(i_d \mid i_1, \ldots, i_{d-1})\left(1 - \mathbb{P}(i_d \mid i_1, \ldots, i_{d-1})\right)}{n} \asymp \frac{p^{d-1}}{n}. \tag{21}$$

To obtain a more accurate estimator, we propose to first perform TTOI on $\hat{\mathcal{P}}^{emp}$ to obtain $\hat{\mathcal{P}}^{(1)}$, and then project each row of $[\hat{\mathcal{P}}^{(1)}]_{d-1}$, or equivalently each mode-$d$ fiber of $\hat{\mathcal{P}}^{(1)}$, onto the simplex $S^{p-1} = \{x \in \mathbb{R}^p : \sum_{i=1}^{p} x_i = 1,\ x_i \geq 0 \text{ for all } 1 \leq i \leq p\}$ via the probability simplex projection (see an implementation in [81]); this yields the final estimate $\hat{\mathcal{P}}$.
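One standard way to carry out the probability simplex projection referenced above is the sort-based Euclidean projection; the NumPy sketch below (our own code with hypothetical function names, not the implementation in [81]) applies it to every mode-d fiber of an estimated transition tensor.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {x : x_i >= 0, sum_i x_i = 1}, via the standard sort-based rule."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def project_fibers(P_hat):
    """Apply the simplex projection to every mode-d fiber (last axis)."""
    return np.apply_along_axis(project_to_simplex, -1, P_hat)

rng = np.random.default_rng(0)
P_hat = rng.standard_normal((4, 4, 4))            # a toy order-3 "estimate"
P_proj = project_fibers(P_hat)
print(np.allclose(P_proj.sum(axis=-1), 1.0), bool((P_proj >= 0).all()))  # True True
```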

We establish an upper bound on estimation error for the TTOI estimator P^.

Proposition V.2. Consider the synchronous or generative model for a $(d-1)$st order state aggregatable Markov process described above. Suppose the initialization condition (8) in Theorem III.1 holds. Then with probability at least $1 - Ce^{-cp}$, the output of one-step TTOI followed by the probability simplex projection satisfies

$$\left\|\hat{\mathcal{P}} - \mathcal{P}\right\|_F^2 \leq C\left(\max_{1 \leq i \leq d-1} r_i\right)\sum_{i=1}^{d} p_i r_i r_{i-1}\big/ n.$$

The proof of Proposition V.2 is provided in Section A-G. Compared to the estimation error rate of $\hat{\mathcal{P}}^{emp}$ in (21), Proposition V.2 shows that TTOI achieves a significantly reduced estimation error by exploiting the low-TT-rank structure of the high-order Markov process.

Remark V.1. If the observations form a single transition trajectory $\{X_0, \ldots, X_N\}$, we can work with the following empirical transition tensor:

$$\hat{\mathcal{P}}^{emp}_{i_1,\ldots,i_d} = \frac{\sum_{t=0}^{N-d+1} 1\{X_t = i_1, \ldots, X_{t+d-1} = i_d\}}{\sum_{t=0}^{N-d+1} 1\{X_t = i_1, \ldots, X_{t+d-2} = i_{d-1}\}} \ \ \text{if } \sum_{t=0}^{N-d+1} 1\{X_t = i_1, \ldots, X_{t+d-2} = i_{d-1}\} > 0; \qquad \hat{\mathcal{P}}^{emp}_{i_1,\ldots,i_d} = \frac{1}{p} \ \ \text{otherwise}. \tag{22}$$

Then $\hat{\mathcal{P}}^{emp}$ is a nearly unbiased and strongly consistent estimator of $\mathcal{P}$. When the Markov process is $(d-1)$st order state aggregatable, we can apply TTOI to obtain a better estimate. As explored in the numerical studies in Section VI-A, the TTOI estimator achieves favorable performance in estimating $\mathcal{P}$.
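A minimal sketch of the estimator (22), assuming states are coded as integers $0, \ldots, p-1$ and using our own function name, is as follows.

```python
import numpy as np

def empirical_transition_tensor(traj, p, d):
    """Empirical order-(d-1) transition tensor (22) from one trajectory.

    traj is a 1-D array of states in {0, ..., p-1}; the output has shape
    (p,)*d, with uniform fibers 1/p wherever a length-(d-1) context never occurs.
    """
    counts = np.zeros((p,) * d)
    for t in range(len(traj) - d + 1):
        counts[tuple(traj[t:t + d])] += 1
    ctx = counts.sum(axis=-1, keepdims=True)              # context counts
    return np.where(ctx > 0, counts / np.maximum(ctx, 1), 1.0 / p)

# Toy example: a second-order chain (d = 3) on p = 3 states.
rng = np.random.default_rng(0)
traj = rng.integers(0, 3, size=2000)
P_emp = empirical_transition_tensor(traj, p=3, d=3)
print(np.allclose(P_emp.sum(axis=-1), 1.0))               # True
```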

VI. Numerical Studies

In this section, we investigate the numerical performance of TTOI.

A. Simulation

In each simulation setting, we report the average estimation error (dots) and standard deviation (bars) over 100 repetitions. We assume the true TT-ranks are known in the first three sets of experiments; afterwards, we introduce a BIC-type data-driven scheme for TT-rank selection and present its numerical performance. All experiments are conducted on a quad-core 2.3 GHz Intel Core i5 processor.

We first consider the tensor-train spiked tensor model (2) discussed in Section IV. Specifically, we randomly generate $G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d$ with i.i.d. standard normal entries and generate $\mathcal{Z}$ with i.i.d. $N(0, \sigma^2)$ or $\mathrm{Unif}(-b, b)$ entries. Let $p_1 = \cdots = p_d = p$, $r_1 = \cdots = r_{d-1} = r$, and consider four settings: (1) $p = 100$, $d = 3$, $r = 1$; (2) $p = 50$, $d = 4$, $r = 1$; (3) $p = 20$, $d = 5$, $r = 1$; (4) $p = 20$, $d = 5$, $r = 2$. For varying values of $\sigma \in [1, 19]$ and $b \in [3, 30]$, we evaluate the estimation error $\|\hat{\mathcal{X}}^{(t)} - \mathcal{X}\|_F$ of the TT-SVD and TTOI estimators with 1 or 2 iterations, i.e., $t_{\max} = 0, 1, 2$. From the results summarized in Figure 5 (normal noise) and Figure 6 (uniform noise), we can see that TTOI, even with one iteration, performs significantly better than TT-SVD, and the advantage becomes more pronounced as the noise level $\sigma$ or $b$ grows. This suggests that the proposed TTOI is effective for high-order tensor SVD compared to the classic TT-SVD, especially when the observations are corrupted by substantial noise. Table I summarizes the runtime of TT-SVD and TTOI, which suggests that the additional computational cost incurred by the backward and forward updates in TTOI is negligible compared to the runtime of the original TT-SVD.

Fig. 5.

Estimation error of TT-SVD and TTOI for the high-order spiked tensor model. Here, $\mathcal{Z} \overset{i.i.d.}{\sim} N(0, \sigma^2)$.

Fig. 6.

Estimation error of TT-SVD and TTOI for the high-order spiked tensor model. Here, $\mathcal{Z} \overset{i.i.d.}{\sim} \mathrm{Unif}(-b, b)$.

TABLE I.

Runtime (in seconds) of TT-SVD, TTOI with 1 iteration, and TTOI with 2 iterations under the high-order spiked tensor model with $\mathcal{Z} \overset{i.i.d.}{\sim} N(0, 400)$. The mean runtime over 50 independent replicates is presented, with standard deviations in parentheses.

(p, d, r) TT-SVD TTOI (tmax = 1) TTOI (tmax = 2)
(100, 3, 1) 0.332 (0.071) 0.334 (0.071) 0.340 (0.074)
(50, 4, 1) 1.165 (0.173) 1.169 (0.172) 1.201 (0.171)
(20, 5, 1) 0.725 (0.093) 0.730 (0.092) 0.751 (0.095)
(20, 5, 2) 0.672 (0.100) 0.676 (0.101) 0.708 (0.103)

To understand the influence of the TT-rank on the performance of the TT-SVD and TTOI estimators, we conduct numerical experiments under the spiked tensor model (2) with $r_1 = \cdots = r_{d-1} = r$ for various values of $r$. In particular, $G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d$ are still generated with i.i.d. standard normal entries, and $\mathcal{Z}$ has i.i.d. $N(0, \sigma^2)$ entries. Letting $p_1 = \cdots = p_d = p$, we consider two settings: (1) $p = 100$, $d = 3$, $\sigma = 20$; (2) $p = 500$, $d = 3$, $\sigma = 100$. For $r = 1, \ldots, 10$, we evaluate the average estimation error $\|\hat{\mathcal{X}}^{(t)} - \mathcal{X}\|_F$ of TT-SVD, TTOI with 1 iteration, and TTOI with 2 iterations (i.e., $t_{\max} = 0, 1, 2$), and present the results in Figure 7. Figure 7 suggests that the estimation errors increase as the rank increases, while TTOI with 1 or 2 iterations performs better than TT-SVD. The improvement of TTOI over TT-SVD is more significant for larger $p$ or smaller $r$. An intuitive explanation for this phenomenon is as follows: the key idea of TTOI is to use the previous updates to reduce the dimension of the sequential unfolding $[\mathcal{Y}]_k$ before performing singular value thresholding, and this dimension reduction is more significant for large $p$ or small $r$.

Fig. 7.

Estimation error of TT-SVD and TTOI for high-order spiked tensor model with varying TT-ranks

Next, we demonstrate the performance of TTOI on transition tensor estimation for the high-order state-aggregatable Markov chains studied in Section V. We consider a $(d-1)$st order Markov chain on $p$ states. To generate the transition tensor $\mathcal{P}$, we first draw $\tilde{G}_1 \in \mathbb{R}^{p \times r}$, $\tilde{\mathcal{G}}_2 \in \mathbb{R}^{r \times p \times r}, \ldots, \tilde{G}_d \in \mathbb{R}^{r \times p}$ with i.i.d. standard normal entries, then normalize the rows of $\tilde{G}_1, \tilde{\mathcal{G}}_2, \ldots, \tilde{G}_d$ in absolute value as

$$G_{1,[i,j]} = \frac{|\tilde{G}_{1,[i,j]}|}{\sum_j |\tilde{G}_{1,[i,j]}|}, \qquad \mathcal{G}_{k,[i_1,i_2,j]} = \frac{|\tilde{\mathcal{G}}_{k,[i_1,i_2,j]}|}{\sum_j |\tilde{\mathcal{G}}_{k,[i_1,i_2,j]}|}, \qquad G_{d,[i,j]} = \frac{|\tilde{G}_{d,[i,j]}|}{\sum_j |\tilde{G}_{d,[i,j]}|}.$$

In this way, $\mathcal{P} = [\![ G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d ]\!]$ satisfies $\mathcal{P}_{i_1,\ldots,i_d} \geq 0$ and $\sum_{i_d=1}^{p}\mathcal{P}_{i_1,\ldots,i_d} = 1$ for any $(i_1, \ldots, i_{d-1})$, so $\mathcal{P}$ forms a Markov transition tensor (a small code sketch of this construction is given below). To generate the trajectory $\{X_1, \ldots, X_N\}$, we draw the initial $(d-1)$ states $X_1, \ldots, X_{d-1}$ i.i.d. uniformly from $[p]$ and then generate $X_d, \ldots, X_N$ sequentially according to (20). To estimate $\mathcal{P}$, we construct the empirical probability tensor $\hat{\mathcal{P}}^{emp}$ by (22), then apply TT-SVD and TTOI with input $\hat{\mathcal{P}}^{emp}$ as described in Section V to obtain $\hat{\mathcal{P}}$. We consider two numerical settings: (1) $p = 100$, $d = 3$, $r = 1$; (2) $p = 50$, $d = 4$, $r = 1$. We evaluate the estimation error $\|\hat{\mathcal{P}} - \mathcal{P}\|_F$ for each setting and summarize the results in Figure 8. Again, TTOI exhibits a clear advantage over the other methods in all simulation settings.
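The construction of $\mathcal{P}$ just described can be sketched as follows (our own illustration; random_tt_transition_tensor is a hypothetical name, not the authors' simulation code).

```python
import numpy as np

def random_tt_transition_tensor(p, d, r, rng):
    """Random transition tensor with TT-rank <= (r, ..., r), following the
    absolute-value row normalization described in the text."""
    cores = [np.abs(rng.standard_normal((p, r)))]
    cores += [np.abs(rng.standard_normal((r, p, r))) for _ in range(d - 2)]
    cores += [np.abs(rng.standard_normal((r, p)))]
    # Normalize so that each "row" sums to one.
    cores[0] /= cores[0].sum(axis=1, keepdims=True)
    for k in range(1, d - 1):
        cores[k] /= cores[k].sum(axis=2, keepdims=True)
    cores[-1] /= cores[-1].sum(axis=1, keepdims=True)
    # Contract the TT cores into the full tensor P.
    P = cores[0]
    for k in range(1, d - 1):
        P = np.tensordot(P, cores[k], axes=([-1], [0]))
    return np.tensordot(P, cores[-1], axes=([-1], [0]))

P = random_tt_transition_tensor(p=5, d=3, r=2, rng=np.random.default_rng(0))
print(np.allclose(P.sum(axis=-1), 1.0))   # each conditional distribution sums to 1
```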

Fig. 8.

Estimation error of the transition tensor versus length of the observable trajectory in high order state-aggregatable Markov chain estimation.

Selection of TT-ranks. The proposed TTOI algorithm requires the TT-ranks $r_1, \ldots, r_{d-1}$ as inputs, and appropriate choices of $r_1, \ldots, r_{d-1}$ are crucial in practice. We propose a data-driven scheme to select the TT-ranks: we choose $r_1, \ldots, r_{d-1} \geq 1$ to minimize the following Bayesian information criterion (BIC) under the spiked tensor model:

$$\mathrm{BIC}(r_1, \ldots, r_{d-1}) = \left(\prod_{k=1}^{d} p_k\right)\log\left\|\mathcal{Y} - \hat{\mathcal{X}}(r_1, \ldots, r_{d-1})\right\|_F^2 + \left(p_1 r_1 + \sum_{k=2}^{d-1} p_k r_{k-1} r_k + p_d r_{d-1}\right)\left(\sum_{k=1}^{d}\log p_k\right). \tag{23}$$

Here, $\hat{\mathcal{X}}(r_1, \ldots, r_{d-1})$ is the output of TTOI (Algorithm 1) with input TT-ranks $r_1, \ldots, r_{d-1}$. This BIC-type criterion was also adopted in prior work on tensor clustering [82].
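A simple way to implement this rank selection is a grid search over candidate TT-ranks; the sketch below (our own code, with hypothetical names) evaluates (23) for a user-supplied fitting routine, e.g., a wrapper around Algorithm 1 or the TT-SVD sketch given after Remark II.2.

```python
import itertools
import numpy as np

def bic_select_ranks(Y, fit, max_rank):
    """Grid search over TT-ranks minimizing the BIC criterion (23).

    `fit(Y, ranks)` is any routine returning the estimate X_hat for the given
    TT-ranks (e.g., TT-SVD or TTOI); this wrapper only evaluates (23).
    """
    p = Y.shape
    d = len(p)
    log_n = sum(np.log(pk) for pk in p)               # sum_k log p_k
    best = None
    for ranks in itertools.product(range(1, max_rank + 1), repeat=d - 1):
        X_hat = fit(Y, ranks)
        df = (p[0] * ranks[0] + p[-1] * ranks[-1]
              + sum(p[k] * ranks[k - 1] * ranks[k] for k in range(1, d - 1)))
        bic = np.prod(p) * np.log(np.sum((Y - X_hat) ** 2)) + df * log_n
        if best is None or bic < best[0]:
            best = (bic, ranks)
    return best[1]
```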

We then conduct numerical experiments under the same setting as the bottom two plots of Figure 5, i.e., the spiked tensor model with Gaussian noise. Figure 9 summarizes the estimation errors of TT-SVD and of TTOI with 1 and 2 iterations, respectively, with the ranks selected by the proposed BIC criterion (23). Comparing Figure 9 with the bottom two plots of Figure 5, we can see that the proposed criterion selects the true ranks accurately, and the performance of both TT-SVD and TTOI with tuned ranks is very similar to that obtained with the true ranks.

Fig. 9.

Average estimation error of TT-SVD and TTOI for high-order spiked tensor model with BIC-tuned ranks.

B. Real Data Experiments

We apply the proposed method to investigate the Manhattan taxi data. This dataset contains the New York City taxi trip records of 14,144 drivers in 2013. We treat each travel record as a transition between locations in New York City, so the overall dataset can be organized as a collection of fragmented sample trajectories of a Markov chain on New York City traffic. Some recent analyses of such data can be found in, e.g., [71], [83], [45].

Due to the high-dimensional spatiotemporal nature of the dataset, sufficient dimension reduction or state aggregation is often a crucial first step in studying metropolitan-wide traffic patterns. To this end, we apply the high-order Markov model described in Section V. Specifically, we discretize the Manhattan region into a grid of $p = 119$ locations, which forms the state space. Then, we collect all travel records in Manhattan for each driver from the dataset, sort them by time, and form Markovian transition trajectories. In particular, each travel record is treated as a transition from the pickup to the drop-off location. If the drop-off location $i$ of the previous trip differs from the pickup location $j$ of the next trip by the same driver, we also form a transition from state $i$ to state $j$. Based on the trajectories, we construct a high-order Markov chain with an order-$d$ empirical transition probability tensor $\hat{\mathcal{P}}^{emp} \in \mathbb{R}^{p \times \cdots \times p}$ as described in Section V. Assuming the true probability tensor is state aggregatable (Definition V.1), we apply the one-step TTOI proposed in Section V and obtain $\hat{\mathcal{P}}$. It is noteworthy that when $d = 2$, the described procedure for $\hat{\mathcal{P}}$ is equivalent to the classic matrix spectral decomposition in the literature. Figure 10 plots the singular values of the sequential unfolding matrices of $\hat{\mathcal{P}}^{emp}$ for $d = 3$, which clearly demonstrates the low TT-rank of the transition probability tensor $\mathcal{P}$. In the following experiments, we focus on the order-2 Markov model and analyze all consecutive pairs of transitions $i \to j \to k$, corresponding to the case $d = 3$.

Fig. 10.

Singular values of the sequential unfolding matrices $[\hat{\mathcal{P}}^{emp}]_1$ (left panel) and $[\hat{\mathcal{P}}^{emp}]_2$ (right panel)

Inspired by classic matrix spectral decomposition methods, we aggregate all location states in Manhattan into a few clusters via both $\hat{\mathcal{P}}$ and $\hat{\mathcal{P}}^{emp}$. Specifically, we calculate $\hat{G}_d$, i.e., the last TT-core of $\hat{\mathcal{P}}$, and $[\hat{\mathcal{P}}^{emp}]_{d-1}$, i.e., the matricization of $\hat{\mathcal{P}}^{emp}$ whose columns correspond to the last mode. We then perform k-means on the columns of $\hat{G}_d$ and of $[\hat{\mathcal{P}}^{emp}]_{d-1}$, record the cluster indices, associate each index with its location state, and plot the results in Figure 11 (Panels (a)(b) are for TTOI and Panels (c)(d) are for the empirical estimate). From Figure 11 (a)(b), we can clearly identify four regions: (i) lower Manhattan (orange), (ii) midtown (dark blue), (iii) the upper west side (green), and (iv) the upper east side (brown or black). In contrast, direct clustering on $\hat{\mathcal{P}}^{emp}$ yields less interpretable results, as the majority of points fall into one cluster. It is also worth noting that, even though location information is not provided in this experiment, the resulting clusters in Figure 11 (a)(b) show good spatial proximity between locations. This illustrates the effectiveness of TTOI in dimension reduction and state aggregation for high-order Markov processes.
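The state aggregation step can be sketched as follows, assuming scikit-learn is available; aggregate_states is our own helper name, and in the experiment the input would be the $r \times p$ core $\hat{G}_d$ returned by TTOI with $p = 119$ and four clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

def aggregate_states(G_d, n_clusters, seed=0):
    """Cluster location states via k-means on the columns of the last TT-core
    G_d (shape r x p): each column is the low-dimensional embedding of one
    destination state."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(G_d.T)          # one cluster label per state

# Toy usage with a random "core" standing in for the TTOI output.
rng = np.random.default_rng(0)
labels = aggregate_states(rng.standard_normal((7, 119)), n_clusters=4)
print(np.bincount(labels))                # cluster sizes
```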

Fig. 11.

State aggregation based on TTOI and empirical estimate

Next, we illustrate the high-order nature of city-wide taxi trips through the following experiment. For each initial state $i \in [p]$, we apply k-means to cluster the columns of $\hat{\mathcal{P}}_{[i,:,:]}$, where $\hat{\mathcal{P}}$ is the outcome of TTOI. We present the results in Figure 12, where the red triangle denotes the given first state $i$ and $r = k = 7$. If city-wide taxi trips had no significant high-order effects, $\hat{\mathcal{P}}$ would be reducible to a first-order Markov process and $\hat{\mathcal{P}}_{[i,:,:]}$ would have similar values for different $i$. However, as we can see from Figure 12, the clustering results depend heavily on the first state $i$, so high-order effects do exist in the city-wide taxi trip Markov process. In addition, the states in different directions relative to $i$ are often clustered into different regions, which suggests that taxi drivers tend to move in the same direction in consecutive trips, yielding the high-order effects in the driving trajectories.

Fig. 12.

Based on the second-order Markov model, the state aggregation results differ across initial states (the red triangle denotes the initial state i in each subfigure)

VII. Discussions and Additional Applications

In this paper, we propose a general framework for high-order tensor SVD. We introduce a novel procedure, the tensor-train orthogonal iteration (TTOI), that efficiently estimates the low-TT-rank structure from a high-order tensor observation. TTOI has significant advantages over the classic methods in the literature. We establish a general deterministic error bound for TTOI with the support of several new representation lemmas for tensor matricizations. Under the commonly studied spiked tensor model, we establish an upper bound for TTOI and a matching information-theoretic lower bound. We also illustrate the merits of TTOI through simulation studies and a real data example on New York City taxi trips.

In addition to high-order Markov processes, the proposed TTOI can also be applied to Markov random field (MRF) estimation. We give a brief description of MRFs below. Consider an undirected graph $G = (V, E)$, where $V = \{1, \ldots, d\}$ is a set of vertices and $E \subseteq V \times V$ is a collection of edges. Each vertex $i \in V$ is associated with a random variable $X_i$ taking values in $\{s_1, \ldots, s_p\}$. In an MRF model, the distribution of $(X_1, \ldots, X_d)$ can be factorized as

$$\mathbb{P}(X_1, \ldots, X_d) = \frac{1}{Z}\prod_{C \in \mathcal{C}}\psi_C(X_C),$$

where $\mathcal{C}$ is a collection of subgraphs of $G$, $X_C = (X_v, v \in C)$ denotes the random vector corresponding to the vertices in $C$, and $Z$ is a normalizing constant. The joint probability function $\mathbb{P}(\cdot)$ can be written as a tensor $\mathcal{P} \in \mathbb{R}^{p \times \cdots \times p}$, where $\mathcal{P}_{i_1,\ldots,i_d} = \mathbb{P}(X_1 = s_{i_1}, \ldots, X_d = s_{i_d})$. MRFs have a wide range of applications, including image analysis [84], [85], genomic studies [86], and natural language processing [87]. The reader is referred to, e.g., [88] for an introduction to MRFs.

A central problem for MRFs is estimating the population distribution P based on a limited number of samples $\{X_1^{(i)},\ldots,X_d^{(i)}\}_{i=1}^{n}$. It is straightforward to estimate P via the empirical probability tensor $\hat P^{\mathrm{emp}}$:

\hat P^{\mathrm{emp}}_{i_1,\ldots,i_d}=\frac{1}{n}\sum_{i=1}^{n}\prod_{k=1}^{d}\mathbf{1}\{X_k^{(i)}=s_{i_k}\}.

We can show that $\hat P^{\mathrm{emp}}$ is unbiased for P. Recently, [17] pointed out that P often has approximately low tensor-train rank in practice. To further exploit this structure, we can conduct TTOI on $\hat P^{\mathrm{emp}}$. Under regularity conditions ensuring that the entries of the noise $Z=\hat P^{\mathrm{emp}}-P$ are bounded and weakly dependent, Corollary IV.1 suggests the following estimation error rate for the TTOI estimator: $\|\hat P-P\|_{\mathrm F}^2\le C\sum_{i=1}^{d}r_ir_{i-1}/(np^{2d-1})$, which can be significantly smaller than the estimation error of the original empirical estimator $\hat P^{\mathrm{emp}}$.
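
A minimal sketch of the empirical estimator is given below; the sample array and the downstream TTOI step (omitted) are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(2)
p, d, n = 4, 3, 5000
samples = rng.integers(0, p, size=(n, d))   # placeholder for n observed configurations (X_1, ..., X_d)

P_emp = np.zeros((p,) * d)
for row in samples:
    P_emp[tuple(row)] += 1.0 / n            # empirical frequency of each configuration
# TTOI (not shown here) would then be run on P_emp to exploit the
# approximately low TT-rank structure of the population tensor P.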

Moreover, the proposed framework can also be applied to high-order Markov decision processes (high-order MDPs). MDPs have been commonly used as a baseline model in control theory and reinforcement learning [89], [90], [91], [92]. Despite the wide applications of MDPs, most of the existing work focuses on first-order Markov processes. However, high-order effects often appear, i.e., the transition probability at the current time depends not only on the current state and action, but also on the past (d − 1) states and actions. See Figure 13 for an example. Since the number of free parameters in such MDPs can be huge, sufficient dimension reduction for the state and action spaces can be a crucial first step. Similarly to the example of high-order Markov processes in Section V, TTOI can be applied to achieve better dimension reduction and state aggregation for high-order Markov decision processes.

Fig. 13. Illustration of a high-order state-aggregatable Markov decision process.

Fig. 3. A pictorial illustration of the TT-Backward update (Algorithm 1(b), d = 3).

Acknowledgments

The research of Yuchen Zhou and Anru R. Zhang was supported in part by NSF under Grants CAREER-1944904 and DMS-1811868 and by NIH under Grant R01 GM131399; the research of Yazhen Wang was supported in part by NSF under Grants DMS-1707605 and DMS-1913149. This work was done while Yuchen Zhou, Anru R. Zhang, and Lili Zheng were at the University of Wisconsin-Madison.

Biographies

Yuchen Zhou is a postdoctoral researcher in the Department of Statistics and Data Science, The Wharton School, University of Pennsylvania. He received the B.E. degree from Peking University in 2016 and the Ph.D. degree in statistics from the University of Wisconsin-Madison in 2021. His research interests include high-dimensional statistical inference, tensor data analysis, reinforcement learning and statistical learning theory.

Anru R. Zhang is the Eugene Anson Stead, Jr. M.D. Associate Professor in the Department of Biostatistics & Bioinformatics and Associate Professor in the Departments of Computer Science, Mathematics, and Statistical Science at Duke University. He was an assistant professor of statistics at the University of Wisconsin-Madison in 2015–2021. He obtained his bachelor's degree from Peking University in 2010 and his Ph.D. from the University of Pennsylvania in 2015. His work focuses on high-dimensional statistical inference, non-convex optimization, statistical tensor analysis, computational complexity, and applications in genomics, microbiome, electronic health records, and computational imaging. He received the ASA Gottfried E. Noether Junior Award (2021), a Bernoulli Society New Researcher Award (2021), an ICSA Outstanding Young Researcher Award (2021), and an NSF CAREER Award (2020).

Lili Zheng is a postdoctoral researcher in the Department of Electrical and Computer Engineering at Rice University. She received her bachelor’s degree from University of Science and Technology of China (USTC) in 2016 and her Ph.D. degree in statistics from University of Wisconsin - Madison in 2021. Her research interests span dependent data, high-dimensional statistics, network analysis, tensor modeling, stochastic algorithms, and non-convex optimization.

Yazhen Wang is Chair and Professor of Statistics at the University of Wisconsin-Madison. He obtained his Ph.D. in statistics from the University of California, Berkeley in 1992. He is a fellow of the Institute of Mathematical Statistics (IMS) and the American Statistical Association (ASA). He has served on numerous professional committees of the ASA, IMS, and ICSA, as an NSF program director, as an editor of Statistica Sinica and Statistics and Its Interface, and as an associate editor of various journals including the Annals of Statistics, the Annals of Applied Statistics, the Journal of the American Statistical Association, the Journal of Business and Economic Statistics, and Statistica Sinica. His research interests are financial econometrics, quantum computation, machine learning, high-dimensional statistics, nonparametric curve estimation, wavelets, change points, long-memory processes, and order-restricted inference.

Appendix A

Proofs

We collect all technical proofs of this paper in this section.

A. Proof of Theorem III.1

For convenience, let $\hat U_i$, $\hat V_i$, $R_i$, and $\tilde R_i$ denote $\hat U_i^{(0)}$, $\hat V_i^{(1)}$, $R_i^{(0)}$, and $\tilde R_i^{(0)}$, respectively. By Lemma III.1 and

Ip2pdP(V^dIp2pd1)(V^3Ip2)V^2=P(V^dIp2pd1)(V^3Ip2)V^2+P(V^dIp2pd1)(V^4Ip2p3)(V^3Ip2)++PV^dIp2pd1,

we have

X^(1)XF2=[[Y]1(V^dIp2pd1)(V^3Ip2)V^2]V^2(V^3Ip2)(V^dIp2pd1)[X]1F2=[Z]1P(V^dIp2pd1)(V^3Ip2)V^2+[X]1P(V^dIp2pd1)(V^3Ip2)V^2[X]1F2C([Z]1P(V^dIp2pd1)(V^3Ip2)V^2F2+[X]1P(V^dIp2pd1)(V^3Ip2)V^2F2+[X]1P(V^dIp2pd1)(V^4Ip2p3)(V^3Ip2)F2++[X]1PV^dIp2pd1F2)C([Z]1(V^dIp2pd1)(V^3Ip2)V^2F2+[X]1(V^dIp2pd1)(V^3Ip2)V^2F2+[X]1(V^dIp2pd1)(V^4Ip2p3)(V^3Ip2)F2++[X]1(V^dIp2pd1)F2). (24)

To prove (9), we only need to show that for all 2 ≤ k ≤ d,

[X]1(V^dIp2pd1)(V^k+1Ip2pk)(V^kIp2pk1)FCU^k1(Ipk1U^k2)(Ip2pk1U^1)[Z]k1(V^dIpkpd1)(V^k+1Ipk)F, (25)

where

[X]1(V^dIp2pd1)(V^k+1Ip2pk)(V^kIp2pk1)=[X]1(V^dIp2pd1)(V^3Ip2)V^2

if k = 2 and

[X]1(V^dIp2pd1)(V^k+1Ip2pk)(V^kIp2pk1)=[X]1(V^dIp2pd1)

if k = d.

By Lemma III.2, we have

[X]1(V^dIp2pd1)(V^k+1Ip2pk)(V^kIp2pk1)F=[A(p2pk1,p1)]([X]k1Ip2pk1)(V^dIp2pd1)(V^k+1Ip2pk)(V^kIp2pk1)F=[A(p2pk1,p1)](([X]k1(V^dIpkpd1)(V^k+1Ipk)V^k)Ip2pk1)F=[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF. (26)

The third equation holds since the realignment does not change the Frobenius norm.

Moreover, recall that U1p1×r1 is the left singular space of [X]1, and U˜jpjrj1×rj is the left singular space of (IpjU^j1)(Ipj1pjU^j2)(Ip2pjU^1)[X]j for 2 ≤ jd − 1, by Lemma III.2, for any 2 ≤ kd − 1,

[X]k=(Ip2pk[X]1)A(p2pk,pk+1pd)=(Ip2pkPU1[X]1)A(p2pk,pk+1pd)=(Ip2pkPU1)(Ip2pk[X]1)A(p2pk,pk+1pd)=(Ip2pkPU1)[X]k, (27)

and for any 2 ≤ j < k,

(IpjpkU^j1)(Ipj1pkU^j2)(Ip2pkU^1)[X]k=(IpjpkU^j1)(Ipj1pkU^j2)(Ip2pkU^1)(Ipj+1pk[X]j)A(pj+1pk,pk+1pd)=(Ipj+1pk[(IpjU^j1)(Ipj1pjU^j2)(Ip2pjU^1)[X]j])A(pj+1pk,pk+1pd)=(Ipj+1pk[PU˜j(IpjU^j1)(Ipj1pjU^j2)(Ip2pjU^1)[X]j])A(pj+1pk,pk+1pd)=(Ipj+1pkPU˜j)(IpjpkU^j1)(Ipj1pkU^j2)(Ip2pkU^1)(Ipj+1pk[X]j)A(pj+1pk,pk+1pd)=(Ipj+1pkPU˜j)(IpjpkU^j1)(Ipj1pkU^j2)(Ip2pkU^1)[X]k, (28)

where A(i,j) is defined in (5) for any i, j > 0. Therefore, by (27),

[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF=(Ip2pk1PU1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF=(Ip2pk1U1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF(Ip2pk1U^1)(Ip2pk1U1)(Ip2pk1U1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kFsmin1((Ip2pk1U^1)(Ip2pk1U1))=(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kFsmin1(U^1U1). (29)

The inequality holds since BFABFsmin1(A) for any invertible matrix Am1×m1 and Bm1×m2; in the last step, we used (Ip2pk1U1)(Ip2pk1U1)[X]k1=(Ip2pk1PU1)[X]k1=[X]k1. Similarly to (29), by (28), for 1 ≤ jk − 2,

(Ipj+1pk1U^j)(Ip2pk1U^1)[X]k1·(V^dIpkpd1)(V^k+1Ipk)V^kF=(Ipj+2pk1PU˜j+1)(Ipj+1pk1U^j)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF=(Ipj+2pk1U˜j+1)(Ipj+1pk1U^j)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF(Ipj+2pk1U^j+1)(Ipj+1pk1U^j)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF·smin1(U^j+1U˜j+1). (30)

By (29) and (30),

[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kFsmin1(U^1U1)(Ip3pk1U^2)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kFsmin1(U1U^1)smin1(U˜2U^2)U^k1(Ipk1U^k2)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kFsmin1(U1U^1)smin1(U˜2U^2)smin1(U˜k1U^k1)U^k1(Ipk1U^k2)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF(11c02)k1CU^k1(Ipk1U^k2)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF. (31)

By the definition of V^k(pkrk)×rk1 and Lemma III.3, we know that V^k is the right singular space of

U^k1(Ipk1U^k2)(Ip2pk1U^1)[Y]k1(V^dIpkpd1)(V^k+1Ipk)=U^k1(Ipk1U^k2)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)+U^k1(Ipk1U^k2)(Ip2pk1U^1)[Z]k1(V^dIpkpd1)(V^k+1Ipk),

Lemma A.3 shows that

U^k1(Ipk1U^k2)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF2U^k1(Ipk1U^k2)(Ip2pk1U^1)[Z]k1·(V^dIpkpd1)(V^k+1Ipk)F. (32)

Combining (26), (31), and (32), we know that (25) holds for all 2 ≤ k ≤ d, which finishes the proof of Theorem III.1.

B. Proof of Theorem III.2

For i ≥ 1, by the definition of X(2i) and Lemma III.1, we have

YX^(2i)F2=(Ip1pd1P(Ip2pd1U^1(2i))(Ipd1U^d2(2i))U^d1(2i))·[Y]d1F2=[Y]d1F2P(Ip2pd1U^1(2i))(Ipd1U^d2(2i))U^d1(2i)[Y]d1F2=YF2X^(2i)F2.

Similarly, we have

\|Y-\hat{X}^{(2i-1)}\|_{\mathrm{F}}^2=\|Y\|_{\mathrm{F}}^2-\|\hat{X}^{(2i-1)}\|_{\mathrm{F}}^2.
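
Both identities above rely on the Pythagorean relation for orthogonal projections: if $\hat X$ is an orthogonal projection of $Y$, then $\|Y-\hat X\|_{\mathrm F}^2=\|Y\|_{\mathrm F}^2-\|\hat X\|_{\mathrm F}^2$. A quick numerical sanity check, with random inputs chosen only for illustration, is:

import numpy as np

rng = np.random.default_rng(3)
m, n, r = 30, 20, 5
U, _ = np.linalg.qr(rng.normal(size=(m, r)))   # orthonormal basis of an r-dimensional subspace
Y = rng.normal(size=(m, n))
X_hat = U @ (U.T @ Y)                          # orthogonal projection of Y onto that subspace

lhs = np.linalg.norm(Y - X_hat) ** 2
rhs = np.linalg.norm(Y) ** 2 - np.linalg.norm(X_hat) ** 2
print(np.isclose(lhs, rhs))                    # True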

In addition, we have

YX^(2i)F2=[Y]d1F2P(Ip2pd1U^1(2i))(Ipd1U^d2(2i))U^d1(2i)[Y]d1F2=[Y]d1F2U^d1(2i)(Ipd1U^d2(2i))(Ip2pd1U^1(2i))[Y]d1F2=[Y]1F2U^d1(2i)(Ipd1U^d2(2i))(Ip2pd1U^1(2i))[Y]d1V^d(2i1)F2U^d1(2i)(Ipd1U^d2(2i))(Ip2pd1U^1(2i))·[Y]d1V^d(2i1)F2[Y]1F2U^d1(2i)(Ipd1U^d2(2i))(Ip2pd1U^1(2i))[Y]d1V^d(2i1)F2=[Y]1F2(Ipd1U^d2(2i))(Ipd2pd1U^d3(2i))(Ip2pd1U^1(2i))[Y]d1V^d(2i1)F2.

The last equation holds since $\hat U_{d-1}^{(2i)}$ is the left singular space of (Ipd1U^d2(2i))(Ipd2pd1U^d3(2i))(Ip2pd1U^1(2i))[Y]d1V^d(2i1). For any $B\in\mathbb{R}^{n\times r}$ and 1 ≤ l ≤ r, we can check that the l-th columns of $A^{(m,n)}B$ and $(I_m\otimes B\otimes I_m)A^{(m,r)}$ are equal:

(A(m,n)B)[:,l]=j=1nBj,lk=1me(k1)mn+(j1)m+k(m2n)=((ImBIm)A(m,r))[:,l]

where $e^{(m^2n)}_{(k-1)mn+(j-1)m+k}$ is the ((k−1)mn+(j−1)m+k)-th canonical basis vector of $\mathbb{R}^{m^2n}$ and $A^{(i,j)}$ is defined in (5). Therefore,

A^{(m,n)}B=(I_m\otimes B\otimes I_m)A^{(m,r)}.

By the last equation and Lemma III.2, we have

(Ipd1U^d2(2i))(Ipd2pd1U^d3(2i))(Ip2pd1U^1(2i))[Y]d1V^d(2i1)=(Ipd1U^d2(2i))(Ipd2pd1U^d3(2i))(Ip2pd1U^1(2i))(Ipd1[Y]d2)A(pd1,pd)V^d(2i1)=(Ipd1(U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2))(Ipd1(V^d(2i1)Ipd1))A(pd1,rd1)=(Ipd1(U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)))A(pd1,rd1)=Reshape(U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1),rd2pd1,rd1).

Since the realignment does not change the Frobenius norm, we have

YX^(2i)F2[Y]1F2U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)F2. (33)

By a proof similar to that of (33), we have

YX^(2i)F2[Y]1F2     U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)F2=[Y]1F2U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)V^d1(2i1)F2U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)V^d1(2i1)F2[Y]1F2U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)V^d1(2i1)F2=[Y]1F2(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)V^d1(2i1)F2[Y]1F2[Y]1(V^d(2i1)Ip2pd1)(V^3(2i1)Ip2)V^2(2i1)F2=[Y]1(Ip2pdP(V^d(2i1)Ip2pd1)(V^3(2i1)Ip2)V^2(2i1))F2=YX^(2i1)F2.

Similarly, we can prove (11) holds for k = 2i, i ≥ 0.

C. Proof of Theorem IV.1

Without loss of generality, we assume σ² = 1. We still let $\hat U_i$, $\hat V_i$, $R_i$, and $\tilde R_i$ denote $\hat U_i^{(0)}$, $\hat V_i^{(1)}$, $R_i^{(0)}$, and $\tilde R_i^{(0)}$, respectively.

Lemma A.2 Part 4 immediately shows that (15) holds with probability at least $1-Ce^{-cp}$. Next, we show that with probability at least $1-Ce^{-cp}$,

\left\|\sin\Theta(\hat U_k,\tilde U_k)\right\|\le \frac{C\sqrt{\sum_{i=1}^{k-1}p_ir_{i-1}r_i+p_kr_{k-1}+p_{k+1}\cdots p_d}}{\lambda_k}\wedge\frac{1}{2},\qquad 1\le k\le d-1. \quad (34)

Recall that

\hat U_1=\mathrm{SVD}_{r_1}^{L}([Y]_1),\qquad [Y]_1=[X]_1+[Z]_1,

where [X]1p1×p1 satisfying rank([X]1)=r1,[Z]1p1×p1, by Lemmas A.3 and A.2, with probability 1−Cecp, we have

U^1[X]12[Z]1C(p11/2+(p2pd)1/2).

Therefore, with probability at least 1 − Cecp,

sinΘ(U^1,U1)U^1U1U1[X]1sr1(U1[X]1)=U^1[X]1sr1([X]1)Cp1+p2pdλ1.

For 2 ≤ i ≤ j ≤ d − 1, by the definition of $\tilde U_i$ and Lemma III.2, we have

[X]j=(Ip2pj[X]1)A(p2pj,pj+1pd)=(Ip2pj(PU1[X]1))A(p2pj,pj+1pd)=(Ip2pjPU1)(Ip2pj[X]1)A(p2pj,pj+1pd)=(Ip2pjU1)(Ip2pjU1)[X]j (35)

and

(IpipjU^i1)(Ip2pjU^1)[X]j=(Ipt+1pj(IptU^i1))(Ipt+1pj(Ip2ptU^1))(Ipi+1pj[X]i)A(pt+1pj,pj+1pd)=(Ipi+1pj((IpiU^i1)(Ip2piU^1)[X]i))A(pi+1pj,pj+1pd)=(Ipi+1pj(PU˜i(IpiU^i1)(Ip2piU^1)[X]i))A(pi+1pj,pj+1pd)=(Ipi+1pjPU˜i)(Ipi+1pj((IpiU^i1)(Ip2piU^1)[X]i))A(pi+1pj,pj+1pd)=(Ipi+1pjU˜i)(Ipi+1pjU˜i)(IpipjU^i1)(Ip2pjU^1)[X]j, (36)

where $I_{p_{i+1}\cdots p_j}=1$ if i = j. Let

Lk=sinΘ(U˜k,U^k),     2kd1.

For k = 2, by (35) and Lemma A.1, with probability at least $1-Ce^{-cp}$,

sr2((Ip2U^1)[X]2)smin((Ip2U^1)(Ip2U1))sr2([X]2)=smin(U^1U1)λ2=1sinΘ(U^1,U1)2λ234λ2.

Since U^2=SVDr2L((Ip2U^1)[Y]2), and (Ip2U^1)[Y]2=(Ip2U^1)[X]2+(Ip2U^1)[Z]2, by Lemma A.3 and Lemma A.1, we know that with probability at least 1 − Cecpr,

U^2(Ip2U^1)[X]22(Ip2U^1)[Z]2C(p2r1+(p3pd)1/2+p1r1).

Combining the two previous inequalities and recalling that $\tilde U_2$ is the left singular space of $(I_{p_2}\otimes\hat U_1)^\top[X]_2$, we have

sinΘ(U^2,U˜2)U^2U˜2U˜2(Ip2U^1)[X]2sr2(U˜2(Ip2U^1)[X]2)=U^2(Ip2U^1)[X]2sr2((Ip2U^1)[X]2)Cp1r1+p2r1+(p3pd)1/2λ2

with probability at least $1-Ce^{-cp}$.

Assume that (34) holds for k ≤ j − 1 with probability $1-Ce^{-cp}$. For k = j, by Lemma A.1 and (36), with probability at least $1-Ce^{-cp}$, we have

srj((IpjU^j1)(Ipj1pjU^j2)(Ip2pj1pjU^1)[X]j)smin((IpjU^j1)(IpjU˜j1))srj((Ipj1pjU^j2)(Ip2pj1pjU^1)[X]j)=smin(U^j1U˜j1)srj((Ipj1pjU^j2)(Ip2pj1pjU^1)[X]j)smin(U^j1U˜j1)smin((Ipj1pjU^j2)(Ipj1pjU˜j2))srj((Ipj2pj1pjU^j3)(Ip2pj1pjU^1)[X]j)smin(U^j1U˜j1)smin(U^1U˜1)srj([X]j)=1Lj121L12λj(3/4)j1λjcλj. (37)

In the last inequality, we used the fact that d is a fixed number and $(3/4)^{j-1}\ge(3/4)^{d-1}\ge c$.

By the definition of U^j and Lemma III.3, we have

U^j=SVDrjL((IpjU^j1)(Ipj1pjU^j2)(Ip2pj1pjU^1)[Y]j).

Note that

(IpjU^j1)(Ipj1pjU^j2)(Ip2pjU^1)[Y]j=(IpjU^j1)(Ipj1pjU^j2)(Ip2pjU^1)[X]j+(IpjU^j1)(Ipj1pjU^j2)(Ip2pjU^1)[Z]j,

by Lemma A.3, with probability at least 1ecpr2,

U^j(IpjU^j1)(Ipj1pjU^j2)(Ip2pj1pjU^1)[X]j2(IpjU^j1)(Ipj1pjU^j2)(Ip2pj1pjU^1)[Z]jC[(i=1j1piri1ri)1/2+(pjrj1)1/2+(pj+1pd)1/2].

Therefore, with probability at least $1-Ce^{-cp}$,

sinΘ(U^j,U˜j)U^jU˜jU˜j(IpjU^j1)(Ip2pjU^1)[X]jsrj(U˜j(IpjU^j1)(Ip2pjU^1)[X]j)=U^j(IpjU^j1)(Ip2pjU^1)[X]jsrj((IpjU^j1)(Ip2pjU^1)[X]j)C(i=1j1piri1ri)1/2+(pjrj1)1/2+(pj+1pd)1/2λj.

Therefore, (13) holds with probability $1-Ce^{-cp}$.

Finally, we consider (14). Let E0={(13) and (15) hold}. Without loss of generality, we only show that under E0,

\left\|\sin\Theta(\hat V_k,\tilde V_k)\right\|\le \frac{C\sqrt{\sum_{i=1}^{d}p_ir_{i-1}r_i}}{\lambda_{k-1}}\wedge\frac{1}{2},\qquad 2\le k\le d. \quad (38)

In fact, (38) can be proved by induction. Let Vdpd×rd1 be the right singular space of [X]d1. Then there exists an orthogonal matrix Q˜d1Ord1 such that

VdQ˜d1=SVDR(U^d1(Ipd1U^d2)(Ipd1p2U^1)[X]d1).

Similarly to (37), under E0,

srd1(U^d1(Ipd1U^d2)(Ipd1p2U^1)[X]d1)(3/4)d1λd1cλd1.

Therefore, by Lemma A.3, under E0,

sinΘ(V^d,Vd)=sinΘ(V^d,VdQ˜d1)U^d1(Ipd1U^d2)(Ipd1p2U^1)[X]d1V^dsrd1(U^d1(Ipd1U^d2)(Ipd1p2U^1)[X]d1)2U^d1(Ipd1U^d2)(Ipd1p2U^1)[Z]d1srd1(U^d1(Ipd1U^d2)(Ipd1p2U^1)[X]d1)Ci=1dpiri1riλd1.

Suppose (38) holds for j + 1 ≤ k ≤ d. For k = j, since V^j is the right singular space of [X]j1(V^dIpjpd1)(V^j+1Ipj), there exists $\tilde Q_{j-1}\in\mathbb{O}_{r_{j-1}}$ such that

V˜jQ˜j1=SVDR(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+1Ipj)).

By Lemma A.1, (35), (36) and (37), under E0

srj1(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+1Ipj))srj1(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+2Ipjpj+1)(V˜j+1Ipj))smin((V˜j+1Ipj)(V^j+1Ipj))=srj1(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+2Ipjpj+1))smin(V˜j+1V^j+1)srj1(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1)smin(V˜dV^d)smin(V˜j+1V^j+1)smin(U^j1U˜j1)srj1((Ipj1U^j2)(Ip2pj1U^1)[X]j1)smin(V˜dV^d)smin(V˜j+1V^j+1)(34)j1λj1(34)djcλj1.

Note that V^jOpjrj,rj1 is the right singular space of U^j1(Ipj1U^j2)(Ip2pj1U^1)[Y]j1(V^dIpjpd1)(V^j+1Ipj) and

U^j1(Ipj1U^j2)(Ip2pj1U^1)[Y]j1(V^dIpjpd1)(V^j+1Ipj)=U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+1Ipj)+U^j1(Ipj1U^j2)(Ip2pj1U^1)[Z]j1(V^dIpjpd1)(V^j+1Ipj),

By Lemma A.3, under E0,

sinΘ(V^j,V˜j)=sinΘ(V^j,V˜jQ˜j1)U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+1Ipj)V^j/srj1(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+1Ipj))2U^j1(Ipj1U^j2)(Ip2pj1U^1)[Z]j1(V^dIpjpd1)(V^j+1Ipj)/srj1(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+1Ipj))C(i=1dpiriri1)1/2λj1.

Therefore, under E0, (38) holds.

Thus, we have finished the proof of Theorem IV.1.

D. Proof of Corollary IV.1

Let $Q=\{(15)\text{ and }(34)\text{ hold}\}$; then $\mathbb{P}(Q^c)\le C\exp(-cp)$ and

\|\hat X^{(t)}-X\|_{\mathrm F}^2\le C\sum_{i=1}^{d}p_ir_ir_{i-1}\qquad\text{under }Q.

Under Qc, due to the property of projection matrices, we know that

\|\hat X^{(t)}\|_{\mathrm F}\le\|Y\|_{\mathrm F}\le\|X\|_{\mathrm F}+\|Z\|_{\mathrm F}.

Moreover,

\mathbb{E}\|\hat X^{(t)}-X\|_{\mathrm F}^4\le C\big(\mathbb{E}\|\hat X^{(t)}\|_{\mathrm F}^4+\|X\|_{\mathrm F}^4\big)\le C\|X\|_{\mathrm F}^4+C\mathbb{E}\|Z\|_{\mathrm F}^4\le C\exp(4c_0p)+C\mathbb{E}\big(\chi^2_{p_1\cdots p_d}\big)^2\le C\exp(4c_0p)+C(p_1\cdots p_d)^2\le C\exp(4c_0p)+C\exp(2c_0p)\le C\exp(4c_0p).

Therefore, we have the following upper bound for the Frobenius norm risk of X^:

\mathbb{E}\|\hat X^{(t)}-X\|_{\mathrm F}^2=\mathbb{E}\|\hat X^{(t)}-X\|_{\mathrm F}^2 1_{Q}+\mathbb{E}\|\hat X^{(t)}-X\|_{\mathrm F}^2 1_{Q^c}\le C\sum_{i=1}^{d}p_ir_ir_{i-1}+\sqrt{\mathbb{E}\|\hat X^{(t)}-X\|_{\mathrm F}^4\,\mathbb{P}(Q^c)}\le C\sum_{i=1}^{d}p_ir_ir_{i-1}+C\exp((4c_0-c)p/2).

By selecting c0 < c/4, we have

\mathbb{E}\|\hat X^{(t)}-X\|_{\mathrm F}^2\le C\sum_{i=1}^{d}p_ir_ir_{i-1}.

Therefore, we have finished the proof of Corollary IV.1.

E. Proof of Theorem IV.2

Since the i.i.d. Gaussian distribution, Z~N(0,σ2), is a special case of D and

infX^supXFp,r(λ),DDEZ~DX^XF2infX^supXFp,r(λ),Z~i.i.d.N(0,σ2)EZ~DX^XF2,

we only need to focus on the setting where $Z\overset{\mathrm{i.i.d.}}{\sim}N(0,\sigma^2)$ while developing the lower bound result.

Without loss of generality, assume σ² = 1. Since d is a fixed number, we only need to show that for any 1 ≤ i ≤ d,

\inf_{\hat X}\sup_{X\in\mathcal F_{p,r}(\lambda)}\mathbb{E}\|\hat X-X\|_{\mathrm F}^2\ge cp_ir_ir_{i-1}. \quad (39)

Suppose X can be written as (1), where $U_j\in\mathbb{R}^{(p_jr_{j-1})\times r_j}$ and $V_j\in\mathbb{R}^{(p_jr_j)\times r_{j-1}}$ are reshaped from $G_j\in\mathbb{R}^{r_{j-1}\times p_j\times r_j}$, $G_1=U_1$, and $G_d=V_d$. For any 1 ≤ i ≤ d − 1, by Lemma III.1, we have

[X]i=(Ip2piU1)(IpiUi1)UiVi+1(Vi+2Ipi+1)(VdIpi+1pd1). (40)

For all j ≠ i, 1 ≤ j ≤ d − 1, let $U_j\overset{\mathrm{i.i.d.}}{\sim}N(0,1)$ and $V_d\overset{\mathrm{i.i.d.}}{\sim}N(0,1)$, and let U_1, …, U_{i−1}, U_{i+1}, …, U_{d−1}, V_d be independent. By Lemma A.1, for any 1 ≤ j ≤ d − 1, we have

srj((Ip2pjU1)(IpjUj1)Uj)smin(Ip2pjU1)smin(Uj)=sr1(U1)srj(Uj).

Similarly,

srj(Vj+1(Vj+2Ipj+1)(VdIpj+1pd1))srj(Vj+1)srd1(Vd).

Moreover, Lemma A.1 Part 1 tells us

srj((Ip2pjU1)(IpjUj1)UjVj+1(Vj+2Ipj+1)(VdIpj+1pd1))srj((Ip2pjU1)(IpjUj1)Uj)srj(Vj+1(Vj+2Ipj+1)(VdIpj+1pd1))sr1(U1)srj(Uj)srj(Vj+1)srd1(Vd). (41)

Recall that $V_j$ is reshaped from $U_j$ for all 1 ≤ j ≤ d − 1. By [93, Corollary 5.35], we know that with probability at least $1-Ce^{-cp}$, for all 1 ≤ j ≤ d − 1, j ≠ i,

pjrj14pjrj1rjpjrj125srj(Uj)s1(Uj)pjrj1+rj+pjrj1252pjrj1,pjrj4srj1(Vj)s1(Vj)2pjrj, and      pd4srd1(Vd)sr1(Vd)2pd. (42)

For a fixed U0Opiri1,ri, define the following ball with radius ε > 0,

B(U0,ε)={UOpiri1,ri:sinΘ(U,U0)Fε}.

By Lemma 1 in [94], for 0 < α < 1 and 0 < ε ≤ 1, there exist U˜i(1),,U˜i(m)B(U0,ε) such that

m(c0α)ri(piri1ri),min1jkmsinΘ(U˜i(j),U~i(k))Fαε.

By Lemma 1 in [37], one can find a rotation matrix OkOri such that

U0U˜i(k)OkF2sinΘ(U0,U˜i(k))F2ε.

Let U˜i(k)=U˜i(k)Ok, we have

U˜i(k)U0F2ε,sinΘ(U˜i(j),U˜i(k))Fαε,     1j<km.

Let Ui(k)=S+U˜i(k), where S~i.i.d.N(0,τ2). Set τ8/pi, [93][Corollary 5.35] shows that with probability at least 1 − Cecp,

τpiri18τ(piri1ripiri125)1sri(S)s1(U˜i(k))sri(Ui(k))s1(Ui(k))s1(S)+s1(U˜i(k))τ(piri1+ri+piri125)+12τpiri1. (43)

If 2 ≤ id − 1, since Vi(k) is reshaped from Ui(k), we know that Vi(k)=T+V˜i(k), where T~i.i.d.N(0,τ2), and V˜i(k) is realigned from U˜i(k). Notice that

s1(V˜i(k))=V˜i(k)V˜i(k)F=U˜i(k)F=ri,

Since τ8/pi, by [93][Corollary 5.35], with probability at least 1Cecpiri,

τpiri8τ(piriri1piri25)risri(T)s1(V˜i(k))sri(Vi(k))s1(Vi(k))s1(T)+s1(V˜i(k))τ(piri+ri1+piri25)+ri2τpiri. (44)

Choose fixed U1, …, Ui−1, Vi+1, ⋯, Vd, S such that (42), (43) and (44) hold. Let

[X(k)]i=(Ip2piU1)(IpiUi1)Ui(k)Vi+1(Vi+2Ipi+1)(VdIpi+1pd1) (45)

and $X^{(k)}\in\mathbb{R}^{p_1\times\cdots\times p_d}$ is the corresponding tensor. Then (41), (42), (43), and (44) together show that

σrj([X(k)]j)τk=1jpkrk18k=j+1dpkrk8=τp1pdr1rd1Crj (46)

By setting τ=C max1id1λimax1jd1rjp1pdr1rd1  8 max1id11/pi, we have

σrj([X(k)]j)λj,     1jd1

For 1 ≤ k < j ≤ m,

X(k)X(j)F2=(Ip2piU1)(IpiUi1)(Ui(k)Ui(j))Vi+1(Vi+2Ipi+1)(VdIpi+1pd1)F2smin2((Ip2piU1)(IpiUi1))·(Ui(k)Ui(j))Vi+1(Vi+2Ipi+1)(VdIpi+1pd1)F2=sri12((Ip2pi1U1)Ui1)sri2(Vi+1(Vi+2Ipi+1)(VdIpi+1pd1))·Ui(k)Ui(j)F2=sri12((Ip2pi1U1)Ui1)sri2(Vi+1(Vi+2Ipi+1)(VdIpi+1pd1))·U˜i(k)U˜i(j)F2sr12(U1)sri12(Ui1)sri2(Vi+1)srd12(Vd)minOOriU˜i(k)U˜i(j)OF2h=1i1phrh116l=i+1dplrl16minOOriU˜i(k)U˜i(j)OF2h=1i1phrh116l=i+1dplrl16sin Θ(U˜i(k),U˜i(j))F2c(h=1i1phTh1l=i+1dplrl)α2ε2.

In addition, let $Y^{(k)}=X^{(k)}+Z^{(k)}$, where $Z^{(k)}\overset{\mathrm{i.i.d.}}{\sim}N(0,1)$. The KL divergence between the distributions of $Y^{(k)}$ and $Y^{(j)}$ is

DKL(Y(k)Y(j))=12X(k)X(j)F2=12(Ip2ptU1)(IpiUi1)(Ui(k)Ui(j))Vi+1(Vi+2Ipi+1)(VdIpi+1pd1)F212(Ip2ptU1)(IptUi1)2Vi+1(Vi+2Ipi+1)(VdIpi+1pd1)2Ui(k)Ui(j)F212s12(U1)s12(Ui1)s12(Vi+1)s12(Vd)Ui(k)Ui(j)F212h=1i1(4phrh1)l=i+1d(4plrl)(Ui(k)U0F+Ui(k)U0F)2C(h=1i1(phrh1)l=i+1d(plrl))ε2.

By generalized Fano’s Lemma,

infX^supX{X(k)}k=1mEX^XFch=1i1phrh1l=i+1dplrlαε(1C(h=1i1(phrh1)l=i+1d(plrl))ε2+log2ri(piri1ri)log(c0/α)).

By setting ε=cri(piri1ri)Ch=1i1(phrh1)l=i+1d(plrl)12, α = (c_0 ∧ 1)/8, we know that for any 1 ≤ i ≤ d − 1,

infX^supXFp,r(λ)EX^XF2(infX^supX{X(k)}k=1mEX^XF)2c1ripiri1.

For i = d, similarly to the case i = 1, we have

infX^supXFp,r(λ)EX^XF2c1pdrd1.

Therefore, we have proved Theorem IV.2.

F. Proof of Proposition V.1

Define G˜1p×r1, G˜krk1×p×rk, G˜dp×rd1 such that

G˜1,[i,l]=(G1(i))l,     i[p],l[r1],G˜k,[j,i,l]=(Gk(i,ej(rk1)))l,     i[p],j[rk1],l[rk],2kd1,G˜d,[i,l]=Gd(i,el(rd1)),i[p],l[rd1]

where ei(k) is the i-th canonical basis of k. Then

P˜1(Xt+1)=G˜1,[Xt+1,:]r1,P˜2(Xt+1,Xt+2)=G2(Xt+2,P˜1(Xt+1))=linear mapj=1r1G2(Xt+2,ej(r1))(P˜1(Xt+1))j=(G˜1,[Xt+1,:] G˜2,[:,Xt+2,:]).

By induction, for any 2 ≤ k ≤ d − 1,

P˜k(Xt+1,,Xt+k)=Gk(Xt+k,P˜k1(Xt+1,,Xt+k1)) =linear mapj=1rk1Gk(Xt+k,ej(rk1))(P˜k1(Xt+1,,Xt+k1))j=G˜k,[:,Xt+k,:]P˜k1(Xt+1,,Xt+k1)=(G˜1,[Xt+1,:] G˜2,[:,Xt+2,:]G˜k,[:,Xt+k,:])

and

(Xt+dXt+1,,Xt+d1)=Gd(Xt+d,P˜d1(Xt+1,,Xt+d1))=P˜d1(Xt+1,,Xt+d1)G˜d,[Xt+d,:]=G˜1,[Xt+1,:] G˜2,[:,Xt+2,:]G˜d1,[:,Xt+d1,:]G˜d,[Xt+d,:].

Therefore,

P=[\![\tilde G_1,\tilde G_2,\ldots,\tilde G_{d-1},\tilde G_d]\!]

and has TT-rank (r1, …, rd−1).
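
The factorization above can be evaluated core by core: the conditional probability is the product of the slices $\tilde G_{1,[X_{t+1},:]}$, $\tilde G_{2,[:,X_{t+2},:]}$, …, $\tilde G_{d,[X_{t+d},:]}$. The following sketch, with random placeholder cores and boundary ranks set to one so that all cores can be handled uniformly in the $r_{k-1}\times p\times r_k$ convention above, illustrates this TT evaluation; it is not the authors' code.

import numpy as np

rng = np.random.default_rng(4)
p, ranks = 5, [1, 3, 4, 1]                     # d = 3 cores with TT-ranks (r_0, r_1, r_2, r_3)
cores = [rng.random(size=(ranks[k], p, ranks[k + 1])) for k in range(3)]

def tt_entry(cores, index):
    # Multiply the core slices G_1[x_1, :], G_2[:, x_2, :], ..., G_d[x_d, :] from left to right.
    vec = np.ones((1, 1))
    for G, x in zip(cores, index):
        vec = vec @ G[:, x, :]
    return float(vec[0, 0])

value = tt_entry(cores, (0, 2, 4))             # one entry of the order-3 TT tensor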

G. Proof of Proposition V.2

Let $Z=\hat P^{\mathrm{emp}}-P$; then $\mathbb{E}Z=0$. Let

Ti1,,id(k)=1{X(i1,,id1;k)=id},     1kn;1i1,,idp

and

Zi1,,id(k)=Ti1,,id(k)(idi1,,id1),1kn;1i1,,idp.

Then EZ(k)=0. Moreover, by definition, for any 1 ≤ jd − 1, the rows of [Z(k)]jpj×pdj are independent, and there exists a partition {Ω1(j),,Ωpdj1(j)} of {1, …, pdj} satisfying |Ω1(j)|==|Ωpdj1(j)|p=p, such that ([Z(k)]j)[:,Ω1(j)],,([Z(k)]j)[:,Ωpdj1(j)] are independent and

lΩi(j)([T(k)]j)m,l=1,     1mpj,1kn.

Therefore,

lΩi(j)|([Z(k)]j)m,l|lΩi(j)([T(k)]j)m,l+ElΩi(j)([T(k)]j)m,l=2,     1mpj,1kn.

For any fixed x1pj and x2pdj satisfying ∥x12 = 1 and ∥x2 = 1, we have

|lΩi(j)([Z(k)]j)m,l(x2)l|maxlΩi(j)(x2)llΩi(j)|([Z(k)]j)m,l|2maxlΩi(j)(x2)l2(x2)Ωi(j)2.

By [95, Exercise 2.4], lΩi(j)([Z(k)]j)m,l(x2)l is 2(x2)Ωi(j)2-sub-Gaussian. Therefore,

x1[Z(k)]jx2=m=1pj(x1)mi=1pdj1(lΩi(j)([Z(k)]j)m,l(x2)l)

is (m=1pj(x1)m2i=1pdj14(x2)Ωi(j)22)1/2=2x12x22=2-subGaussian. Notice that Z=1nk=1nZ(k), the Hoeffding bound [95, Proposition 2.5] shows that

(|x1[Z]jx2|t)2 exp(nt28),     t0.

Therefore, for any fixed UOpj,rj, VOpdj,prj+1, xrj, yprj+1 with ∥x2 = 1 and ∥y2 = 1,

(|xU[Z]jVy|t)2 exp(nt28),     t0.

Similarly to the proof of (49), with probability at least $1-Ce^{-cp}$, for all 1 ≤ k ≤ d − 1,

U^k(0)(IpU^k1(0))(Ipk1U^1(0))[Z]k(V^d(1)Ipdk1)(V^k+2(1)Ip)Ci=1dpiriri1n.

Similarly, with probability at least $1-Ce^{-cp}$,

[Z]1(V^d(1)Ipd2)(V^3(1)Ip)V^2(1)Ci=1dpiriri1n.

Noticing that $\|X\|_{\mathrm F}\le\sqrt{r}\|X\|$ if rank(X) = r, by the previous two inequalities and Theorem III.1, we know that with probability at least $1-Ce^{-cp}$,

P^(1)PF2C(max1id1ri)i=1dpiriri1n.

Finally, by the definition of P^, we have

\|\hat P-P\|_{\mathrm F}\le\|\hat P^{(1)}-P\|_{\mathrm F}+\|\hat P^{(1)}-\hat P\|_{\mathrm F}\le 2\|\hat P^{(1)}-P\|_{\mathrm F},

which finishes the proof of Proposition V.2.

H. Proof of Lemma III.3

By symmetry, we only need to prove (6). By definition, (6) holds for k = 1. Suppose it holds for k = j. For k = j + 1, since $S_{j+1}\in\mathbb{R}^{(r_jp_{j+1})\times(p_{j+2}\cdots p_d)}$ is realigned from $\tilde S_j=M_jS_j\in\mathbb{R}^{r_j\times(p_{j+1}\cdots p_d)}$, Lemma III.2 implies that $S_{j+1}=(I_{p_{j+1}}\otimes\tilde S_j)A^{(p_{j+1},p_{j+2}\cdots p_d)}$, where the realignment matrix $A^{(i,j)}$ is defined in (5). Therefore,

Sj+1=(Ipj+1 S˜j)A(pj+1,pj+2pd)=(Ipj+1MjSj)A(pj+1,pj+2pd)=(Ipj+1Mj)(Ipj+1Sj)A(pj+1,pj+2pd)=(Ipj+1Mj)(Ipj+1((IpjMj1)(Ip2pjM1)[T]j))A(pj+1,pj+2pd)=(Ipj+1Mj)(Ipj+1(IpjMj1))(Ipj+1(Ip2pjM1))(Ipj+1[T]j)A(pj+1,pj+2pd)=(Ipj+1Mj)(Ipjpj+1Mj1)(Ip2pj+1M1)[T]j+1.

The third equation and the fifth equation hold since (AB)(CD) = (AC) ⊗ (BD); the last equation holds since Yj+1=(Ipj+1Yj)A(pj+1,pj+2pd) and A ⊗ (BC) = (AB) ⊗ C.

Also, noticing that $\tilde S_k=M_kS_k$, we have finished the proof of (6).
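
The mixed-product property (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD) invoked above can be checked numerically; the matrix sizes below are arbitrary and chosen only for illustration.

import numpy as np

rng = np.random.default_rng(5)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(2, 5))
C, D = rng.normal(size=(4, 6)), rng.normal(size=(5, 2))

lhs = np.kron(A, B) @ np.kron(C, D)
rhs = np.kron(A @ C, B @ D)
print(np.allclose(lhs, rhs))                   # True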

I. Technical Lemmas

We collect the additional technical lemmas in this section.

Lemma A.1.

  1. Suppose $A\in\mathbb{R}^{m_1\times m_2}$, $B\in\mathbb{R}^{m_2\times m_3}$, where $m_1\ge m_2$. Then
    $s_{\min\{m_2,m_3\}}(AB)\ge s_{m_2}(A)\,s_{\min\{m_2,m_3\}}(B).$
  2. Suppose Am×p1, Bn×p2, Xp1×p2, rank(X) = r, p1m, p2n. If X=U1MV1, where U1Op1,m, and V1Op2,n, then
    σr(AXB)smin(AU1)σr(X)smin(V1B).

Proof of Lemma A.1. (1) Consider the SVD decomposition A=UAΣAVA, B=UBΣBVB, where UAOm1,m2, VAOm2, UBOm2,min{m2,m3}, VBOmin{m2,m3},m3, ΣA=diag(σ1(A),,sm2(A)) and ΣB=diag(s1(B),,smin{m2,m3}(B)) are diagonal matrices with nonnegative diagonal entries. Then

smin{m2,m3}(AB)=smin{m2,m3}(UAΣAVAUBΣBVB)=smin{m2,m3}(ΣAVAUBΣB).

For any xmin{m2,m3} satisfying ∥x2 = 1, we have

ΣAVAUBΣBx2sm2(A)VAUBΣBx2=sm2(A)ΣBx2sm2(A)smin{m2,m3}(B).

Therefore

smin{m2,m3}(AB)=smin{m2,m3}(ΣAVAUBΣB)sm2(A)smin{m2,m3}(B).

(2) Consider the SVD decomposition X = UΣV, where UOp1,r, VOp2,r and Σ is a diagonal matrix. Then we know that there exist two matrices Lm×r and Rn×r satisfying U = U1L and V = V1R. Moreover,

LL=LU1U1L=UU=Ir,RR=RV1V1R=VV=Ir.

Therefore,

\sigma_r(AXB)=\sigma_r(AU_1L\Sigma R^\top V_1^\top B)\ge s_{\min}(AU_1)\,\sigma_r(L\Sigma R^\top)\,s_{\min}(V_1^\top B)=s_{\min}(AU_1)\,\sigma_r(X)\,s_{\min}(V_1^\top B).
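
Part 1 of Lemma A.1 can also be checked numerically on random instances; the Gaussian test matrices below are an illustrative assumption and the check is not a substitute for the proof above.

import numpy as np

rng = np.random.default_rng(6)
m1, m2, m3 = 8, 5, 6
for _ in range(100):
    A = rng.normal(size=(m1, m2))              # m1 >= m2
    B = rng.normal(size=(m2, m3))
    s_AB = np.linalg.svd(A @ B, compute_uv=False)
    s_A = np.linalg.svd(A, compute_uv=False)
    s_B = np.linalg.svd(B, compute_uv=False)
    k = min(m2, m3)
    # s_min{m2,m3}(AB) >= s_{m2}(A) * s_min{m2,m3}(B), up to floating-point tolerance
    assert s_AB[k - 1] >= s_A[m2 - 1] * s_B[k - 1] - 1e-10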

Lemma A.2. Suppose Z is a matrix with independent zero-mean σ-sub-Gaussian entries, d is a fixed number, r0 = rd = 1.

  1. Suppose $Z\in\mathbb{R}^{p\times q}$, $A\in\mathbb{R}^{m\times p}$, $B\in\mathbb{R}^{q\times n}$ satisfy ∥A∥, ∥B∥ ≤ 1, m ≤ p, n ≤ q. Then
    (AZB2σm+t)25nexp[c min(t2m,t)]. (47)
    (AZBFσmn+t)2 exp[c min(t2mn,t)]. (48)
  2. Suppose $Z\in\mathbb{R}^{(p_1\cdots p_k)\times m}$, 2 ≤ k ≤ d − 1. Then
    maxUi(piri1)×riUi1(IpkUk1)(Ip2pkU1)ZCσi=1k1piri1ri+pkrk1+m. (49)
    with probability at least $1-C\exp(-c(\sum_{i=1}^{k-1}p_ir_{i-1}r_i+p_kr_{k-1}+m))$.
  3. Suppose $Z\in\mathbb{R}^{(p_1\cdots p_k)\times(p_{k+1}\cdots p_d)}$, 2 ≤ k ≤ d − 2. Then
    max(U1,,Vd)AUk(IpkUk1)(Ip2pkU1)Z(VdIpk+1pd1)(Vk+2Ipk+1)Cσi=1dpiri1ri (50)
    with probability at least $1-C\exp(-c\sum_{i=1}^{d}p_ir_{i-1}r_i)$. Here,
    A={(U1,,Uk,Vk+2,,Vd):Ui(piri1)×ri,Ui1,Vj(piri)×ri1,Vj1}. (51)
  4. Suppose $Z\in\mathbb{R}^{(p_1\cdots p_{d-1})\times p_d}$. Then with probability at least $1-C\exp(-c\sum_{i=1}^{d}p_ir_{i-1}r_i)$,
    maxUi(piri1)×ri,Ui1Ud1(Ipd1Ud2)(Ip2pd1U1)ZFCσi=1dpiri1ri. (52)
  5. Suppose $Z\in\mathbb{R}^{(p_1\cdots p_k)\times(p_{k+1}\cdots p_d)}$, 2 ≤ k ≤ d − 2. Then
    max(U1,,Vd)AUk(IpkUk1)(Ip2pkU1)Z·(VdIpk+1pd1)(Vk+2Ipk+1)FCσi=1dpiri1ri (53)
    with probability at least $1-C\exp(-c\sum_{i=1}^{d}p_ir_{i-1}r_i)$. Here, A is defined in (51).

Proof of Lemma A.2. Without loss of generality, assume σ = 1.

  1. For fixed xn satisfying ∥x2 = 1, we have AZBx = (xBA)vec(Z). Since Zij is 1-sub-Gaussian, we know that Var(Zij) ≤ 1. In addition,
    E(xBA)vec(Z)22=E[tr(vec(Z)(xBA)(xBA)vec(Z))]=tr[E((xBA)(xBA)vec(Z)vec(Z))]=tr[(xBA)(xBA)E(vec(Z)vec(Z))]tr((xBA)(xBA))=xBAF2=Bx22AF2x22AF2m. (54)
    The first inequality holds since E(vec(Z)vec(Z)) is a diagonal matrix with diagonal entries Var(Zij) ≤ 1; the last inequality is due to ∥AF ≤ min{m, p}∥A2m. By Hanson-Wright inequality, we have
    (AZBx22mt)2 exp[c min(t2(BxxB)(AA)F2,t(BxxB)(AA))].
    Since ∥x2 = 1 and ∥A∥, ∥B∥ ≤ 1,
    (BxxB)(AA)F2=BxxBF2AAF2=(xBBx)2AAF2(xx)2AAF2=i=1min{m,p}σi4(A)m, (BxxB)(AA)BxxBAAxxAA1. 
    Thus, for fixed x satisfying ∥x2 = 1, we have
    (AZBx22m+t)2 exp[c min(t2m,t)]. (55)
    By [93][Lemma 5.2], there exists N1/2, a 1/2-net of {xn:x2=1}, such that |N1/2|5n. The union bound, [93][Lemma 5.2] and (55) together imply that
    (AZB2m+t)(maxxN1/2AZBx2m+t)25nexp[c min(t2m,t)].
    For ∥AZBF, note that AZB = (BA)vec(Z), Similarly to (54), we have
    E(BA)vec(Z)22=E[vec(Z)(BA)(BA)vec(Z)]=E{tr[vec(Z)(BA)(BA)vec(Z)]}=tr{E[(BA)(BA)vec(Z)vec(Z)]}=tr[(BA)(BA)E(vec(Z)vec(Z))]tr[(BA)(BA)]=BAF2=BF2AF2mn.
    By Hanson-Wright inequality, we have
    (AZBF2mnt)2 exp[c min(t2(BB)(AA)F2,t(BB)(AA))].
    Since ∥A∥, ∥B∥ ≤ 1, we have
    (BB)(AA)F=AAF2BBF2=i=1min{m,p}σi4(A)i=1min{q,n}σi4(B)mn,(BB)(AA)1.
    Therefore,
    (AZBF2mn+t)2 exp[c min(t2mn,t)].
  2. For fixed xm and A(pkrk1)×(p1pk) satisfying ∥x2 = 1 and ∥A∥ ≤ 1, by (47) with B = Im, we have
    (AZ2pkrk1+t)25mexp[c min(t2pkrk1,t)]. (56)
    By [48][Lemma 7], for 1 ≤ ik − 1, that exist ε-nets: Ui(1),,Ui(Ni)(piri1)×ri (here r0 = 1), Ni((2+ε)/ε)(piri1)×ri, such that
    U(piri1)×ri satisfying U1,1jNi s.t. Ui(j)Uε.
    Therefore,
    (maxi1,,ik1(IpkUk1(ik1))(Ip2pkU1(i1))Z2pkrk1+t)2((2+ε)/ε)i=1k1piri1ri5mexp[c min(t2pkrk1,t)]. (57)
    Let
    U1*,,Uk1* arg maxUi(piri1)×ri,Ui1,     1ik1(IpkUk1)(Ip2pkU1)Z,M=maxUi(piri1)×ri,Ui1,    1ik1(IpkUk1)(Ip2pkU1)Z.
    Then for any 1 ≤ ik − 1, there exists 1 ≤ jiNi, such that Ui(ji)Ui*ε. Then
    M=(IpkUk1*)(Ip2pkU1*)Z(IpkUk1(jk1))(Ip2pkU1(j1))Z+(Ipk(Uk1*Uk1(jk1)))(Ipk1pkUk2(jk2))(Ip2pkU1(j1))Z++(IpkUk1*)(Ip3pkU2*)(Ip2pk(U1*U1(j1)))Z(IpkUk1(jk1))(Ip2pkU1(j1))Z+ε(k1)M. (58)
    Combining (57) and the previous inequality, we have
    (M2pkrk1+t1(k1)ε)2((2+ε)/ε)i=1k1piri1ri5mexp[c min(t2pkrk1,t)]. (59)
    By setting ε=12(k1) and t=Ci=1k1piri1ri+pkrk1+m, we have proved (49).
  3. For fixed Ark×(p1pk), B(pk+1pd)×(pk+1rk+1) satisfying ∥A∥ ≤ 1, ∥B∥ ≤ 1, by (47), we have
    (AZB2rk+t)25pk+1rk+1exp[c min(t2rk,t)].
    Let
    M=max(U1,,Vd)AUk(IpkUk1)(Ip2pkU1)Z(VdIpk+1pd1)(Vk+2Ipk+1),
    By similar arguments as (59), one has
    (M2rk+t1(d1)ε)2((2+ε)/ε)1id,ik+1piri1ri5pk+1rk+1exp[c min(t2rk,t)]
    for any 0<ϵ<1d. By setting ε=12(d1) and t=Ci=1dpiri1ri, we have proved the third part of Lemma A.2.
  4. For fixed U1, …, Ud−1 satisfying ∥Ui∥ ≤ 1, let A=Ud1(Ipd1Ud2)(Ip2pd1U1)rd1×(p1pd1), then ∥A∥ ≤ 1. By (48) with B=Ipd, we have
    (AZF2pdrd1+t)2 exp[c min(t2pdrd1,t)].
    Let
    M=maxUi(piri1)×ri,Ui1Ud1(Ipd1Ud2)(Ip2pd1U1)ZF.
    The similar proof of (59) leads us to
    (M2rd1pd+t(1ε(d1))2)2((2+ε)/ε)k=1d1pkrk1rkexp[c min(t2pdrd1,t)]. (60)
    for 0<ε<1d1. By setting ε=12(d1) and t=Ck=1dpkrk1rk, we have arrived at (52).
  5. For fixed Ark×(p1pk), B(pk+1pd)×(pk+1rk+1), ∥A∥ ≤ 1, ∥B∥ ≤ 1, by (48), we have
    (AZBF2pk+1rk+1rk+t)2 exp[c min(t2pk+1rk+1rk,t)].
    Let
    M=max(U1,,Vd)AUk(IpkUk1)(Ip2pkU1)Z(VdIpk+1pd1)(Vk+2Ipk+1)F.
    Similarly to (59), for any 0<ε<1d1, we have
    (Mpk+1rk+1rk+t1(d1)ε)2((2+ε)/ε)1id,ik+1piri1riexp[c min(t2pk+1rk+1rk,t)]. (61)
    By setting ε=12(d1) and t=Ci=1dpiri1ri, we have proved (53). □
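
As a small illustration of the scaling in Part 1 of Lemma A.2 (not a proof), the following simulation draws a Gaussian Z and orthonormal A, B with ‖A‖ = ‖B‖ = 1 and compares the typical value of ‖AZB‖ with a 2√m benchmark suggested by (47); the matrix sizes and the Gaussian choice of Z are assumptions made only for this check.

import numpy as np

rng = np.random.default_rng(7)
m, p, q, n = 20, 200, 150, 10
A = np.linalg.qr(rng.normal(size=(p, m)))[0].T   # m x p with orthonormal rows, so ||A|| = 1
B = np.linalg.qr(rng.normal(size=(q, n)))[0]     # q x n with orthonormal columns, so ||B|| = 1

norms = [np.linalg.norm(A @ rng.normal(size=(p, q)) @ B, 2) for _ in range(200)]
print(np.mean(norms), 2 * np.sqrt(m))            # typical value of ||AZB|| vs. the 2*sqrt(m) scale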

Lemma A.3. Suppose $X,Z\in\mathbb{R}^{p_1\times p_2}$, rank(X) = r. Let $Y=X+Z$, $\hat U=\mathrm{SVD}_r^L(Y)$, $\hat V=\mathrm{SVD}_r^R(Y)$. Then we have

\max\{\|\hat U_\perp^\top X\|,\|X\hat V_\perp\|\}\le 2\|Z\|,\qquad \max\{\|\hat U_\perp^\top X\|_{\mathrm F},\|X\hat V_\perp\|_{\mathrm F}\}\le 2\min\{\|Z\|_{\mathrm F},\sqrt{r}\|Z\|\}.

Proof of Lemma A.3. See [48, Lemma 6] and [96, Theorem 1]. □
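
Lemma A.3 can likewise be checked numerically; the random rank-r signal and Gaussian noise below are illustrative assumptions used only for this sanity check.

import numpy as np

rng = np.random.default_rng(8)
p1, p2, r = 40, 30, 3
for _ in range(50):
    X = rng.normal(size=(p1, r)) @ rng.normal(size=(r, p2))   # rank-r signal
    Z = 0.1 * rng.normal(size=(p1, p2))
    U, _, Vt = np.linalg.svd(X + Z)
    U_perp, V_perp = U[:, r:], Vt[r:, :].T                    # complements of the top-r singular subspaces of Y
    assert np.linalg.norm(U_perp.T @ X, 2) <= 2 * np.linalg.norm(Z, 2) + 1e-10
    assert np.linalg.norm(X @ V_perp, 2) <= 2 * np.linalg.norm(Z, 2) + 1e-10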


Contributor Information

Yuchen Zhou, Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA.

Anru R. Zhang, Departments of Biostatistics & Bioinformatics, Computer Science, Mathematics, and Statistical Science, Duke University, Durham, NC 27710, USA

Lili Zheng, Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA.

Yazhen Wang, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA.

REFERENCES

  • [1].Oseledets IV, “Tensor-train decomposition,” SIAM Journal on Scientific Computing, vol. 33, no. 5, pp. 2295–2317, 2011. [Google Scholar]
  • [2].Bi X, Qu A, and Shen X, “Multilayer tensor factorization with applications to recommender systems,” The Annals of Statistics, vol. 46, no. 6B, pp. 3308–3333, 2018. [Google Scholar]
  • [3].Nasiri M, Rezghi M, and Minaei B, “Fuzzy dynamic tensor decomposition algorithm for recommender system,” UCT Journal of Research in Science, Engineering and Technology, vol. 2, no. 2, pp. 52–55, 2014. [Google Scholar]
  • [4].Wozniak JR, Krach L, Ward E, Mueller BA, Muetzel R, Schnoebelen S, Kiragu A, and Lim KO, “Neurocognitive and neuroimaging correlates of pediatric traumatic brain injury: a diffusion tensor imaging (dti) study,” Archives of Clinical Neuropsychology, vol. 22, no. 5, pp. 555–568, 2007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Zhou H, Li L, and Zhu H, “Tensor regression with applications in neuroimaging data analysis,” Journal of the American Statistical Association, vol. 108, no. 502, pp. 540–552, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Anandkumar A, Ge R, Hsu D, Kakade SM, and Telgarsky M, “Tensor decompositions for learning latent variable models,” Journal of Machine Learning Research, vol. 15, pp. 2773–2832, 2014. [Google Scholar]
  • [7].Oseledets IV and Tyrtyshnikov EE, “Breaking the curse of dimensionality, or how to use svd in many dimensions,” SIAM Journal on Scientific Computing, vol. 31, no. 5, pp. 3744–3759, 2009. [Google Scholar]
  • [8].Cichocki A, Mandic D, De Lathauwer L, Zhou G, Zhao Q, Caiafa C, and Phan HA, “Tensor decompositions for signal processing applications: From two-way to multiway component analysis,” IEEE signal processing magazine, vol. 32, no. 2, pp. 145–163, 2015. [Google Scholar]
  • [9].Mondelli M and Montanari A, “On the connection between learning two-layer neural networks and tensor decomposition,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 1051–1060. [Google Scholar]
  • [10].Zhong K, Song Z, and Dhillon IS, “Learning non-overlapping convolutional neural networks with multiple kernels,” arXiv preprint arXiv:1711.03440, 2017. [Google Scholar]
  • [11].Li N and Li B, “Tensor completion for on-board compression of hyperspectral images,” in 2010 IEEE International Conference on Image Processing. IEEE, 2010, pp. 517–520. [Google Scholar]
  • [12].Zhang C, Han R, Zhang AR, and Voyles PM, “Denoising atomic resolution 4d scanning transmission electron microscopy data with tensor singular value decomposition,” Ultramicroscopy, vol. 219, p. 113123, 2020. [DOI] [PubMed] [Google Scholar]
  • [13].Bhattacharya A and Dunson DB, “Simplex factor models for multi-variate unordered categorical data,” Journal of the American Statistical Association, vol. 107, no. 497, pp. 362–377, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [14].Dunson DB and Xing C, “Nonparametric bayes modeling of multi-variate categorical data,” Journal of the American Statistical Association, vol. 104, no. 487, pp. 1042–1051, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Calvi GG, Moniri A, Mahfouz M, Yu Z, Zhao Q, and Mandic DP, “Tucker tensor layer in fully connected neural networks,” arXiv preprint arXiv:1903.06133, 2019. [Google Scholar]
  • [16].Novikov A, Podoprikhin D, Osokin A, and Vetrov DP, “Tensorizing neural networks,” in Advances in neural information processing systems, 2015, pp. 442–450. [Google Scholar]
  • [17].Novikov A, Rodomanov A, Osokin A, and Vetrov D, “Putting mrfs on a tensor train,” in International Conference on Machine Learning, 2014, pp. 811–819. [Google Scholar]
  • [18].Fannes M, Nachtergaele B, and Werner RF, “Finitely correlated states on quantum spin chains,” Communications in mathematical physics, vol. 144, no. 3, pp. 443–490, 1992. [Google Scholar]
  • [19].Oseledets I, “A new tensor decomposition,” in Doklady Mathematics, vol. 80, no. 1. Pleiades Publishing, Ltd., 2009, pp. 495–496. [Google Scholar]
  • [20].Oseledets I and Tyrtyshnikov E, “Recursive decomposition of multidimensional tensors,” in Doklady Mathematics, vol. 80, no. 1. Springer, 2009, pp. 460–462. [Google Scholar]
  • [21].Orús R, “Tensor networks for complex quantum systems,” Nature Reviews Physics, vol. 1, no. 9, pp. 538–550, 2019. [Google Scholar]
  • [22].Bravyi S, Gosset D, and Movassagh R, “Classical algorithms for quantum mean values,” Nature Physics, vol. 17, no. 3, pp. 337–341, 2021. [Google Scholar]
  • [23].Rakhuba M and Oseledets I, “Calculating vibrational spectra of molecules using tensor train decomposition,” The Journal of Chemical Physics, vol. 145, no. 12, p. 124101, 2016. [DOI] [PubMed] [Google Scholar]
  • [24].Schollwöck U, “The density-matrix renormalization group in the age of matrix product states,” Annals of physics, vol. 326, no. 1, pp. 96–192, 2011. [Google Scholar]
  • [25].Stoudenmire E and Schwab DJ, “Supervised learning with tensor networks,” in Advances in Neural Information Processing Systems, 2016, pp. 4799–4807. [Google Scholar]
  • [26].Bigoni D, Engsig-Karup AP, and Marzouk YM, “Spectral tensor-train decomposition,” SIAM Journal on Scientific Computing, vol. 38, no. 4, pp. A2405–A2439, 2016. [Google Scholar]
  • [27].Oseledets I and Tyrtyshnikov E, “Tt-cross approximation for multidimensional arrays,” Linear Algebra and its Applications, vol. 432, no. 1, pp. 70–88, 2010. [Google Scholar]
  • [28].Hillar CJ and Lim L-H, “Most tensor problems are np-hard,” Journal of the ACM (JACM), vol. 60, no. 6, pp. 1–39, 2013. [Google Scholar]
  • [29].Dolgov SV and Savostyanov DV, “Alternating minimal energy methods for linear systems in higher dimensions,” SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. A2248–A2271, 2014. [Google Scholar]
  • [30].Song Z, Woodruff DP, and Zhong P, “Relative error tensor low rank approximation,” in Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2019, pp. 2772–2789. [Google Scholar]
  • [31].Li L, Yu W, and Batselier K, “Faster tensor train decomposition for sparse data,” Journal of Computational and Applied Mathematics, vol. 405, p. 113972, 2022. [Google Scholar]
  • [32].Lubich C, Rohwedder T, Schneider R, and Vandereycken B, “Dynamical approximation by hierarchical tucker and tensor-train tensors,” SIAM Journal on Matrix Analysis and Applications, vol. 34, no. 2, pp. 470–494, 2013. [Google Scholar]
  • [33].Grasedyck L, Kluge M, and Kramer S, “Variants of alternating least squares tensor completion in the tensor train format,” SIAM Journal on Scientific Computing, vol. 37, no. 5, pp. A2424–A2450, 2015. [Google Scholar]
  • [34].Bengua JA, Phien HN, Tuan HD, and Do MN, “Efficient tensor completion for color image and video recovery: Low-rank tensor train,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2466–2479, 2017. [DOI] [PubMed] [Google Scholar]
  • [35].Steinlechner MM, “Riemannian optimization for solving highd-imensional problems with low-rank tensor structure,” EPFL, Tech. Rep, 2016. [Google Scholar]
  • [36].Novikov A, Izmailov P, Khrulkov V, Figurnov M, and Oseledets IV, “Tensor train decomposition on tensorflow (t3f),” Journal of Machine Learning Research, vol. 21, no. 30, pp. 1–7, 2020. [Google Scholar]
  • [37].Cai TT and Zhang A, “Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics,” The Annals of Statistics, vol. 46, no. 1, pp. 60–89, 2018. [Google Scholar]
  • [38].Candes EJ, Sing-Long CA, and Trzasko JD, “Unbiased risk estimates for singular value thresholding and spectral estimators,” IEEE transactions on signal processing, vol. 61, no. 19, pp. 4643–4657, 2013. [Google Scholar]
  • [39].Donoho D and Gavish M, “Minimax risk of matrix denoising by singular value thresholding,” The Annals of Statistics, vol. 42, no. 6, pp. 2413–2440, 2014. [Google Scholar]
  • [40].Cai J-F, Candès EJ, and Shen Z, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on optimization, vol. 20, no. 4, pp. 1956–1982, 2010. [Google Scholar]
  • [41].Chatterjee S, “Matrix estimation by universal singular value thresholding,” The Annals of Statistics, vol. 43, no. 1, pp. 177–214, 2015. [Google Scholar]
  • [42].Klopp O, “Matrix completion by singular value thresholding: sharp bounds,” Electronic journal of statistics, vol. 9, no. 2, pp. 2348–2369, 2015. [Google Scholar]
  • [43].Zhang H, Cheng L, and Zhu W, “A lower bound guaranteeing exact matrix completion via singular value thresholding algorithm,” Applied and Computational Harmonic Analysis, vol. 31, no. 3, pp. 454–459, 2011. [Google Scholar]
  • [44].Nadler B, “Finite sample approximation results for principal component analysis: A matrix perturbation approach,” The Annals of Statistics, vol. 36, no. 6, pp. 2791–2817, 2008. [Google Scholar]
  • [45].Zhang A and Wang M, “Spectral state compression of markov processes,” IEEE Transactions on Information Theory, vol. 66, no. 5, pp. 3202–3231, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [46].De Lathauwer L, De Moor B, and Vandewalle J, “A multilinear singular value decomposition,” SIAM journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253–1278, 2000. [Google Scholar]
  • [47].——, “On the best rank-1 and rank-(r 1, r 2,…, rn) approximation of higher-order tensors,” SIAM journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1324–1342, 2000. [Google Scholar]
  • [48].Zhang A and Xia D, “Tensor SVD: Statistical and computational limits,” IEEE Transactions on Information Theory, vol. 64, no. 11, pp. 7311–7338, 2018. [Google Scholar]
  • [49].Vannieuwenhoven N, Vandebril R, and Meerbergen K, “A new truncation strategy for the higher-order singular value decomposition,” SIAM Journal on Scientific Computing, vol. 34, no. 2, pp. A1027–A1052, 2012. [Google Scholar]
  • [50].Zhang A and Han R, “Optimal sparse singular value decomposition for high-dimensional high-order data,” Journal of the American Statistical Association, pp. 1–34, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [51].Kolda TG and Bader BW, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009. [Google Scholar]
  • [52].Sharan V and Valiant G, “Orthogonalized als: A theoretically principled tensor decomposition algorithm for practical use,” in International Conference on Machine Learning, 2017, pp. 3095–3104. [Google Scholar]
  • [53].Leurgans SE, Ross RT, and Abel RB, “A decomposition for three-way arrays,” SIAM Journal on Matrix Analysis and Applications, vol. 14, no. 4, pp. 1064–1083, 1993. [Google Scholar]
  • [54].Rajih M, Comon P, and Harshman RA, “Enhanced line search: A novel method to accelerate parafac,” SIAM journal on matrix analysis and applications, vol. 30, no. 3, pp. 1128–1147, 2008. [Google Scholar]
  • [55].Colombo N and Vlassis N, “Tensor decomposition via joint matrix schur decomposition,” in International Conference on Machine Learning, 2016, pp. 2820–2828. [Google Scholar]
  • [56].Anandkumar A, Deng Y, Ge R, and Mobahi H, “Homotopy analysis for tensor pca,” in Conference on Learning Theory. PMLR, 2017, pp. 79–104. [Google Scholar]
  • [57].Arous GB, Mei S, Montanari A, and Nica M, “The landscape of the spiked tensor model,” Communications on Pure and Applied Mathematics, vol. 72, no. 11, pp. 2282–2330, 2019. [Google Scholar]
  • [58].Hopkins SB, Shi J, and Steurer D, “Tensor principal component analysis via sum-of-square proofs,” in Conference on Learning Theory, 2015, pp. 956–1006. [Google Scholar]
  • [59].Luo Y and Zhang AR, “Tensor clustering with planted structures: Statistical optimality and computational limits,” arXiv preprint arXiv:2005.10743, 2020. [Google Scholar]
  • [60].Perry A, Wein AS, and Bandeira AS, “Statistical limits of spiked tensor models,” in Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 56, no. 1. Institut Henri Poincaré, 2020, pp. 230–264. [Google Scholar]
  • [61].Richard E and Montanari A, “A statistical model for tensor pca,” in Advances in Neural Information Processing Systems, 2014, pp. 2897–2905. [Google Scholar]
  • [62].Lesieur T, Miolane L, Lelarge M, Krzakala F, and Zdeborová L, “Statistical and computational phase transitions in spiked tensor estimation,” in 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2017, pp. 511–515. [Google Scholar]
  • [63].Allen G, “Sparse higher-order principal components analysis,” in Artificial Intelligence and Statistics, 2012, pp. 27–36. [Google Scholar]
  • [64].Allen GI, “Regularized tensor factorizations and higher-order principal components analysis,” arXiv preprint arXiv:1202.2476, 2012. [Google Scholar]
  • [65].Liu Y, Chen L, and Zhu C, “Improved robust tensor principal component analysis via low-rank core matrix,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 6, pp. 1378–1389, 2018. [Google Scholar]
  • [66].Lu C, Feng J, Chen Y, Liu W, Lin Z, and Yan S, “Tensor robust principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5249–5257. [Google Scholar]
  • [67].——, “Tensor robust principal component analysis with a new tensor nuclear norm,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 4, pp. 925–938, 2019. [DOI] [PubMed] [Google Scholar]
  • [68].Zhou P and Feng J, “Outlier-robust tensor pca,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2263–2271. [Google Scholar]
  • [69].Luo Y, Raskutti G, Yuan M, and Zhang AR, “A sharp blockwise tensor perturbation bound for orthogonal iteration,” Journal of machine learning research, vol. 22, no. 179, pp. 1–48, 2021. [Google Scholar]
  • [70].Wein AS, El Alaoui A, and Moore C, “The kikuchi hierarchy and tensor pca,” in 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2019, pp. 1446–1468. [Google Scholar]
  • [71].Benson AR, Gleich DF, and Lim L-H, “The spacey random walk: A stochastic process for higher-order data,” SIAM Review, vol. 59, no. 2, pp. 321–345, 2017. [Google Scholar]
  • [72].Raftery AE, “A model for high-order markov chains,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 47, no. 3, pp. 528–539, 1985. [Google Scholar]
  • [73].Tsay RS, Analysis of financial time series. John wiley & sons, 2005, vol. 543. [Google Scholar]
  • [74].Zhao J and Sun S, “High-order gaussian process dynamical models for traffic flow prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 7, pp. 2014–2019, 2016. [Google Scholar]
  • [75].Berchtold A and Raftery AE, “The mixture transition distribution model for high-order markov chains and non-gaussian time series,” Statistical Science, pp. 328–356, 2002. [Google Scholar]
  • [76].Ganguly A, Petrov T, and Koeppl H, “Markov chain aggregation and its applications to combinatorial reaction networks,” Journal of mathematical biology, vol. 69, no. 3, pp. 767–797, 2014. [DOI] [PubMed] [Google Scholar]
  • [77].Du Z, Ozay N, and Balzano L, “Mode clustering for markov jump systems,” in 2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). IEEE, 2019, pp. 126–130. [Google Scholar]
  • [78].Sanders J, Proutière A, and Yun S-Y, “Clustering in block markov chains,” The Annals of Statistics, vol. to appear, 2020. [Google Scholar]
  • [79].Zhu Z, Li X, Wang M, and Zhang A, “Learning Markov models via low-rank optimization,” arXiv preprint arXiv:1907.00113, 2019. [Google Scholar]
  • [80].Kearns MJ and Singh SP, “Finite-sample convergence rates for q-learning and indirect algorithms,” in Advances in neural information processing systems, 1999, pp. 996–1002. [Google Scholar]
  • [81].Duchi J, Shalev-Shwartz S, Singer Y, and Chandra T, “Efficient projections onto the l 1-ball for learning in high dimensions,” in Proceedings of the 25th international conference on Machine learning, 2008, pp. 272–279. [Google Scholar]
  • [82].Han R, Luo Y, Wang M, and Zhang AR, “Exact clustering in tensor block model: Statistical optimality and computational limit,” arXiv preprint arXiv:2012.09996, 2020. [Google Scholar]
  • [83].Liu Y, Wang F, Xiao Y, and Gao S, “Urban land uses and traffic source-sink areas: Evidence from gps-enabled taxi data in shanghai,” Landscape and Urban Planning, vol. 106, no. 1, pp. 73–87, 2012. [Google Scholar]
  • [84].Li SZ, Markov random field modeling in image analysis. Springer Science & Business Media, 2009. [Google Scholar]
  • [85].Zhang Y, Brady M, and Smith S, “Segmentation of brain mr images through a hidden markov random field model and the expectation-maximization algorithm,” IEEE transactions on medical imaging, vol. 20, no. 1, pp. 45–57, 2001. [DOI] [PubMed] [Google Scholar]
  • [86].Wei Z and Li H, “A markov random field model for network-based analysis of genomic data,” Bioinformatics, vol. 23, no. 12, pp. 1537–1544, 2007. [DOI] [PubMed] [Google Scholar]
  • [87].Chaplot DS, Bhattacharyya P, and Paranjape A, “Unsupervised word sense disambiguation using markov random field and dependency parser,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. [Google Scholar]
  • [88].Wainwright MJ and Jordan MI, “Graphical models, exponential families, and variational inference,” Foundations and Trends® in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008. [Google Scholar]
  • [89].Duan Y, Wang M, Wen Z, and Yuan Y, “Adaptive low-nonnegative-rank approximation for state aggregation of markov chains,” SIAM Journal on Matrix Analysis and Applications, vol. 41, no. 1, pp. 244–278, 2020. [Google Scholar]
  • [90].Puterman ML, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. [Google Scholar]
  • [91].Singh SP, Jaakkola T, and Jordan MI, “Reinforcement learning with soft state aggregation,” in Advances in neural information processing systems, 1995, pp. 361–368. [Google Scholar]
  • [92].Sutton RS and Barto AG, Introduction to reinforcement learning. MIT press Cambridge, 1998, vol. 135. [Google Scholar]
  • [93].Vershynin R, “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027, 2010. [Google Scholar]
  • [94].Cai TT, Ma Z, and Wu Y, “Sparse pca: Optimal rates and adaptive estimation,” The Annals of Statistics, vol. 41, no. 6, pp. 3074–3110, 2013. [Google Scholar]
  • [95].Wainwright MJ, High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019, vol. 48. [Google Scholar]
  • [96].Luo Y, Han R, and Zhang AR, “A schatten-q low-rank matrix perturbation analysis via perturbation projection error bound,” Linear Algebra and its Applications, vol. 630, pp. 225–240, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0024379521002962 [Google Scholar]
