Author manuscript; available in PMC: 2023 Jun 1.
Published in final edited form as: IEEE Trans Inf Theory. 2022 Feb 18;68(6):3991–4019. doi: 10.1109/tit.2022.3152733

Optimal High-order Tensor SVD via Tensor-Train Orthogonal Iteration

Yuchen Zhou 1, Anru R Zhang 2, Lili Zheng 3, Yazhen Wang 4
PMCID: PMC9585995  NIHMSID: NIHMS1809459  PMID: 36274655

Abstract

This paper studies a general framework for high-order tensor SVD. We propose a new computationally efficient algorithm, tensor-train orthogonal iteration (TTOI), that aims to estimate the low tensor-train rank structure from the noisy high-order tensor observation. The proposed TTOI consists of initialization via TT-SVD [1] and new iterative backward/forward updates. We develop the general upper bound on estimation error for TTOI with the support of several new representation lemmas on tensor matricizations. By developing a matching information-theoretic lower bound, we also prove that TTOI achieves the minimax optimality under the spiked tensor model. The merits of the proposed TTOI are illustrated through applications to estimation and dimension reduction of high-order Markov processes, numerical studies, and a real data example on New York City taxi travel records. The software of the proposed algorithm is available online (https://github.com/Lili-Zheng-stat/TTOI).

Index Terms—Tensor SVD, tensor-train, high-order tensors, orthogonal iteration, minimax optimality, high-order Markov chain

I. Introduction

Tensors, or high-order arrays, have attracted increasing attention in modern machine learning, computational mathematics, statistics, and data science. Specific examples include recommender systems [2], [3], neuroimaging analysis [4], [5], latent variable learning [6], multidimensional convolution [7], signal processing [8], neural networks [9], [10], computational imaging [11], [12], and contingency tables [13], [14]. In addition to low-order tensors (i.e., tensors of relatively small order), high-order tensors also commonly arise in applications in statistics and machine learning. For example, in convolutional neural networks, parameters in fully connected layers can be represented as high-order tensors [15], [16]. In an order-d Markov process, where the future state depends jointly on the current and (d − 1) previous states, the transition probabilities form an order-(d + 1) tensor. For an order-d Markov decision process, the transition probabilities can be represented by an order-(2d + 1) tensor, with the additional d modes representing the past d actions. High-order tensors are also used to represent the joint probability in Markov random fields [17].

Compared to low-order tensors, high-order tensors encompass many more parameters and a more sophisticated structure, leading to prohibitive costs in storage, processing, and analysis: an order-$d$ dimension-$p$ tensor contains $p^d$ parameters. To address this issue, some low-dimensional parametrization is usually considered to capture the most informative subspaces in the tensor. In particular, the tensor-train (TT) decomposition [18], [19], [20], [1], [21] introduced a classic low-dimensional parameterization to model the subspaces and latent cores in high-order tensor structures. TT decomposition has been used in a wide range of applications in physics and quantum computation [22], [18], [21], [23], [24], signal processing [8], and supervised learning [25], among many others. For example, the TT decomposition framework is utilized in quantum information science for modeling complex quantum states and handling the quantum mean value problem [22], [18], [21], [23]. The TT-decomposition of a tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$ is defined as follows:

$$\mathcal{X}_{i_1,\ldots,i_d} = G_{1,[i_1,:]}\, \mathcal{G}_{2,[:,i_2,:]} \cdots \mathcal{G}_{d-1,[:,i_{d-1},:]}\, G_{d,[i_d,:]} = \sum_{\alpha_1=1}^{r_1} \cdots \sum_{\alpha_{d-1}=1}^{r_{d-1}} G_{1,[i_1,\alpha_1]}\, \mathcal{G}_{2,[\alpha_1,i_2,\alpha_2]} \cdots \mathcal{G}_{d-1,[\alpha_{d-2},i_{d-1},\alpha_{d-1}]}\, G_{d,[i_d,\alpha_{d-1}]}. \tag{1}$$

Here, the smallest values of $r_1, \ldots, r_{d-1}$ that enable the decomposition (1) are called the TT-rank of $\mathcal{X}$. [1] shows that the TT-rank satisfies $r_k = \mathrm{rank}([\mathcal{X}]_k)$, i.e., the rank of the $k$th sequential unfolding of $\mathcal{X}$ (see the formal definition of sequential unfolding in Section II-A). $G_1 \in \mathbb{R}^{p_1 \times r_1}$, $\mathcal{G}_k \in \mathbb{R}^{r_{k-1} \times p_k \times r_k}$, and $G_d \in \mathbb{R}^{p_d \times r_{d-1}}$ are the TT-cores that multiply sequentially like a "train": $\mathcal{X}_{i_1,\ldots,i_d}$ equals the product of the $i_1$th vector of $G_1$, the $i_2$th matrix of $\mathcal{G}_2$, ..., the $i_{d-1}$th matrix of $\mathcal{G}_{d-1}$, and the $i_d$th vector of $G_d$. For convenience of presentation, we simplify (1) to

$$\mathcal{X} = [\![ G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d ]\!]$$

and denote $r_0 = r_d = 1$ throughout the paper. In particular, the TT-rank and TT decomposition reduce to the regular matrix rank and decomposition when $d = 2$. If all dimensions $p$ and ranks $r$ are the same, the TT parameterization involves $O(2pr + (d-2)pr^2)$ values, which can be significantly smaller than the $O(r^d + dpr)$ values of the Tucker decomposition and the $O(p^d)$ values of the regular parameterization.
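To make the TT format in (1) concrete, the following minimal NumPy sketch (our own illustration, not the authors' released software) contracts a list of TT-cores into the full tensor; the function name tt_to_tensor and the core shapes simply follow the conventions stated above.

```python
import numpy as np

def tt_to_tensor(cores):
    """Contract TT-cores into the full tensor, following (1).

    cores[0] has shape (p1, r1), cores[k] has shape (r_k, p_{k+1}, r_{k+1})
    for the middle cores, and cores[-1] has shape (p_d, r_{d-1}).
    """
    d = len(cores)
    full = cores[0]                                   # order-2: (p1, r1)
    for k in range(1, d - 1):
        # contract the trailing rank index with the first mode of the next core
        full = np.tensordot(full, cores[k], axes=([-1], [0]))
    # last core: contract over r_{d-1} (its second mode)
    return np.tensordot(full, cores[-1], axes=([-1], [1]))

# Example: a random order-4 TT tensor with p = 5 and TT-rank (2, 3, 2).
rng = np.random.default_rng(0)
p, ranks = 5, (2, 3, 2)
cores = [rng.standard_normal((p, ranks[0])),
         rng.standard_normal((ranks[0], p, ranks[1])),
         rng.standard_normal((ranks[1], p, ranks[2])),
         rng.standard_normal((p, ranks[2]))]
X = tt_to_tensor(cores)
print(X.shape)  # (5, 5, 5, 5)
```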

In most of the existing literature, the TT-decomposition was considered under deterministic settings, and the central goal was often to approximate nonrandom high-order tensors by low-dimensional structures [26], [27], [1]. However, in modern data science applications such as Markov processes, Markov decision processes, and Markov random fields, the (transition) probability tensor computed from data is often a random realization of an underlying true tensor. In these cases, estimating the underlying low-dimensional parameters hidden in the noisy observations can be more important: an accurate estimate of the transition tensor renders reliable prediction of future states in high-order Markov chains and better decision-making in high-order Markov decision processes; an accurate estimate of a probability tensor sheds light on the underlying relationship among different variables in a random system [17]. To achieve such a goal, it is crucial to develop dimension reduction methods that incorporate the TT-decomposition into probabilistic models. Since singular value decomposition (SVD) is one of the most important dimension reduction methods involving probabilistic models for matrices, and there is no counterpart for high-order tensors, we aim to fill this void by developing a statistical framework and a computationally feasible method for high-order tensor SVD in this paper.

A. Problem Formulation

This paper focuses on the following high-order tensor SVD model. Suppose we observe an order-$d$ tensor $\mathcal{Y}$ that contains a hidden tensor-train (TT) low-rank structure:

$$\mathcal{Y} = \mathcal{X} + \mathcal{Z}, \qquad \mathcal{Y}, \mathcal{X}, \mathcal{Z} \in \mathbb{R}^{p_1 \times \cdots \times p_d}. \tag{2}$$

Here, $\mathcal{X}$ is TT-decomposable as (1) and $\mathcal{Z}$ is a noise tensor. Our goal is to estimate $\mathcal{X}$ and the TT-cores of $\mathcal{X}$ based on $\mathcal{Y}$. To this end, a straightforward idea is to minimize the approximation error as follows:

$$\hat{\mathcal{X}} = \mathop{\arg\min}_{\mathcal{A} \text{ is decomposable as (1)}} \|\mathcal{Y} - \mathcal{A}\|_F^2. \tag{3}$$

However, the approximation error minimization (3) is highly non-convex, and finding the global optimum, even when the rank is $r_1 = \cdots = r_{d-1} = 1$, is NP-hard in general [28]. Instead, a variety of computationally feasible methods have been proposed in the literature to approximate the best tensor-train low-rank decomposition. TT-SVD, a sequential singular value thresholding scheme, was introduced by [1] and will be discussed in detail later. [1] also proposed TT-rounding via sequential QR decompositions, which reduces the TT-rank while ensuring approximation accuracy. [29] introduced the alternating minimal energy algorithm to approximately reconstruct a TT-low-rank tensor based on only a small proportion of revealed entries of the target tensor. [30, Section L.2] proposed a sketching-based algorithm for fast low-TT-rank approximation of arbitrary tensors. [26] studied the tensor-train decomposition for functional tensors. [31] proposed the FastTT algorithm for fast sparse tensor decomposition based on parallel vector rounding and TT-rounding. [32] studied dynamical approximation in the TT format for time-dependent tensors. [33] proposed alternating least squares for tensor completion in the TT format. [34] studied the completion of low-TT-rank tensors and applications to color image and video recovery. [35] studied Riemannian optimization methods for TT decomposition and completion. Also see [36] for a TT decomposition library in TensorFlow. To the best of our knowledge, the estimation performance of most of these procedures remains unclear. Departing from these existing works, in this paper we make a first attempt to minimize the estimation error of $\mathcal{X}$, in addition to achieving minimal approximation error, under possibly random settings.

B. Our Contributions

Under Model (2), we make the following contributions to high-order tensor SVD in this paper.

First, we propose a new algorithm, Tensor-Train Orthogonal Iteration (TTOI), that provides computationally efficient estimation of the low-rank TT structure from the noisy observation. The proposed algorithm includes two major steps. First, we obtain initial estimates $\hat{G}_1^{(0)}, \hat{\mathcal{G}}_2^{(0)}, \ldots, \hat{\mathcal{G}}_{d-1}^{(0)}, \hat{G}_d^{(0)}$ by performing forward sequential SVD based on matricizations and projections. This step is known as TT-SVD in the literature [1]. Next, we use this initialization and perform the newly developed backward and forward updates alternately and iteratively. The TTOI procedure is discussed in detail in Section II.

To see why the TTOI iterations yield better estimates than the classic TT-SVD method, recall that TT-SVD first performs singular value thresholding on $[\mathcal{Y}]_1$, the first sequential unfolding of $\mathcal{Y}$, without any additional updates (see the detailed TT-SVD procedure and the formal definition of $[\mathcal{Y}]_1$ in Section II-A); this can be inaccurate because $[\mathcal{Y}]_1$, a $p_1$-by-$\prod_{k=2}^{d} p_k$ matrix, has a great number of columns. In contrast, each TTOI iteration utilizes the intermediate outcome of the previous iteration to substantially reduce the dimension of $[\mathcal{Y}]_1$ before performing singular value thresholding. In Figure 1, we provide a simple simulation example showing that even one TTOI iteration can significantly improve the estimation of the left singular subspace of $G_1$ (left panel) and of the overall tensor $\mathcal{X}$ (right panel). Therefore, one-step TTOI, i.e., the initialization followed by one TTOI iteration, can be used in practice when the computational cost is a concern.

Fig. 1.

Average estimation error (dots) and standard deviation (bars) of $\|\sin\Theta(\hat{U}_1, U_1)\|$ and $\|\hat{\mathcal{X}} - \mathcal{X}\|_F$ for TT-SVD and one-step TTOI. Both algorithms are applied to the observation $\mathcal{Y}$ generated from (2), where $\mathcal{Z} \overset{i.i.d.}{\sim} N(0, \sigma^2)$ and $\mathcal{X}$ is a randomly generated order-5 tensor based on (1) with $p = 20$, $r = 1$, and $G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d \overset{i.i.d.}{\sim} N(0, 1)$.

We develop theoretical guarantees for TTOI. In particular, we introduce a series of representation lemmas for tensor matricizations in the TT format. Based on them, we develop a deterministic upper bound on the estimation error for both the forward and backward updates in the TTOI iterations. Under the benchmark setting of the spiked tensor model, we develop matching upper and lower bounds and prove that the proposed TTOI algorithm achieves the minimax optimal rate of estimation error. To the best of our knowledge, this is the first statistical optimality result for high-order tensors in the TT format. We also prove that, for any high-order tensor, the TTOI iterations have monotonically decreasing approximation error with respect to the iteration index.

Moreover, to break the curse of dimensionality in high-order Markov processes, we study state-aggregatable high-order Markov processes and establish a key connection to TT-decomposable tensors. We propose a TTOI estimator for the transition probability tensor of high-order state-aggregatable Markov processes and establish its theoretical guarantee. We conduct simulation experiments to demonstrate the performance of TTOI and validate our theoretical findings. We also apply our method to analyze a New York City taxi dataset. Modeling taxi trips as trajectories realized from a citywide Markov chain, we find that Manhattan traffic exhibits high-order Markovian dependence and that the proposed TTOI reveals latent traffic patterns and a meaningful partition of Manhattan traffic zones. Finally, we discuss several further applications of the proposed algorithm, including transition probability tensor estimation in high-order Markov decision processes and joint probability tensor estimation in Markov random fields.

C. Related Literature

In addition to the aforementioned literature on TT decomposition, our work is also related to a substantial body of work on matrix/tensor decomposition and SVD, the spiked tensor model, and related topics. This literature spans a range of communities, including applied mathematics, information theory, machine learning, scientific computing, signal processing, and statistics. Here we review existing work in these communities without claiming that the survey is exhaustive.

First, matrix singular value thresholding has been commonly used and extensively studied in various problems in data science, including matrix denoising [37], [38], [39], matrix completion [40], [41], [42], [43], principal component analysis (PCA) [44], and Markov chain state aggregation [45]. Similar tasks have also been widely considered for tensors of order 3 or higher. In particular, to perform SVD and decomposition for tensors with Tucker low-rank structures, [46], [47] introduced the higher-order SVD (HOSVD) and higher-order orthogonal iteration (HOOI). [48] established the statistical and computational limits of tensor SVD, compared the theoretical properties of HOSVD and HOOI, and proved that HOOI achieves both statistical and computational optimality. [49] introduced the sequentially truncated higher-order singular value decomposition (ST-HOSVD). [50] introduced a thresholding-and-projection-based algorithm for sparse tensor SVD. A non-exhaustive list of methods for SVD and decomposition of tensors with CP low-rank structures includes alternating least squares [51], [52], an eigendecomposition-based approach [53], enhanced line search [54], power iteration with SVD-based initialization [6], and simultaneous diagonalization and higher-order SVD [55].

In addition, the spiked tensor model and tensor principal component analysis (tensor PCA) are widely discussed in the literature. [56], [57], [58], [59], [60], [61] considered the statistical and computational limits of the rank-1 spiked tensor model. [62] studied the statistical and computational phase transitions and the theoretical properties of the approximate message passing (AMP) algorithm under a Bayesian spiked tensor model. [63], [64] developed regularization-based methods for tensor PCA. [65], [66], [67], [68] studied robust tensor PCA to handle possible outliers in the tensor observation.

Different from the Tucker and CP decompositions, which have been a focal point of the enormous existing literature on tensors, we focus on the TT structure of high-order tensors for the following reasons: (1) the Tucker and CP decompositions do not involve the sequential structure of different modes, i.e., they still hold if the $d$ modes are arbitrarily permuted, whereas in applications such as high-order Markov processes, high-order Markov decision processes, and fully connected layers of deep neural networks, the order of the modes can be crucial; (2) the number of entries involved in the low-Tucker-rank parameterization grows exponentially with the order $d$ (i.e., $r^d$); (3) methods that exploit CP low-rank structure can be numerically unstable for high-order tensors, as pointed out by [27]. In comparison, the TT structure incorporates the order of different modes sequentially and involves far fewer parameters for high-order tensors, which makes it more suitable in many scenarios.

In Section V, we further discuss the application of TTOI to high-order Markov processes and state aggregation. This problem is related to a body of literature on dimension reduction and state aggregation for Markov processes, which we also review in Section V.

D. Organization

The rest of the article is organized as follows. In Section II, after a brief introduction of the notation and preliminaries, we introduce the procedure of the tensor-train orthogonal iteration. The theoretical results, including three representation lemmas, a general estimation error bound, and the minimax optimal upper and lower bounds under the spiked tensor model, are provided in Sections III and IV. The application to high-order Markov chains is discussed in Section V. The simulation and real data analysis are provided in Sections VI-A and VI-B, respectively. Discussions and further applications to Markov random fields and high-order Markov decision processes are briefly discussed in Section VII. All technical proofs are provided in Section A.

II. Procedure of Tensor-Train Orthogonal Iteration

A. Notation and Preliminaries

We first introduce the notation and preliminaries used throughout the paper. We use lowercase letters, e.g., $x, y, z$, to denote scalars or vectors. We use $C, c, C_0, c_0, \ldots$ to denote generic constants, whose actual values may change from line to line. A random variable $z$ is $\sigma$-sub-Gaussian if $\mathbb{E} e^{t(z - \mathbb{E}z)} \leq e^{\sigma^2 t^2/2}$ for any $t$. We say $a \lesssim b$ or $a = O(b)$ if $a \leq Cb$ for some uniform constant $C > 0$. We write $a = \tilde{O}(b)$ if $a = O(b \log^{C'}(b))$ for some constant $C' > 0$. Capital letters, e.g., $X, Y, Z$, are used to denote matrices. Specifically, $\mathbb{O}_{p,r} = \{U \in \mathbb{R}^{p \times r} : U^\top U = I_r\}$ is the set of all $p$-by-$r$ matrices with orthonormal columns. For $U \in \mathbb{O}_{p,r}$, let $U_\perp \in \mathbb{O}_{p, p-r}$ be an orthonormal complement of $U$, and let $P_U = U U^\top$ denote the projection matrix onto the column space of $U$. For any matrix $A \in \mathbb{R}^{p_1 \times p_2}$, let $A = \sum_{i=1}^{p_1 \wedge p_2} s_i u_i v_i^\top$ be its singular value decomposition, where $s_1(A) \geq \cdots \geq s_{p_1 \wedge p_2}(A) \geq 0$ are the singular values of $A$ in non-increasing order. Define $s_{\min}(A) = s_{p_1 \wedge p_2}(A)$, $\mathrm{SVD}_r^L(A) = [u_1 \cdots u_r] \in \mathbb{O}_{p_1, r}$, and $\mathrm{SVD}_r^R(A) = [v_1 \cdots v_r] \in \mathbb{O}_{p_2, r}$ as the smallest non-trivial singular value, the leading $r$ left singular vectors, and the leading $r$ right singular vectors of $A$, respectively. We also write $\mathrm{SVD}^L(A) = \mathrm{SVD}^L_{p_1 \wedge p_2}(A)$ and $\mathrm{SVD}^R(A) = \mathrm{SVD}^R_{p_1 \wedge p_2}(A)$ for the collections of all left and right singular vectors of $A$, respectively. Define the Frobenius and spectral norms of $A$ as $\|A\|_F = \sqrt{\sum_{i=1}^{p_1}\sum_{j=1}^{p_2} A_{ij}^2} = \sqrt{\sum_{i=1}^{p_1 \wedge p_2} s_i^2(A)}$ and $\|A\| = s_1(A) = \max_{x \in \mathbb{R}^{p_2}} \|Ax\|_2/\|x\|_2$, respectively. For any two matrices $U \in \mathbb{R}^{m_1 \times n_1}$ and $V \in \mathbb{R}^{m_2 \times n_2}$, let

$$U \otimes V = \begin{bmatrix} U_{11}V & \cdots & U_{1 n_1}V \\ \vdots & \ddots & \vdots \\ U_{m_1 1}V & \cdots & U_{m_1 n_1}V \end{bmatrix} \in \mathbb{R}^{(m_1 m_2) \times (n_1 n_2)}$$

be their Kronecker product. To quantify the distance between subspaces, we define the principal angles between $U, \hat{U} \in \mathbb{O}_{p,r}$ as the $r$-by-$r$ diagonal matrix $\Theta(U, \hat{U}) = \mathrm{diag}(\arccos(s_1), \ldots, \arccos(s_r))$, where $s_1 \geq \cdots \geq s_r \geq 0$ are the singular values of $U^\top \hat{U}$. Define the $\sin\Theta$ norm as

$$\|\sin\Theta(U, \hat{U})\| = \left\|\mathrm{diag}\left(\sin(\arccos(s_1)), \ldots, \sin(\arccos(s_r))\right)\right\| = \sqrt{1 - s_r^2}.$$
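The $\sin\Theta$ distance can be computed directly from the singular values of $U^\top \hat{U}$; the following short NumPy sketch (our own helper, not part of the paper's software) does exactly that.

```python
import numpy as np

def sin_theta(U, U_hat):
    """Spectral sin-Theta distance between the column spaces of U and U_hat.

    Both inputs are assumed to have orthonormal columns (p x r); the distance
    equals sqrt(1 - s_r^2), where s_r is the smallest singular value of U^T U_hat.
    """
    s = np.linalg.svd(U.T @ U_hat, compute_uv=False)
    s_min = np.clip(s[-1], -1.0, 1.0)
    return np.sqrt(1.0 - s_min ** 2)

# Example: two nearby 2-dimensional subspaces of R^10.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((10, 2)))
U_hat, _ = np.linalg.qr(U + 0.1 * rng.standard_normal((10, 2)))
print(sin_theta(U, U_hat))
```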

Boldface calligraphic letters, e.g., $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$, are used to denote tensors. For an order-$d$ tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$ and $1 \leq k \leq d-1$, we define $[\mathcal{X}]_k \in \mathbb{R}^{(p_1 \cdots p_k) \times (p_{k+1} \cdots p_d)}$ as the sequential unfolding of $\mathcal{X}$, whose rows enumerate all indices in Modes $1, \ldots, k$ and whose columns enumerate all indices in Modes $k+1, \ldots, d$, respectively. That is, for any $1 \leq k \leq d$ and $1 \leq i_k \leq p_k$,

$$\left([\mathcal{X}]_k\right)_{\xi_1(i_1,\ldots,i_d;k),\ \xi_2(i_1,\ldots,i_d;k)} = \mathcal{X}_{i_1 \cdots i_d},$$

where $\xi_1(i_1, \ldots, i_d; k) = (i_k - 1)p_1 \cdots p_{k-1} + (i_{k-1} - 1)p_1 \cdots p_{k-2} + \cdots + i_1$ and $\xi_2(i_1, \ldots, i_d; k) = (i_d - 1)p_{k+1} \cdots p_{d-1} + (i_{d-1} - 1)p_{k+1} \cdots p_{d-2} + \cdots + i_{k+1}$. Following the convention of the reshape function in MATLAB, we define the reshape of any matrix $X$ of dimension $p_1 \cdots p_k \times p_{k+1} \cdots p_d$ as the inverse operation of tensor matricization: $\mathcal{X} = \mathrm{Reshape}(X, p_1, p_2, \ldots, p_d)$ if $X = [\mathcal{X}]_k$. For any two matrices $A \in \mathbb{R}^{q_1 \times q_2 q_3}$ and $\tilde{A} \in \mathbb{R}^{q_1 q_2 \times q_3}$, we write $\tilde{A} = \mathrm{Reshape}(A, q_1 q_2, q_3)$ and $A = \mathrm{Reshape}(\tilde{A}, q_1, q_2 q_3)$ if and only if

$$\tilde{A}_{(i_2 - 1)q_1 + i_1,\, i_3} = A_{i_1,\, (i_3 - 1)q_2 + i_2}, \qquad 1 \leq i_j \leq q_j,\ j = 1, 2, 3.$$

We also define the tensor Frobenius norm of $\mathcal{X}$ as $\|\mathcal{X}\|_F^2 = \sum_{i_1=1}^{p_1} \cdots \sum_{i_d=1}^{p_d} \mathcal{X}_{i_1,\ldots,i_d}^2$. For any matrix $A \in \mathbb{R}^{p_1 \times p_2}$ and any tensor $\mathcal{B} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$, let $\mathrm{vec}(A)$ and $\mathrm{vec}(\mathcal{B})$ be the vectorizations of $A$ and $\mathcal{B}$, respectively. Formally, for any $1 \leq k \leq d$ and $1 \leq i_k \leq p_k$,

$$\left(\mathrm{vec}(\mathcal{B})\right)_{(i_d - 1)p_1 \cdots p_{d-1} + (i_{d-1} - 1)p_1 \cdots p_{d-2} + \cdots + i_1} = \mathcal{B}_{i_1,\ldots,i_d}.$$
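As an illustration of the sequential unfolding and the MATLAB-style (column-major) Reshape just defined, the following NumPy sketch (ours; seq_unfold is a hypothetical helper name) implements $[\mathcal{X}]_k$ and checks that the column-major reshape inverts it.

```python
import numpy as np

def seq_unfold(X, k):
    """k-th sequential unfolding [X]_k: rows enumerate modes 1..k and columns
    enumerate modes k+1..d, with earlier modes varying fastest (column-major),
    matching the index maps xi_1 and xi_2 above."""
    p = X.shape
    return np.reshape(X, (int(np.prod(p[:k])), int(np.prod(p[k:]))), order='F')

# The MATLAB-style Reshape in the text is the inverse operation:
# np.reshape(M, p, order='F') recovers the tensor from [X]_k.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 4, 5))
M = seq_unfold(X, 2)                                    # shape (6, 20)
assert np.allclose(np.reshape(M, X.shape, order='F'), X)
print(M.shape)
```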

B. Procedure of Tensor-Train Orthogonal Iteration

We are now in a position to introduce the procedure of the Tensor-Train Orthogonal Iteration (TTOI). The pseudocode of the overall procedure is given in Algorithm 1. TTOI includes three main parts: we first run the initialization, then perform the backward update and forward update alternately and iteratively.


  • Part 1: Initialization. First, we obtain initial estimates of the TT-cores $G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d$. This step is the tensor-train singular value decomposition (TT-SVD) originally introduced by [1].
    1. Let $R_1^{(0)} = [\mathcal{Y}]_1$ be the unfolding of $\mathcal{Y}$ along Mode 1. We compute the top-$r_1$ SVD of $R_1^{(0)}$. Let $\hat{U}_1^{(0)} \in \mathbb{O}_{p_1, r_1}$ be the first $r_1$ left singular vectors of $R_1^{(0)}$ and calculate $\tilde{R}_1^{(0)} = (\hat{U}_1^{(0)})^\top R_1^{(0)} \in \mathbb{R}^{r_1 \times (p_2 \cdots p_d)}$. Then, $\hat{U}_1^{(0)}$ is an initial estimate of the subspace in which $G_1$ lies, and $\tilde{R}_1^{(0)}$ can be seen as the projection residual.
    2. Next, we realign the entries of $\tilde{R}_1^{(0)} \in \mathbb{R}^{r_1 \times (p_2 \cdots p_d)}$ into $R_2^{(0)} \in \mathbb{R}^{(r_1 p_2) \times (p_3 \cdots p_d)}$, where the rows and columns of $R_2^{(0)}$ correspond to indices of Modes 1, 2 and Modes 3, …, d, respectively. Then, we evaluate the top-$r_2$ SVD of $R_2^{(0)}$. Let $\hat{U}_2^{(0)}$ be the first $r_2$ left singular vectors of $R_2^{(0)}$ and evaluate $\tilde{R}_2^{(0)} = (\hat{U}_2^{(0)})^\top R_2^{(0)} \in \mathbb{R}^{r_2 \times (p_3 \cdots p_d)}$. Again, $\hat{U}_2^{(0)}$ is an estimate of the singular subspace in which $\mathcal{G}_2$ lies, and $\tilde{R}_2^{(0)}$ is the projection residual for the next calculation.
    3. We apply Step 2 to $\tilde{R}_2^{(0)}$ to obtain $\hat{U}_3^{(0)} \in \mathbb{O}_{r_2 p_3, r_3}$ and $\tilde{R}_3^{(0)} \in \mathbb{R}^{r_3 \times (p_4 \cdots p_d)}$; …; and apply Step 2 to $\tilde{R}_{d-2}^{(0)}$ to obtain $\hat{U}_{d-1}^{(0)} \in \mathbb{O}_{r_{d-2} p_{d-1}, r_{d-1}}$ and $\tilde{R}_{d-1}^{(0)} \in \mathbb{R}^{r_{d-1} \times p_d}$. Then we reshape the matrix $\hat{U}_k^{(0)} \in \mathbb{R}^{(p_k r_{k-1}) \times r_k}$ into the tensor $\hat{\mathcal{U}}_k^{(0)} \in \mathbb{R}^{r_{k-1} \times p_k \times r_k}$ for $k = 2, \ldots, d-1$. Now, $(\hat{U}_1^{(0)}, \hat{\mathcal{U}}_2^{(0)}, \ldots, \hat{\mathcal{U}}_{d-1}^{(0)}, \tilde{R}_{d-1}^{(0)})$ yields the initial estimates of the TT-cores of $\mathcal{X}$, and we expect that
      $$\mathcal{X} \approx \mathcal{X}^{(0)} = [\![ \hat{U}_1^{(0)}, \hat{\mathcal{U}}_2^{(0)}, \ldots, \hat{\mathcal{U}}_{d-1}^{(0)}, \tilde{R}_{d-1}^{(0)} ]\!].$$
    The initialization step is summarized in Algorithm 1(a) and illustrated in Figure 2 (a minimal NumPy sketch of this initialization is given after Remark II.2 below). In summary, we perform an SVD on a "residual" $R_k^{(0)}$ sequentially for $k = 1, \ldots, d-1$. As will be shown in Lemma III.3, $R_k^{(0)}$ satisfies
    $$R_k^{(0)} = \left(I_{p_k} \otimes \hat{U}_{k-1}^{(0)}\right)^\top \cdots \left(I_{p_2 \cdots p_k} \otimes \hat{U}_1^{(0)}\right)^\top [\mathcal{Y}]_k,$$
    where $[\mathcal{Y}]_k \in \mathbb{R}^{(p_1 \cdots p_k) \times (p_{k+1} \cdots p_d)}$ is the $k$th sequential unfolding of $\mathcal{Y}$ (see the definition in Section II-A). This quantity plays a key role in the backward update described next.

    The initialization step mainly exploits the left singular spaces of $[\mathcal{X}]_k$ while ignoring the information contained in the right singular spaces. For this reason, we develop the following new backward update, which utilizes both the left and right singular space estimates from the previous step to refine the estimates. Similarly, we can perform a forward update to further improve the outcome of the backward update, and then iteratively alternate between backward and forward updates. The two updates are described in detail below, and a further explanation is given in Remark II.1.

  • Part 2: Backward update. For iterations $t = 1, 3, 5, \ldots$, we perform the backward update, i.e., we sequentially obtain $\hat{V}_d^{(t)}, \ldots, \hat{V}_2^{(t)}$ based on the intermediate results of the $(t-1)$st iteration (the 0th iteration is the initialization). The pseudocode of the backward update is provided in Algorithm 1(b). The calculation in Algorithm 1(b) is equivalent to
    $$\hat{V}_d^{(t)} = \mathrm{SVD}^R\left(\tilde{R}_{d-1}^{(t-1)}\right),$$
    $$\hat{V}_k^{(t)} = \mathrm{SVD}^R\left(\tilde{R}_{k-1}^{(t-1)}\left(\hat{V}_d^{(t)} \otimes I_{p_k \cdots p_{d-1}}\right) \cdots \left(\hat{V}_{k+1}^{(t)} \otimes I_{p_k}\right)\right)$$
    for $k = d-1, \ldots, 2$, and
    $$\hat{V}_1^{(t)} = [\mathcal{Y}]_1\left(\hat{V}_d^{(t)} \otimes I_{p_2 \cdots p_{d-1}}\right) \cdots \left(\hat{V}_3^{(t)} \otimes I_{p_2}\right)\hat{V}_2^{(t)} \in \mathbb{R}^{p_1 \times r_1}.$$
    Here,
    $$\tilde{R}_k^{(t-1)} = \left(\hat{U}_k^{(t-1)}\right)^\top\left(I_{p_k} \otimes \hat{U}_{k-1}^{(t-1)}\right)^\top \cdots \left(I_{p_2 \cdots p_k} \otimes \hat{U}_1^{(t-1)}\right)^\top [\mathcal{Y}]_k$$
    is the projection residual from the intermediate outcome of the $(t-1)$st iteration. Then, we reshape $\hat{V}_k^{(t)} \in \mathbb{R}^{(p_k r_k) \times r_{k-1}}$ into $\hat{\mathcal{V}}_k^{(t)} \in \mathbb{R}^{r_{k-1} \times p_k \times r_k}$. The backward updated estimate is
    $$\hat{\mathcal{X}}^{(t)} = [\![ \hat{V}_1^{(t)}, \hat{\mathcal{V}}_2^{(t)}, \ldots, \hat{\mathcal{V}}_{d-1}^{(t)}, \hat{V}_d^{(t)} ]\!].$$
    Remark II.1 (Interpretation of the backward update). The backward update utilizes and extracts the right singular vectors of the intermediate products of the $(t-1)$st iteration,
    $$\tilde{R}_k^{(t-1)} = \left(\hat{U}_k^{(t-1)}\right)^\top\left(I_{p_k} \otimes \hat{U}_{k-1}^{(t-1)}\right)^\top \cdots \left(I_{p_2 \cdots p_k} \otimes \hat{U}_1^{(t-1)}\right)^\top [\mathcal{Y}]_k,$$
    as opposed to the entire data matrix $[\mathcal{Y}]_k$. Such a dimension reduction scheme is the key to the backward update: it simultaneously reduces the dimension of the matrix of interest, $[\mathcal{Y}]_k$, and the noise therein, while preserving the signal strength. Different from the initialization step, the backward update utilizes the information from both the forward and backward singular subspaces of the tensor-train structure of $\mathcal{X}$. See Section III for more illustration.
  • Part 3: Forward update. For iterations $t = 2, 4, 6, \ldots$, we perform the forward update, i.e., we sequentially obtain $\hat{U}_1^{(t)}, \ldots, \hat{U}_d^{(t)}$ based on the intermediate results of the $(t-1)$st iteration. Essentially, the forward update can be seen as a reversal of the backward update obtained by flipping all modes of the tensor $\mathcal{Y}$. The pseudocode of this procedure is collected in Algorithm 1(c). Recall that $[\mathcal{Y}]_1(\hat{V}_d^{(t-1)} \otimes I_{p_2 \cdots p_{d-1}}) \cdots (\hat{V}_3^{(t-1)} \otimes I_{p_2})\hat{V}_2^{(t-1)}$ is the intermediate product from the $(t-1)$st update. We sequentially compute
    $$\hat{U}_1^{(t)} = \mathrm{SVD}^L\left([\mathcal{Y}]_1\left(\hat{V}_d^{(t-1)} \otimes I_{p_2 \cdots p_{d-1}}\right) \cdots \left(\hat{V}_3^{(t-1)} \otimes I_{p_2}\right)\hat{V}_2^{(t-1)}\right);$$
    $$\hat{U}_k^{(t)} = \mathrm{SVD}^L\left(\left(I_{p_k} \otimes \hat{U}_{k-1}^{(t)}\right)^\top \cdots \left(I_{p_2 \cdots p_k} \otimes \hat{U}_1^{(t)}\right)^\top [\mathcal{Y}]_k\left(\hat{V}_d^{(t-1)} \otimes I_{p_{k+1} \cdots p_{d-1}}\right) \cdots \left(\hat{V}_{k+2}^{(t-1)} \otimes I_{p_{k+1}}\right)\hat{V}_{k+1}^{(t-1)}\right)$$
    for $k = 2, \ldots, d-1$, and
    $$\hat{U}_d^{(t)} = \left[\left(\hat{U}_{d-1}^{(t)}\right)^\top\left(I_{p_{d-1}} \otimes \hat{U}_{d-2}^{(t)}\right)^\top \cdots \left(I_{p_{d-1} \cdots p_2} \otimes \hat{U}_1^{(t)}\right)^\top [\mathcal{Y}]_{d-1}\right]^\top \in \mathbb{R}^{p_d \times r_{d-1}}.$$
    Reshape $\hat{U}_k^{(t)} \in \mathbb{R}^{(p_k r_{k-1}) \times r_k}$ into $\hat{\mathcal{U}}_k^{(t)} \in \mathbb{R}^{r_{k-1} \times p_k \times r_k}$ for $k = 2, \ldots, d-1$. Then, compute
    $$\hat{\mathcal{X}}^{(t)} = [\![ \hat{U}_1^{(t)}, \hat{\mathcal{U}}_2^{(t)}, \ldots, \hat{\mathcal{U}}_{d-1}^{(t)}, \hat{U}_d^{(t)} ]\!].$$

    We explain the algebraic schemes in the TTOI procedure through several representation lemmas in Section III-A. We also show in Theorem III.2 that the objective value $\|\mathcal{Y} - \hat{\mathcal{X}}^{(t)}\|_F^2$ is monotonically decreasing in the iteration index $t$. In large-scale scenarios where performing many iterations is beyond the available computing capacity, we can reduce the number of iterations, even to $t_{\max} = 1$, i.e., the one-step iteration, which often yields sufficiently accurate estimates, as we illustrate in both theory and simulation studies. A similar phenomenon has recently been discovered for HOOI in Tucker low-rank tensor decomposition [69].

    Remark II.2 (Computational and storage costs of TTOI). We consider the computational and storage costs of TTOI for a $p$-dimensional, rank-$r$, order-$d$ dense tensor. Since computing the first $r$ singular vectors of an $m \times n$ matrix via the block power method requires $\tilde{O}(mnr)$ operations, the initialization costs $\tilde{O}(p^d r)$ operations, and each iteration of TTOI, including the forward and backward updates, costs $O(p^d r)$ operations. Therefore, the total number of operations of TTOI with $T$ iterations is $\tilde{O}(p^d r) + O(T p^d r)$, which is not significantly more than the number of elements of the target tensor. Moreover, TTOI requires $O(p^d)$ storage, which is not significantly more than the storage cost of the original tensor.
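As referenced in Part 1 above, the following NumPy sketch illustrates the TT-SVD initialization (the forward sweep of Algorithm 1(a)). It is our own minimal implementation for illustration only, not the authors' released code; tt_svd_init is a hypothetical name, and the column-major reshapes follow the conventions of Section II-A.

```python
import numpy as np

def tt_svd_init(Y, ranks):
    """TT-SVD initialization (a sketch of Algorithm 1(a)).

    Y is an order-d NumPy array and ranks = (r_1, ..., r_{d-1}).
    Returns the initial cores (U_1, U_2, ..., U_{d-1}, R_tilde_{d-1}^T).
    """
    p = Y.shape
    d = len(p)
    cores = []
    R = np.reshape(Y, (p[0], -1), order='F')         # R_1^(0): mode-1 unfolding
    r_prev = 1
    for k in range(d - 1):
        U, _, _ = np.linalg.svd(R, full_matrices=False)
        U = U[:, :ranks[k]]                          # top-r_k left singular vectors
        cores.append(U if k == 0 else
                     np.reshape(U, (r_prev, p[k], ranks[k]), order='F'))
        R = U.T @ R                                  # projection residual R_tilde_k
        if k < d - 2:
            # realign: rows now index (r_k, p_{k+1}), columns the remaining modes
            R = np.reshape(R, (ranks[k] * p[k + 1], -1), order='F')
        r_prev = ranks[k]
    cores.append(R.T)                                # last core: p_d x r_{d-1}
    return cores

# Demo: recover the cores of a noisy rank-(1,1) order-3 tensor.
rng = np.random.default_rng(0)
p, d, r = 20, 3, 1
a, b, c = rng.standard_normal((3, p))
X = np.einsum('i,j,k->ijk', a, b, c)                 # TT-rank (1, 1)
Y = X + 0.1 * rng.standard_normal(X.shape)
cores = tt_svd_init(Y, (r,) * (d - 1))
print([core.shape for core in cores])                # [(20, 1), (1, 20, 1), (20, 1)]
```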

Fig. 2.

A Pictorial Illustration of Initialization (Algorithm 1(a), d = 3)

III. Theoretical Analysis

This section is devoted to the theoretical analysis of the proposed procedure. For convenience, we introduce the following two abbreviations for sequential matrix products: for $M_i \in \mathbb{R}^{(p_i r_{i-1}) \times r_i}$, $1 \leq i \leq d-1$, and $B_j \in \mathbb{R}^{(r_j p_j) \times r_{j-1}}$, $2 \leq j \leq d$, we denote

$$M^{(L)}_{\mathrm{prod},k} = \left(I_{p_2 \cdots p_k} \otimes M_1\right) \cdots \left(I_{p_k} \otimes M_{k-1}\right) M_k \in \mathbb{R}^{(p_1 \cdots p_k) \times r_k}, \qquad 1 \leq k \leq d-1,$$
$$B^{(R)}_{\mathrm{prod},k} = \left(B_d \otimes I_{p_k \cdots p_{d-1}}\right) \cdots \left(B_{k+1} \otimes I_{p_k}\right) B_k \in \mathbb{R}^{(p_k \cdots p_d) \times r_{k-1}}, \qquad 2 \leq k \leq d.$$

Equivalently, $M^{(L)}_{\mathrm{prod},k}$ and $B^{(R)}_{\mathrm{prod},k}$ can be defined sequentially as

$$M^{(L)}_{\mathrm{prod},1} = M_1, \qquad M^{(L)}_{\mathrm{prod},k+1} = \left(I_{p_{k+1}} \otimes M^{(L)}_{\mathrm{prod},k}\right) M_{k+1}, \qquad 1 \leq k \leq d-2,$$
$$B^{(R)}_{\mathrm{prod},d} = B_d, \qquad B^{(R)}_{\mathrm{prod},k} = \left(B^{(R)}_{\mathrm{prod},k+1} \otimes I_{p_k}\right) B_k, \qquad 2 \leq k \leq d-1.$$
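These recursions translate directly into code. The sketch below (our own illustration, with assumed helper names forward_prods and backward_prods) builds all $M^{(L)}_{\mathrm{prod},k}$ and $B^{(R)}_{\mathrm{prod},k}$ via np.kron and checks that the dimensions agree with the definitions above.

```python
import numpy as np

def forward_prods(Ms, p):
    """All M^(L)_prod,k, k = 1, ..., d-1, from M_k of shape (p_k r_{k-1}, r_k),
    via M_prod,1 = M_1 and M_prod,k+1 = (I_{p_{k+1}} kron M_prod,k) M_{k+1}."""
    prods = [Ms[0]]
    for k in range(1, len(Ms)):                       # Ms[k] plays the role of M_{k+1}
        prods.append(np.kron(np.eye(p[k]), prods[-1]) @ Ms[k])
    return prods

def backward_prods(Bs, p):
    """All B^(R)_prod,k, k = 2, ..., d, from B_k of shape (r_k p_k, r_{k-1}),
    via B_prod,d = B_d and B_prod,k = (B_prod,k+1 kron I_{p_k}) B_k."""
    prods = [Bs[-1]]                                  # Bs[-1] plays the role of B_d
    for j in range(len(Bs) - 2, -1, -1):              # Bs[j] plays the role of B_{j+2}
        prods.insert(0, np.kron(prods[0], np.eye(p[j + 1])) @ Bs[j])
    return prods

# Dimension check with d = 4, p_k = 3, and TT-ranks (2, 2, 2).
rng = np.random.default_rng(0)
p, r = [3, 3, 3, 3], [1, 2, 2, 2, 1]                  # r[0] = r_0 = 1, r[4] = r_4 = 1
Ms = [rng.standard_normal((p[i - 1] * r[i - 1], r[i])) for i in range(1, 4)]
Bs = [rng.standard_normal((r[k] * p[k - 1], r[k - 1])) for k in range(2, 5)]
print([M.shape for M in forward_prods(Ms, p)])        # [(3, 2), (9, 2), (27, 2)]
print([B.shape for B in backward_prods(Bs, p)])       # [(27, 2), (9, 2), (3, 2)]
```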

A. Representation Lemmas for High-order Tensors

Since the computation of high-order tensors with tensor-train structures involves extensive tensor algebra, we introduce the following three lemmas on the matrix representation of high-order tensors. These lemmas play a fundamental role in the later theoretical analysis.

Lemma III.1 (Representation of the sequential matricization of a TT-decomposable tensor). Suppose $\mathcal{X} = [\![ G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d ]\!]$. Then the sequential matricizations of $\mathcal{X}$ can be written as

$$[\mathcal{X}]_k = \left(I_{p_2 \cdots p_k} \otimes G_1\right)\left(I_{p_3 \cdots p_k} \otimes [\mathcal{G}_2]_2\right) \cdots \left(I_{p_k} \otimes [\mathcal{G}_{k-1}]_2\right)[\mathcal{G}_k]_2\, [\mathcal{G}_{k+1}]_1\left([\mathcal{G}_{k+2}]_1 \otimes I_{p_{k+1}}\right) \cdots \left([\mathcal{G}_{d-1}]_1 \otimes I_{p_{k+1} \cdots p_{d-2}}\right)\left(G_d^\top \otimes I_{p_{k+1} \cdots p_{d-1}}\right). \tag{4}$$

Lemma III.2 (Representation of tensor reshaping). For any tensor $\mathcal{T} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$ and $1 \leq i < j \leq d-1$, we have

$$[\mathcal{T}]_j = \left(I_{p_{i+1} \cdots p_j} \otimes [\mathcal{T}]_i\right) A_{(p_{i+1} \cdots p_j,\ p_{j+1} \cdots p_d)},$$
$$[\mathcal{T}]_i = A_{(p_{i+1} \cdots p_j,\ p_1 \cdots p_i)}^\top \left([\mathcal{T}]_j \otimes I_{p_{i+1} \cdots p_j}\right).$$

Here, we let $e_k^{(ij)}$ denote the $k$th canonical basis vector of $\mathbb{R}^{ij}$ and define

$$A_{(i,j)} = \begin{bmatrix} e_1^{(ij)} & e_{i+1}^{(ij)} & \cdots & e_{i(j-1)+1}^{(ij)} \\ e_2^{(ij)} & e_{i+2}^{(ij)} & \cdots & e_{i(j-1)+2}^{(ij)} \\ \vdots & \vdots & & \vdots \\ e_i^{(ij)} & e_{2i}^{(ij)} & \cdots & e_{ij}^{(ij)} \end{bmatrix} \in \mathbb{R}^{(i^2 j) \times j}. \tag{5}$$
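For concreteness, the sketch below constructs the selection matrix $A_{(i,j)}$ as we read it from (5) and numerically checks the first identity of Lemma III.2 on a random order-4 tensor. This is our own verification code (A_mat is a hypothetical name), not part of the paper's software, and it assumes the column-major sequential unfolding of Section II-A.

```python
import numpy as np

def A_mat(i, j):
    """Selection matrix A_(i,j) in (5): an i x j block layout whose (m, n)
    block is the canonical basis vector e_{(n-1)i+m} of R^{ij}, giving overall
    shape (i^2 * j, j)."""
    E = np.eye(i * j)
    blocks = [[E[:, n * i + m].reshape(-1, 1) for n in range(j)]
              for m in range(i)]
    return np.block(blocks)

# Check [T]_3 = (I_{p2 p3} kron [T]_1) A_(p2 p3, p4) for a random order-4
# tensor (i = 1, j = 3 in the notation of Lemma III.2).
rng = np.random.default_rng(0)
p = (2, 3, 2, 4)
T = rng.standard_normal(p)
unfold = lambda X, k: np.reshape(X, (int(np.prod(p[:k])), -1), order='F')
lhs = unfold(T, 3)
rhs = np.kron(np.eye(p[1] * p[2]), unfold(T, 1)) @ A_mat(p[1] * p[2], p[3])
print(np.allclose(lhs, rhs))   # should print True
```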

Lemmas III.1 and III.2 can be proved by checking each entry of the corresponding matricizations. In addition, the following lemma provides a representation of sequentially reshaped tensors, in particular of $R_k^{(t)}$ and $\tilde{R}_k^{(t)}$, the key intermediate outcomes of the TTOI procedure.

Lemma III.3 (Representation of sequentially reshaped tensors). Suppose $\mathcal{T} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$, $M_i \in \mathbb{R}^{(r_{i-1} p_i) \times r_i}$ for $1 \leq i \leq d-1$, and $B_i \in \mathbb{R}^{(p_i r_i) \times r_{i-1}}$ for $2 \leq i \leq d$, where $r_0 = r_d = 1$. Consider the following sequential multiplications.

Forward sequential multiplication: Let $S_1 = [\mathcal{T}]_1$. For $k = 1, \ldots, d-1$, calculate

$$\tilde{S}_k = M_k^\top S_k \in \mathbb{R}^{r_k \times (p_{k+1} \cdots p_d)},$$
$$S_{k+1} = \mathrm{Reshape}\left(\tilde{S}_k, r_k p_{k+1}, p_{k+2} \cdots p_d\right) \quad \text{if } k < d-1.$$

Then for any $1 \leq k \leq d-1$,

$$S_k = \left(I_{p_k} \otimes M^{(L)}_{\mathrm{prod},k-1}\right)^\top [\mathcal{T}]_k, \qquad \tilde{S}_k = \left(M^{(L)}_{\mathrm{prod},k}\right)^\top [\mathcal{T}]_k. \tag{6}$$

Here, $I_{p_k} \otimes M^{(L)}_{\mathrm{prod},k-1} = I_{p_1}$ if $k = 1$.

Backward sequential multiplication: Let $W_{d-1} = [\mathcal{T}]_{d-1}$. For $k = d-1, \ldots, 1$, calculate

$$\tilde{W}_k = W_k B_{k+1} \in \mathbb{R}^{(p_1 \cdots p_k) \times r_k},$$
$$W_{k-1} = \mathrm{Reshape}\left(\tilde{W}_k, p_1 \cdots p_{k-1}, p_k r_k\right) \quad \text{if } k > 1.$$

Then for any $1 \leq k \leq d-1$,

$$W_k = [\mathcal{T}]_k\left(B^{(R)}_{\mathrm{prod},k+2} \otimes I_{p_{k+1}}\right), \qquad \tilde{W}_k = [\mathcal{T}]_k B^{(R)}_{\mathrm{prod},k+1}.$$

Here, $B^{(R)}_{\mathrm{prod},k+2} \otimes I_{p_{k+1}} = I_{p_d}$ if $k = d-1$.

In particular, $R_k^{(0)}, \tilde{R}_k^{(0)}$ in Algorithm 1(a) and $R_k^{(t)}, \tilde{R}_k^{(t)}$ ($t \in \{2, 4, 6, \ldots\}$) in Algorithm 1(c) satisfy

$$R_k^{(t)} = \left(I_{p_k} \otimes (\hat{U}^{(t)})^{(L)}_{\mathrm{prod},k-1}\right)^\top [\mathcal{Y}]_k, \qquad \tilde{R}_k^{(t)} = \left((\hat{U}^{(t)})^{(L)}_{\mathrm{prod},k}\right)^\top [\mathcal{Y}]_k, \qquad 1 \leq k \leq d-1. \tag{7}$$

The proof of Lemma III.3 is provided in Section A-H.

B. Deterministic Upper Bounds for Estimation Error of TTOI

Now we are in a position to analyze the performance of TTOI. The following Theorem III.1 provides upper bounds on the estimation error of $\hat{\mathcal{X}}^{(2t+1)}$ (backward update) and $\hat{\mathcal{X}}^{(2t+2)}$ (forward update).

Theorem III.1. Suppose we observe $\mathcal{Y} = \mathcal{X} + \mathcal{Z}$, where $\mathcal{X}$ admits a TT decomposition as in (1).

(A deterministic estimation error bound for backward updates) Let $\tilde{U}_1^{(2t)} = U_1 \in \mathbb{R}^{p_1 \times r_1}$ be the left singular space of $[\mathcal{X}]_1$. For $2 \leq k \leq d-1$, define $\tilde{U}_k^{(2t)} \in \mathbb{R}^{p_k r_{k-1} \times r_k}$ as the left singular subspace of $\left(I_{p_k} \otimes (\hat{U}^{(2t)})^{(L)}_{\mathrm{prod},k-1}\right)^\top [\mathcal{X}]_k$. If for some constant $c_0 \in (0, 1)$,

$$\left\|\sin\Theta\left(\hat{U}_k^{(2t)}, \tilde{U}_k^{(2t)}\right)\right\| \leq c_0, \qquad 1 \leq k \leq d-1, \tag{8}$$

then there exists a constant $C_d > 0$ depending only on $d$ such that the outcome of Algorithm 1(b) satisfies

$$\left\|\hat{\mathcal{X}}^{(2t+1)} - \mathcal{X}\right\|_F^2 \leq C_d\left(\sum_{k=1}^{d-1} A_k^{(2t+1)} + B^{(2t+1)}\right), \tag{9}$$

where

$$A_k^{(2t+1)} = \left\|\left((\hat{U}^{(2t)})^{(L)}_{\mathrm{prod},k}\right)^\top [\mathcal{Z}]_k\left((\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},k+2} \otimes I_{p_{k+1}}\right)\right\|_F^2,$$
$$B^{(2t+1)} = \left\|[\mathcal{Z}]_1 (\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},2}\right\|_F^2.$$

Here, $(\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},k+2} \otimes I_{p_{k+1}} = I_{p_d}$ if $k = d-1$.

(A deterministic estimation error bound for forward updates) For $2 \leq k \leq d-1$, let $\tilde{V}_k^{(2t+1)} \in \mathbb{R}^{(p_k r_k) \times r_{k-1}}$ be the right singular space of $[\mathcal{X}]_{k-1}\left((\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},k+1} \otimes I_{p_k}\right)$, and let $\tilde{V}_d^{(2t+1)} = V_d \in \mathbb{R}^{p_d \times r_{d-1}}$ be the right singular space of $[\mathcal{X}]_{d-1}$. If for some constant $c_0 \in (0, 1)$,

$$\left\|\sin\Theta\left(\hat{V}_k^{(2t+1)}, \tilde{V}_k^{(2t+1)}\right)\right\| \leq c_0, \qquad 2 \leq k \leq d,$$

then there exists a constant $C_d > 0$ depending only on $d$ such that the outcome of Algorithm 1(c) satisfies

$$\left\|\hat{\mathcal{X}}^{(2t+2)} - \mathcal{X}\right\|_F^2 \leq C_d\left(\sum_{k=1}^{d-1} A_k^{(2t+2)} + B^{(2t+2)}\right), \tag{10}$$

where

$$A_k^{(2t+2)} = \left\|\left(I_{p_k} \otimes (\hat{U}^{(2t+2)})^{(L)}_{\mathrm{prod},k-1}\right)^\top [\mathcal{Z}]_k (\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},k+1}\right\|_F^2,$$
$$B^{(2t+2)} = \left\|\left((\hat{U}^{(2t+2)})^{(L)}_{\mathrm{prod},d-1}\right)^\top [\mathcal{Z}]_{d-1}\right\|_F^2.$$

Here, $I_{p_k} \otimes (\hat{U}^{(2t+2)})^{(L)}_{\mathrm{prod},k-1} = I_{p_1}$ if $k = 1$.

The proof of Theorem III.1 is provided in Section A-A. Theorem III.1 shows that the estimation error $\|\hat{\mathcal{X}}^{(t+1)} - \mathcal{X}\|_F^2$ can be bounded by the projected noise terms $A_k^{(t+1)}$ and $B^{(t+1)}$, provided that the estimates from the initialization ($t = 0$) or from the previous iteration ($t \geq 1$), $\{\hat{U}_k^{(t)}\}_{k=1}^{d-1}$ or $\{\hat{V}_k^{(t)}\}_{k=2}^{d}$, are within a constant distance of the true underlying subspaces. The developed upper bound can be significantly smaller than $C\|\mathcal{Z}\|_F^2$, the classic upper bound induced from the approximation error (e.g., Theorem 2.2 in [1]), especially in the high-dimensional setting ($p \gg r$).

Remark III.1 (Interpretation of the error bounds in Theorem III.1). Here, we provide some explanation for $A_k^{(2t+1)}$ and $B^{(2t+1)}$ in the error bound (9). By algebraic calculation, the TT-core estimates in the backward update can be written as

$$\hat{V}_{k+1}^{(2t+1)} = \mathrm{SVD}^R\left\{\left((\hat{U}^{(2t)})^{(L)}_{\mathrm{prod},k}\right)^\top\left([\mathcal{X}]_k + [\mathcal{Z}]_k\right)\left((\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},k+2} \otimes I_{p_{k+1}}\right)\right\}$$

for any $1 \leq k \leq d-1$, and

$$\hat{V}_1^{(2t+1)} = \left([\mathcal{X}]_1 + [\mathcal{Z}]_1\right)(\hat{V}^{(2t+1)})^{(R)}_{\mathrm{prod},2}.$$

From the definition of $A_k^{(2t+1)}$, we can see that $A_k^{(2t+1)}$ quantifies the error of the singular subspace estimate $\hat{V}_{k+1}^{(2t+1)}$, while $B^{(2t+1)}$ quantifies the error of the projected residual $\hat{V}_1^{(2t+1)}$. By symmetry, a similar interpretation applies to $A_k^{(2t+2)}$ and $B^{(2t+2)}$ in the error bound (10) for the forward update.

Remark III.2 (Proof sketch of Theorem III.1). While the complete proof of Theorem III.1 is provided in Section A-A, we give a brief sketch here.

Without loss of generality, we focus on (9) with $t = 0$; the other cases follow similarly. For convenience, we write $\hat{U}_i, \hat{V}_i$ for $\hat{U}_i^{(0)}, \hat{V}_i^{(1)}$, respectively. First, by Lemma III.1, we can write $[\hat{\mathcal{X}}^{(1)}]_1$, the outcome of the backward update, as

$$[\hat{\mathcal{X}}^{(1)}]_1 = [\mathcal{Y}]_1 P_{(\hat{V}_d \otimes I_{p_2 \cdots p_{d-1}}) \cdots (\hat{V}_3 \otimes I_{p_2})\hat{V}_2}.$$

Then we can further bound the estimation error of $\hat{\mathcal{X}}^{(1)}$ as

$$\left\|\hat{\mathcal{X}}^{(1)} - \mathcal{X}\right\|_F^2 \leq C\left\|[\mathcal{Z}]_1(\hat{V}_d \otimes I_{p_2 \cdots p_{d-1}}) \cdots (\hat{V}_3 \otimes I_{p_2})\hat{V}_2\right\|_F^2 + C_d\sum_{k=2}^{d}\left\|[\mathcal{X}]_1(\hat{V}_d \otimes I_{p_2 \cdots p_{d-1}}) \cdots (\hat{V}_{k+1} \otimes I_{p_2 \cdots p_k})(\hat{V}_{k,\perp} \otimes I_{p_2 \cdots p_{k-1}})\right\|_F^2.$$

Next, based on Lemma III.2 and (8), we can prove

$$\left\|[\mathcal{X}]_1(\hat{V}_d \otimes I_{p_2 \cdots p_{d-1}}) \cdots (\hat{V}_{k+1} \otimes I_{p_2 \cdots p_k})(\hat{V}_{k,\perp} \otimes I_{p_2 \cdots p_{k-1}})\right\|_F = \left\|[\mathcal{X}]_{k-1}(\hat{V}_d \otimes I_{p_k \cdots p_{d-1}}) \cdots (\hat{V}_{k+1} \otimes I_{p_k})\hat{V}_{k,\perp}\right\|_F \leq C_d\left\|\hat{U}_{k-1}^\top\left(I_{p_{k-1}} \otimes \hat{U}_{k-2}\right)^\top \cdots \left(I_{p_2 \cdots p_{k-1}} \otimes \hat{U}_1\right)^\top[\mathcal{X}]_{k-1}(\hat{V}_d \otimes I_{p_k \cdots p_{d-1}}) \cdots (\hat{V}_{k+1} \otimes I_{p_k})\hat{V}_{k,\perp}\right\|_F.$$

Finally, we apply the perturbation projection error bound (Lemma A.3) to prove that

$$C_d\left\|\hat{U}_{k-1}^\top\left(I_{p_{k-1}} \otimes \hat{U}_{k-2}\right)^\top \cdots \left(I_{p_2 \cdots p_{k-1}} \otimes \hat{U}_1\right)^\top[\mathcal{X}]_{k-1}(\hat{V}_d \otimes I_{p_k \cdots p_{d-1}}) \cdots (\hat{V}_{k+1} \otimes I_{p_k})\hat{V}_{k,\perp}\right\|_F \leq C_d\left\|\hat{U}_{k-1}^\top\left(I_{p_{k-1}} \otimes \hat{U}_{k-2}\right)^\top \cdots \left(I_{p_2 \cdots p_{k-1}} \otimes \hat{U}_1\right)^\top[\mathcal{Z}]_{k-1}(\hat{V}_d \otimes I_{p_k \cdots p_{d-1}}) \cdots (\hat{V}_{k+1} \otimes I_{p_k})\right\|_F.$$

Theorem III.1 then follows by combining the inequalities above.

Next, we establish a decomposition formula for the approximation error, i.e., the objective function in (3), $\|\mathcal{Y} - \hat{\mathcal{X}}^{(t)}\|_F^2$, and show that the approximation error is monotonically decreasing through the TTOI iterations.

Theorem III.2 (Approximation error decays through iterations). Apply TTOI to $\mathcal{Y}$ and let $\hat{\mathcal{X}}^{(t)}$ be the outcome after the $t$th iteration. For any $t \geq 1$, we have

$$\text{(Approximation error decay)} \qquad \|\mathcal{Y}\|_F^2 - \|\hat{\mathcal{X}}^{(t+1)}\|_F^2 \leq \|\mathcal{Y}\|_F^2 - \|\hat{\mathcal{X}}^{(t)}\|_F^2, \tag{11}$$
$$\text{(Approximation error decomposition)} \qquad \|\mathcal{Y} - \hat{\mathcal{X}}^{(t+1)}\|_F^2 = \|\mathcal{Y}\|_F^2 - \|\hat{\mathcal{X}}^{(t+1)}\|_F^2. \tag{12}$$

See Section A-B for the proof of Theorem III.2.

IV. TTOI for Tensor-Train Spiked Tensor Model

In this section, we further focus on a probabilistic setting, spiked tensor model, where the noise tensor Z has independent, mean zero, and σ-sub-Gaussian entries (see definition in Section II-A). The spiked tensor model has been widely studied as a benchmark setting for tensor PCA/SVD and dimension reduction in recent literature in machine learning, information theory, statistics, and data science [62], [61], [60], [70], [48]. The central goal therein is to discover the underlying low-rank tensor X. Most of the existing works focused on tensors with Tucker or CP decomposition.

Under the spiked tensor model, we can verify that the initialization step of TTOI gives sufficiently good initial estimates with high probability, matching the condition required in Theorem III.1.

Theorem IV.1 (Probabilistic bounds for initial estimates and projected noise). Suppose $\mathcal{X}$ is TT-decomposable as in (1) and the entries of $\mathcal{Z}$ are independent, zero-mean, and $\sigma$-sub-Gaussian random variables. Denote $p = \min\{p_1, \ldots, p_d\}$. If there exists a constant $C_{gap}$ such that $\lambda_k = s_{r_k}([\mathcal{X}]_k) \geq C_{gap}\left(\left(\sum_{i=1}^{d} p_i r_{i-1} r_i\right)^{1/2} + (p_{k+1} \cdots p_d)^{1/2}\right)\sigma$ for $1 \leq k \leq d-1$, then there exist constants $C, c > 0$ and $C_d > 0$ depending only on $d$ such that, with probability at least $1 - C\exp(-cp)$,

$$\max_{k=1,\ldots,d-1}\left\|\sin\Theta\left(\hat{U}_k^{(0)}, \tilde{U}_k^{(0)}\right)\right\| \leq \frac{1}{2}, \tag{13}$$
$$\max_{\substack{k=1,\ldots,d-1 \\ t=2,4,6,\ldots}}\left\|\sin\Theta\left(\hat{U}_k^{(t)}, \tilde{U}_k^{(t)}\right)\right\| \leq \frac{1}{2}, \qquad \max_{\substack{k=2,\ldots,d \\ t=1,3,5,\ldots}}\left\|\sin\Theta\left(\hat{V}_k^{(t)}, \tilde{V}_k^{(t)}\right)\right\| \leq \frac{1}{2}, \tag{14}$$

and for all $t \geq 1$,

$$\max\left\{A_k^{(t)}, B^{(t)}\right\} \leq C_d\sigma^2\sum_{i=1}^{d} p_i r_i r_{i-1}. \tag{15}$$

Here, $\tilde{U}_k^{(t)}$, $\tilde{V}_k^{(t)}$, $A_k^{(t)}$, and $B^{(t)}$ are defined as in Theorem III.1.

The proof of Theorem IV.1 is provided in Section A-C. Based on Theorems III.1 and IV.1, we can further prove:

Corollary IV.1 (Upper bound on estimation error). Suppose $\mathcal{X}$ can be decomposed as in (1), the entries $\mathcal{Z}_{i_1,\ldots,i_d}$ are independent, zero-mean, and $\sigma$-sub-Gaussian random variables, and $p = \min\{p_1, \ldots, p_d\}$. Suppose there exists a constant $C_{gap}$ such that $\lambda_k = s_{r_k}([\mathcal{X}]_k) \geq C_{gap}\left(\left(\sum_{i=1}^{d} p_i r_{i-1} r_i\right)^{1/2} + (p_{k+1} \cdots p_d)^{1/2}\right)\sigma$ for $1 \leq k \leq d-1$. Then with probability at least $1 - Ce^{-cp}$, for all $t \geq 1$,

$$\left\|\hat{\mathcal{X}}^{(t)} - \mathcal{X}\right\|_F^2 \leq C_d\sigma^2\sum_{i=1}^{d} p_i r_i r_{i-1}. \tag{16}$$

The proof of Corollary IV.1 is provided in Section A-D.

Remark IV.1 (Interpretation of Corollary IV.1). Since the TT-cores $G_1$, $\mathcal{G}_i$ ($2 \leq i \leq d-1$), and $G_d$ have $p_1 r_1$, $p_i r_i r_{i-1}$, and $p_d r_{d-1}$ free parameters, respectively, the upper bound (16) can be seen as the noise level $\sigma^2$ times the degrees of freedom of the low-TT-rank tensor.
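For instance, in simulation setting (4) of Section VI-A with $p_1 = \cdots = p_5 = 20$ and $r_1 = \cdots = r_4 = 2$, the degrees of freedom amount to $20\cdot 2 + 3\cdot(20\cdot 2\cdot 2) + 20\cdot 2 = 320$, whereas the full tensor has $20^5 = 3{,}200{,}000$ entries; the rate in (16) is thus smaller than the trivial bound $\sigma^2 p^d$ obtained from $\hat{\mathcal{X}} = \mathcal{Y}$ by four orders of magnitude.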

Next, we develop a minimax lower bound for the estimation of the low-TT-rank structure. Consider the following general class of tensors with dimension $\boldsymbol{p} = (p_1, \ldots, p_d)$ and TT-rank $\boldsymbol{r} = (r_1, \ldots, r_{d-1})$,

$$\mathcal{F}_{\boldsymbol{p},\boldsymbol{r}}(\lambda) = \left\{\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_d} : \mathcal{X} \text{ can be decomposed as (1)},\ s_{r_k}([\mathcal{X}]_k) \geq \lambda_k,\ 1 \leq k \leq d-1\right\}, \tag{17}$$

and a class of distributions of $\sigma$-sub-Gaussian noise tensors

$$\mathcal{D} = \left\{D : \text{if } \mathcal{Z} \sim D, \text{ then } \mathcal{Z}_{i_1,\ldots,i_d} \text{ are independent, zero-mean, and } \sigma\text{-sub-Gaussian random variables}\right\}. \tag{18}$$

Here, the constraint on the smallest singular value of $[\mathcal{X}]_k$ and the $\sigma$-sub-Gaussian assumption correspond to the conditions required for the upper bound in Theorem IV.1.

Theorem IV.2 (Lower bound). Consider the order-$d$ TT spiked tensor model (2) and the distribution class $\mathcal{D}$ in (18). Assume $p = \min\{p_1, \ldots, p_d\} \geq C_0$ for some large constant $C_0$, $r_1 \leq p_1/2$, $r_i \leq p_i r_{i-1}/2$ and $r_{i-1} \leq p_i r_i/2$ for $2 \leq i \leq d-1$, $r_{d-1} \leq p_d$, and $\lambda_i > 0$. Also assume $r_1 r_2 \leq p_1$ if $d = 3$. Then there exists a constant $c_d > 0$ depending only on $d$ such that

$$\inf_{\hat{\mathcal{X}}}\ \sup_{\mathcal{X} \in \mathcal{F}_{\boldsymbol{p},\boldsymbol{r}}(\lambda),\, D \in \mathcal{D}} \mathbb{E}_{\mathcal{Z} \sim D}\left\|\hat{\mathcal{X}} - \mathcal{X}\right\|_F^2 \geq c_d\sigma^2\sum_{i=1}^{d} p_i r_i r_{i-1}. \tag{19}$$

See Section A-E for the proof of Theorem IV.2.

V. TTOI for Dimension Reduction and State Aggregation in High-order Markov Chain

Since its introduction at the beginning of the 20th century, the Markov process has been ubiquitous in a variety of disciplines. In the literature, the first-order Markov process, in which the future observation at time (t + 1) is conditionally independent of those at times 1, …, (t − 1) given the immediate past observation at time t, has been commonly used and extensively studied. However, high-order Markov processes often appear in scenarios where the future observation is affected by a longer history. For example, in a taxi travel trajectory, the future stop of a taxi depends not only on the current location but also on the past path, which reveals the direction the taxi is heading [71]. High-order Markov processes have also been applied to interpersonal relationships [72], financial econometrics [73], and traffic flow [74], among many other applications.

We specifically consider an ergodic, time-invariant, $(d-1)$st order Markov process on a finite state space $\{1, \ldots, p\}$. That is, the future state $X_{t+d}$ depends jointly on the current state $X_{t+d-1}$ and the previous $(d-2)$ states $(X_{t+d-2}, \ldots, X_{t+1})$:

$$\mathbb{P}\left(X_{t+d} \mid X_1, \ldots, X_{t+d-1}\right) = \mathbb{P}\left(X_{t+d} \mid X_{t+1}, \ldots, X_{t+d-1}\right) = \mathcal{P}_{[X_{t+1}, \ldots, X_{t+d}]}. \tag{20}$$

Our goal is to obtain a reliable estimate of the transition tensor $\mathcal{P}$ and to predict the future state $X_{t+d}$ based on an observable trajectory. Since the total number of free parameters in a $(d-1)$st order Markov transition tensor $\mathcal{P}$ is $O(p^d)$ without further assumptions, inferring $\mathcal{P}$ may be prohibitively difficult, both statistically and computationally, even when $p$ and $d$ are only of moderate size. Instead, a sufficient dimension reduction for high-order Markov processes is in demand.

To enable statistical inference and dimension reduction for high-order Markov processes, a powerful tool, the mixture transition distribution (MTD) model, was introduced [72]. The MTD model assumes that the distribution of the future state is a linear combination of the distributions associated with the $(d-1)$ immediate past states. The reader is also referred to [75] for a survey of the MTD model. The linearity assumption, however, does not take into account potential interactions among past states, which commonly appear in practice. For example, in the New York taxi trip data, the interaction among past locations of a taxi indicates its likely future direction.

On the other hand, there has been a recent surge of developments in dimension reduction and state aggregation for first-order Markov chains. For example, [76] considered Markov chain aggregation and its application to biology; [77] considered the rank-reduced Markov model and mode clustering; [45] considered the Markov rank, aggregability, and lumpability of Markov processes and proposed dimension reduction and state aggregation methods through spectral decomposition with theoretical guarantees; [78] proposed a clustering block model and an efficient algorithm to solve it; [79] introduced convex and non-convex methods to estimate a rank-reduced low-rank Markov transition matrix.

Inspired by these works, we propose and study the following state aggregation model for discrete-time high-order Markov processes.

Definition V.1 ($(d-1)$st order state aggregatable Markov process). Suppose there exist maps $G_1 : [p] \to \mathbb{R}^{r_1}$, $G_k : [p] \times \mathbb{R}^{r_{k-1}} \to \mathbb{R}^{r_k}$ for $k = 2, \ldots, d-1$, and $G_d : [p] \times \mathbb{R}^{r_{d-1}} \to \mathbb{R}$ such that $G_2, \ldots, G_d$ are linear in their second argument: $G_k(X, \lambda_1 u + \lambda_2 v) = \lambda_1 G_k(X, u) + \lambda_2 G_k(X, v)$ for any vectors $u, v$ and scalars $\lambda_1, \lambda_2$. We say a Markov process $\{X_1, X_2, \ldots\}$ is $(d-1)$st order state aggregatable if for all $t \geq 0$ the transition can be generated sequentially as follows:

$$\tilde{P}_1(X_{t+1}) = G_1(X_{t+1}) \in \mathbb{R}^{r_1},$$
$$\tilde{P}_k(X_{t+1}, \ldots, X_{t+k}) = G_k\left(X_{t+k}, \tilde{P}_{k-1}(X_{t+1}, \ldots, X_{t+k-1})\right) \in \mathbb{R}^{r_k}, \qquad k = 2, \ldots, d-1,$$
$$\mathbb{P}\left(X_{t+d} \mid X_1, \ldots, X_{t+d-1}\right) = \mathbb{P}\left(X_{t+d} \mid X_{t+1}, \ldots, X_{t+d-1}\right) = G_d\left(X_{t+d}, \tilde{P}_{d-1}(X_{t+1}, \ldots, X_{t+d-1})\right).$$

In a $(d-1)$st order state aggregatable Markov process, the future state $X_{t+d}$ relies on a sequential aggregation of the previous $(d-1)$ states $X_{t+1}, \ldots, X_{t+d-1}$: we first map $X_{t+1}$ to an $r_1$-dimensional vector $\tilde{P}_1(X_{t+1})$ via $G_1$, then map $\tilde{P}_1(X_{t+1})$ jointly with $X_{t+2}$ to an $r_2$-dimensional vector $\tilde{P}_2(X_{t+1}, X_{t+2})$ via $G_2$. We repeat this projection sequentially for $X_{t+3}, \ldots, X_{t+d}$ and obtain the transition probability $\mathbb{P}(X_{t+d} \mid X_{t+1}, \ldots, X_{t+d-1})$. See also Figure 4 for a pictorial illustration.

Fig. 4.

A pictorial illustration of a (d − 1)st order state aggregatable Markov chain

Based on the definition of a state aggregatable Markov chain, we can show that the corresponding transition probability tensor $\mathcal{P}$ has low TT-rank.

Proposition V.1. The transition tensor $\mathcal{P}$ of the rank-reduced high-order Markov model in Definition V.1 has TT-rank no more than $(r_1, \ldots, r_{d-1})$. In other words, $\mathcal{P}$ satisfies $\mathrm{rank}([\mathcal{P}]_k) \leq r_k$.

The proof of Proposition V.1 is provided in Section A-F.

Next, we focus on a synchronous or generative setting for the high-order Markov process, which can be seen as a high-order generalization of the classic observation model for the analysis of Markov (decision/reward) processes (see [80] for an introduction). Specifically, for each sample index $k = 1, \ldots, n$ and each tuple of previous states $(i_1, \ldots, i_{d-1}) \in [p]^{d-1}$, suppose we observe the next state $X(i_1, \ldots, i_{d-1}; k)$ drawn from the Markov transition tensor $\mathcal{P}$. It is natural to estimate $\mathcal{P}$ via the empirical transition tensor: for $(i_1, \ldots, i_d) \in [p]^d$,

$$\hat{\mathcal{P}}^{emp}_{i_1,\ldots,i_d} = \frac{1}{n}\sum_{k=1}^{n} 1\left\{X(i_1, \ldots, i_{d-1}; k) = i_d\right\}.$$

Then, $\hat{\mathcal{P}}^{emp}$ is an unbiased estimator of $\mathcal{P}$. However, if the entries of $\mathcal{P}$ are approximately balanced, the mean squared error of $\hat{\mathcal{P}}^{emp}$ satisfies

$$\mathbb{E}\left\|\hat{\mathcal{P}}^{emp} - \mathcal{P}\right\|_F^2 = \sum_{i_1,\ldots,i_d}\mathrm{Var}\left(\hat{\mathcal{P}}^{emp}_{i_1,\ldots,i_d}\right) = \sum_{i_1,\ldots,i_{d-1}}\sum_{i_d}\frac{\mathbb{P}(i_d \mid i_1, \ldots, i_{d-1})\left(1 - \mathbb{P}(i_d \mid i_1, \ldots, i_{d-1})\right)}{n} \asymp \frac{p^{d-1}}{n}. \tag{21}$$

To obtain a more accurate estimator, we propose to first perform TTOI on $\hat{\mathcal{P}}^{emp}$ to obtain $\hat{\mathcal{P}}^{(1)}$, and then project each row of $[\hat{\mathcal{P}}^{(1)}]_{d-1}$, or equivalently each mode-$d$ fiber of $\hat{\mathcal{P}}^{(1)}$, onto the simplex $S^{p-1} = \{x \in \mathbb{R}^p : \sum_{i=1}^{p} x_i = 1,\ x_i \geq 0 \text{ for all } 1 \leq i \leq p\}$ via the probability simplex projection (see an implementation in [81]); this yields the final estimate $\hat{\mathcal{P}}$.
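One standard way to carry out the probability simplex projection referenced above is the sort-based Euclidean projection; the NumPy sketch below (our own code with hypothetical function names, not the implementation in [81]) applies it to every mode-d fiber of an estimated transition tensor.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {x : x_i >= 0, sum_i x_i = 1}, via the standard sort-based rule."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def project_fibers(P_hat):
    """Apply the simplex projection to every mode-d fiber (last axis)."""
    return np.apply_along_axis(project_to_simplex, -1, P_hat)

rng = np.random.default_rng(0)
P_hat = rng.standard_normal((4, 4, 4))            # a toy order-3 "estimate"
P_proj = project_fibers(P_hat)
print(np.allclose(P_proj.sum(axis=-1), 1.0), bool((P_proj >= 0).all()))  # True True
```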

We establish an upper bound on estimation error for the TTOI estimator P^.

Proposition V.2. Consider the synchronous or generative model for a $(d-1)$st order state aggregatable Markov process described above. Suppose the initialization condition (8) in Theorem III.1 holds. Then with probability at least $1 - Ce^{-cp}$, the output of one-step TTOI followed by the probability simplex projection satisfies

$$\left\|\hat{\mathcal{P}} - \mathcal{P}\right\|_F^2 \leq C\left(\max_{1 \leq i \leq d-1} r_i\right)\sum_{i=1}^{d} p_i r_i r_{i-1}\big/ n.$$

The proof of Proposition V.2 is provided in Section A-G. Compared to the estimation error rate of $\hat{\mathcal{P}}^{emp}$ in (21), Proposition V.2 shows that TTOI achieves a significantly reduced estimation error by exploiting the low-TT-rank structure of the high-order Markov process.

Remark V.1. If the observations form a single transition trajectory $\{X_0, \ldots, X_N\}$, we can work with the following empirical transition tensor:

$$\hat{\mathcal{P}}^{emp}_{i_1,\ldots,i_d} = \frac{\sum_{t=0}^{N-d+1} 1\{X_t = i_1, \ldots, X_{t+d-1} = i_d\}}{\sum_{t=0}^{N-d+1} 1\{X_t = i_1, \ldots, X_{t+d-2} = i_{d-1}\}} \ \ \text{if } \sum_{t=0}^{N-d+1} 1\{X_t = i_1, \ldots, X_{t+d-2} = i_{d-1}\} > 0; \qquad \hat{\mathcal{P}}^{emp}_{i_1,\ldots,i_d} = \frac{1}{p} \ \ \text{otherwise}. \tag{22}$$

Then $\hat{\mathcal{P}}^{emp}$ is a nearly unbiased and strongly consistent estimator of $\mathcal{P}$. When the Markov process is $(d-1)$st order state aggregatable, we can apply TTOI to obtain a better estimate. As explored in the numerical studies in Section VI-A, the TTOI estimator achieves favorable performance in estimating $\mathcal{P}$.
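A minimal sketch of the estimator (22), assuming states are coded as integers $0, \ldots, p-1$ and using our own function name, is as follows.

```python
import numpy as np

def empirical_transition_tensor(traj, p, d):
    """Empirical order-(d-1) transition tensor (22) from one trajectory.

    traj is a 1-D array of states in {0, ..., p-1}; the output has shape
    (p,)*d, with uniform fibers 1/p wherever a length-(d-1) context never occurs.
    """
    counts = np.zeros((p,) * d)
    for t in range(len(traj) - d + 1):
        counts[tuple(traj[t:t + d])] += 1
    ctx = counts.sum(axis=-1, keepdims=True)              # context counts
    return np.where(ctx > 0, counts / np.maximum(ctx, 1), 1.0 / p)

# Toy example: a second-order chain (d = 3) on p = 3 states.
rng = np.random.default_rng(0)
traj = rng.integers(0, 3, size=2000)
P_emp = empirical_transition_tensor(traj, p=3, d=3)
print(np.allclose(P_emp.sum(axis=-1), 1.0))               # True
```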

VI. Numerical Studies

In this section, we investigate the numerical performance of TTOI.

A. Simulation

In each simulation setting, we report the average estimation error (dots) and standard deviation (bars) over 100 repetitions. We assume the true TT-ranks are known in the first three sets of experiments; afterwards, we introduce a BIC-type data-driven scheme for TT-rank selection and present its numerical performance. All experiments are conducted on a quad-core 2.3 GHz Intel Core i5 processor.

We first consider the tensor-train spiked tensor model (2) discussed in Section IV. Specifically, we randomly generate $G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d$ with i.i.d. standard normal entries and generate $\mathcal{Z}$ with i.i.d. $N(0, \sigma^2)$ or $\mathrm{Unif}(-b, b)$ entries. Let $p_1 = \cdots = p_d = p$, $r_1 = \cdots = r_{d-1} = r$, and consider four settings: (1) $p = 100$, $d = 3$, $r = 1$; (2) $p = 50$, $d = 4$, $r = 1$; (3) $p = 20$, $d = 5$, $r = 1$; (4) $p = 20$, $d = 5$, $r = 2$. For varying values of $\sigma \in [1, 19]$ and $b \in [3, 30]$, we evaluate the estimation error $\|\hat{\mathcal{X}}^{(t)} - \mathcal{X}\|_F$ of the TT-SVD and TTOI estimators with 1 or 2 iterations, i.e., $t_{\max} = 0, 1, 2$. From the results summarized in Figure 5 (normal noise) and Figure 6 (uniform noise), we can see that TTOI, even with one iteration, performs significantly better than TT-SVD, and the advantage becomes more pronounced as the noise level $\sigma$ or $b$ grows. This suggests that the proposed TTOI is effective for high-order tensor SVD compared to the classic TT-SVD, especially when the observations are corrupted by substantial noise. Table I summarizes the runtime of TT-SVD and TTOI, which suggests that the additional computational cost incurred by the backward and forward updates in TTOI is negligible compared to the runtime of the original TT-SVD.

Fig. 5.

Estimation error of TT-SVD and TTOI for the high-order spiked tensor model. Here, $\mathcal{Z} \overset{i.i.d.}{\sim} N(0, \sigma^2)$.

Fig. 6.

Estimation error of TT-SVD and TTOI for the high-order spiked tensor model. Here, $\mathcal{Z} \overset{i.i.d.}{\sim} \mathrm{Unif}(-b, b)$.

TABLE I.

Runtime (in seconds) of TT-SVD, TTOI with 1 iteration, and TTOI with 2 iterations under the high-order spiked tensor model with $\mathcal{Z} \overset{i.i.d.}{\sim} N(0, 400)$. The mean runtime over 50 independent replicates is presented, with standard deviations in parentheses.

(p, d, r) TT-SVD TTOI (tmax = 1) TTOI (tmax = 2)
(100, 3, 1) 0.332 (0.071) 0.334 (0.071) 0.340 (0.074)
(50, 4, 1) 1.165 (0.173) 1.169 (0.172) 1.201 (0.171)
(20, 5, 1) 0.725 (0.093) 0.730 (0.092) 0.751 (0.095)
(20, 5, 2) 0.672 (0.100) 0.676 (0.101) 0.708 (0.103)

To understand the influence of the TT-rank on the performance of the TT-SVD and TTOI estimators, we conduct numerical experiments under the spiked tensor model (2) with $r_1 = \cdots = r_{d-1} = r$ for various values of $r$. In particular, $G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d$ are still generated with i.i.d. standard normal entries, and $\mathcal{Z}$ has i.i.d. $N(0, \sigma^2)$ entries. Letting $p_1 = \cdots = p_d = p$, we consider two settings: (1) $p = 100$, $d = 3$, $\sigma = 20$; (2) $p = 500$, $d = 3$, $\sigma = 100$. For $r = 1, \ldots, 10$, we evaluate the average estimation error $\|\hat{\mathcal{X}}^{(t)} - \mathcal{X}\|_F$ of TT-SVD, TTOI with 1 iteration, and TTOI with 2 iterations (i.e., $t_{\max} = 0, 1, 2$), and present the results in Figure 7. Figure 7 suggests that the estimation errors increase as the rank increases, while TTOI with 1 or 2 iterations performs better than TT-SVD. The improvement of TTOI over TT-SVD is more significant for larger $p$ or smaller $r$. An intuitive explanation for this phenomenon is as follows: the key idea of TTOI is to use the previous updates to reduce the dimension of the sequential unfolding $[\mathcal{Y}]_k$ before performing singular value thresholding, and this dimension reduction is more significant for large $p$ or small $r$.

Fig. 7.

Estimation error of TT-SVD and TTOI for high-order spiked tensor model with varying TT-ranks

Next, we demonstrate the performance of TTOI on transition tensor estimation for the high-order state-aggregatable Markov chains studied in Section V. We consider a $(d-1)$st order Markov chain on $p$ states. To generate the transition tensor $\mathcal{P}$, we first draw $\tilde{G}_1 \in \mathbb{R}^{p \times r}$, $\tilde{\mathcal{G}}_2 \in \mathbb{R}^{r \times p \times r}, \ldots, \tilde{G}_d \in \mathbb{R}^{r \times p}$ with i.i.d. standard normal entries, then normalize the rows of $\tilde{G}_1, \tilde{\mathcal{G}}_2, \ldots, \tilde{G}_d$ in absolute value as

$$G_{1,[i,j]} = \frac{|\tilde{G}_{1,[i,j]}|}{\sum_j |\tilde{G}_{1,[i,j]}|}, \qquad \mathcal{G}_{k,[i_1,i_2,j]} = \frac{|\tilde{\mathcal{G}}_{k,[i_1,i_2,j]}|}{\sum_j |\tilde{\mathcal{G}}_{k,[i_1,i_2,j]}|}, \qquad G_{d,[i,j]} = \frac{|\tilde{G}_{d,[i,j]}|}{\sum_j |\tilde{G}_{d,[i,j]}|}.$$

In this way, $\mathcal{P} = [\![ G_1, \mathcal{G}_2, \ldots, \mathcal{G}_{d-1}, G_d ]\!]$ satisfies $\mathcal{P}_{i_1,\ldots,i_d} \geq 0$ and $\sum_{i_d=1}^{p}\mathcal{P}_{i_1,\ldots,i_d} = 1$ for any $(i_1, \ldots, i_{d-1})$, so $\mathcal{P}$ forms a Markov transition tensor (a small code sketch of this construction is given below). To generate the trajectory $\{X_1, \ldots, X_N\}$, we draw the initial $(d-1)$ states $X_1, \ldots, X_{d-1}$ i.i.d. uniformly from $[p]$ and then generate $X_d, \ldots, X_N$ sequentially according to (20). To estimate $\mathcal{P}$, we construct the empirical probability tensor $\hat{\mathcal{P}}^{emp}$ by (22), then apply TT-SVD and TTOI with input $\hat{\mathcal{P}}^{emp}$ as described in Section V to obtain $\hat{\mathcal{P}}$. We consider two numerical settings: (1) $p = 100$, $d = 3$, $r = 1$; (2) $p = 50$, $d = 4$, $r = 1$. We evaluate the estimation error $\|\hat{\mathcal{P}} - \mathcal{P}\|_F$ for each setting and summarize the results in Figure 8. Again, TTOI exhibits a clear advantage over the other methods in all simulation settings.
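The construction of $\mathcal{P}$ just described can be sketched as follows (our own illustration; random_tt_transition_tensor is a hypothetical name, not the authors' simulation code).

```python
import numpy as np

def random_tt_transition_tensor(p, d, r, rng):
    """Random transition tensor with TT-rank <= (r, ..., r), following the
    absolute-value row normalization described in the text."""
    cores = [np.abs(rng.standard_normal((p, r)))]
    cores += [np.abs(rng.standard_normal((r, p, r))) for _ in range(d - 2)]
    cores += [np.abs(rng.standard_normal((r, p)))]
    # Normalize so that each "row" sums to one.
    cores[0] /= cores[0].sum(axis=1, keepdims=True)
    for k in range(1, d - 1):
        cores[k] /= cores[k].sum(axis=2, keepdims=True)
    cores[-1] /= cores[-1].sum(axis=1, keepdims=True)
    # Contract the TT cores into the full tensor P.
    P = cores[0]
    for k in range(1, d - 1):
        P = np.tensordot(P, cores[k], axes=([-1], [0]))
    return np.tensordot(P, cores[-1], axes=([-1], [0]))

P = random_tt_transition_tensor(p=5, d=3, r=2, rng=np.random.default_rng(0))
print(np.allclose(P.sum(axis=-1), 1.0))   # each conditional distribution sums to 1
```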

Fig. 8.

Estimation error of the transition tensor versus length of the observable trajectory in high order state-aggregatable Markov chain estimation.

Selection of TT-ranks. The proposed TTOI algorithm requires the TT-ranks $r_1, \ldots, r_{d-1}$ as inputs, and appropriate choices of $r_1, \ldots, r_{d-1}$ are crucial in practice. We propose a data-driven scheme to select the TT-ranks: we choose $r_1, \ldots, r_{d-1} \geq 1$ to minimize the following Bayesian information criterion (BIC) under the spiked tensor model:

$$\mathrm{BIC}(r_1, \ldots, r_{d-1}) = \left(\prod_{k=1}^{d} p_k\right)\log\left\|\mathcal{Y} - \hat{\mathcal{X}}(r_1, \ldots, r_{d-1})\right\|_F^2 + \left(p_1 r_1 + \sum_{k=2}^{d-1} p_k r_{k-1} r_k + p_d r_{d-1}\right)\left(\sum_{k=1}^{d}\log p_k\right). \tag{23}$$

Here, $\hat{\mathcal{X}}(r_1, \ldots, r_{d-1})$ is the output of TTOI (Algorithm 1) with input TT-ranks $r_1, \ldots, r_{d-1}$. This BIC-type criterion was also adopted in prior work on tensor clustering [82].
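A simple way to implement this rank selection is a grid search over candidate TT-ranks; the sketch below (our own code, with hypothetical names) evaluates (23) for a user-supplied fitting routine, e.g., a wrapper around Algorithm 1 or the TT-SVD sketch given after Remark II.2.

```python
import itertools
import numpy as np

def bic_select_ranks(Y, fit, max_rank):
    """Grid search over TT-ranks minimizing the BIC criterion (23).

    `fit(Y, ranks)` is any routine returning the estimate X_hat for the given
    TT-ranks (e.g., TT-SVD or TTOI); this wrapper only evaluates (23).
    """
    p = Y.shape
    d = len(p)
    log_n = sum(np.log(pk) for pk in p)               # sum_k log p_k
    best = None
    for ranks in itertools.product(range(1, max_rank + 1), repeat=d - 1):
        X_hat = fit(Y, ranks)
        df = (p[0] * ranks[0] + p[-1] * ranks[-1]
              + sum(p[k] * ranks[k - 1] * ranks[k] for k in range(1, d - 1)))
        bic = np.prod(p) * np.log(np.sum((Y - X_hat) ** 2)) + df * log_n
        if best is None or bic < best[0]:
            best = (bic, ranks)
    return best[1]
```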

We then conduct numerical experiments under the same setting as the bottom two plots of Figure 5, i.e., the spiked tensor model with Gaussian noise. Figure 9 summarizes the estimation errors of TT-SVD and of TTOI with 1 and 2 iterations, respectively, with the ranks selected by the proposed BIC criterion (23). Comparing Figure 9 with the bottom two plots of Figure 5, we can see that the proposed criterion selects the true ranks accurately, and the performance of both TT-SVD and TTOI with tuned ranks is very similar to that obtained with the true ranks.

Fig. 9.

Average estimation error of TT-SVD and TTOI for high-order spiked tensor model with BIC-tuned ranks.

B. Real Data Experiments

We apply the proposed method to investigate the Manhattan taxi data. This dataset contains the New York City taxi trip records of 14,144 drivers in 2013. We treat each travel record as a transition between locations in New York City, so the overall dataset can be organized as a collection of fragmented sample trajectories of a Markov chain on New York City traffic. Some recent analyses of such data can be found in, e.g., [71], [83], [45].

Due to the high-dimensional spatiotemporal nature of the dataset, sufficient dimension reduction or state aggregation is often a crucial first step in studying metropolitan-wide traffic patterns. To this end, we apply the high-order Markov model described in Section V. Specifically, we discretize the Manhattan region into a grid of $p = 119$ locations, which forms the state space. Then, we collect all travel records in Manhattan for each driver from the dataset, sort them by time, and form Markovian transition trajectories. In particular, each travel record is treated as a transition from the pickup to the drop-off location. If the drop-off location $i$ of the previous trip differs from the pickup location $j$ of the next trip by the same driver, we also form a transition from state $i$ to state $j$. Based on the trajectories, we construct a high-order Markov chain with an order-$d$ empirical transition probability tensor $\hat{\mathcal{P}}^{emp} \in \mathbb{R}^{p \times \cdots \times p}$ as described in Section V. Assuming the true probability tensor is state aggregatable (Definition V.1), we apply the one-step TTOI proposed in Section V and obtain $\hat{\mathcal{P}}$. It is noteworthy that when $d = 2$, the described procedure for $\hat{\mathcal{P}}$ is equivalent to the classic matrix spectral decomposition in the literature. Figure 10 plots the singular values of the sequential unfolding matrices of $\hat{\mathcal{P}}^{emp}$ for $d = 3$, which clearly demonstrates the low TT-rank of the transition probability tensor $\mathcal{P}$. In the following experiments, we focus on the order-2 Markov model and analyze all consecutive pairs of transitions $i \to j \to k$, corresponding to the case $d = 3$.

Fig. 10.

Singular values of the sequential unfolding matrices $[\hat{\mathcal{P}}^{emp}]_1$ (left panel) and $[\hat{\mathcal{P}}^{emp}]_2$ (right panel)

Inspired by classic matrix spectral decomposition methods, we aggregate all location states in Manhattan into a few clusters via both $\hat{\mathcal{P}}$ and $\hat{\mathcal{P}}^{emp}$. Specifically, we calculate $\hat{G}_d$, i.e., the last TT-core of $\hat{\mathcal{P}}$, and $[\hat{\mathcal{P}}^{emp}]_{d-1}$, i.e., the matricization of $\hat{\mathcal{P}}^{emp}$ whose columns correspond to the last mode. We then perform k-means on the columns of $\hat{G}_d$ and of $[\hat{\mathcal{P}}^{emp}]_{d-1}$, record the cluster indices, associate each index with its location state, and plot the results in Figure 11 (Panels (a)(b) are for TTOI and Panels (c)(d) are for the empirical estimate). From Figure 11 (a)(b), we can clearly identify four regions: (i) lower Manhattan (orange), (ii) midtown (dark blue), (iii) the upper west side (green), and (iv) the upper east side (brown or black). In contrast, direct clustering on $\hat{\mathcal{P}}^{emp}$ yields less interpretable results, as the majority of points fall into one cluster. It is also worth noting that, even though location information is not provided in this experiment, the resulting clusters in Figure 11 (a)(b) show good spatial proximity between locations. This illustrates the effectiveness of TTOI in dimension reduction and state aggregation for high-order Markov processes.
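The state aggregation step can be sketched as follows, assuming scikit-learn is available; aggregate_states is our own helper name, and in the experiment the input would be the $r \times p$ core $\hat{G}_d$ returned by TTOI with $p = 119$ and four clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

def aggregate_states(G_d, n_clusters, seed=0):
    """Cluster location states via k-means on the columns of the last TT-core
    G_d (shape r x p): each column is the low-dimensional embedding of one
    destination state."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(G_d.T)          # one cluster label per state

# Toy usage with a random "core" standing in for the TTOI output.
rng = np.random.default_rng(0)
labels = aggregate_states(rng.standard_normal((7, 119)), n_clusters=4)
print(np.bincount(labels))                # cluster sizes
```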

Fig. 11.

State aggregation based on TTOI and empirical estimate

Next, we illustrate the high-order nature of city-wide taxi trips through the following experiment. For each initial state $i \in [p]$, we apply k-means to cluster the columns of $\hat{\mathcal{P}}_{[i,:,:]}$, where $\hat{\mathcal{P}}$ is the outcome of TTOI. We present the results in Figure 12, where the red triangle denotes the given first state $i$ and $r = k = 7$. If city-wide taxi trips had no significant high-order effects, $\hat{\mathcal{P}}$ would be reducible to a first-order Markov process and $\hat{\mathcal{P}}_{[i,:,:]}$ would have similar values for different $i$. However, as we can see from Figure 12, the clustering results depend heavily on the first state $i$, so high-order effects do exist in the city-wide taxi trip Markov process. In addition, the states in different directions relative to $i$ are often clustered into different regions, which suggests that taxi drivers tend to move in the same direction in consecutive trips, yielding the high-order effects in the driving trajectories.

Fig. 12.

Based on the second-order Markov model, the state aggregation results differ across initial states (the red triangle denotes the initial state i in each subfigure)

VII. Discussions and Additional Applications

In this paper, we propose a general framework for high-order tensor SVD. We introduce a novel procedure, the tensor-train orthogonal iteration (TTOI), that efficiently estimates the low-TT-rank structure from a high-order tensor observation. TTOI has significant advantages over the classic methods in the literature. We establish a general deterministic error bound for TTOI with the support of several new representation lemmas for tensor matricizations. Under the commonly studied spiked tensor model, we establish an upper bound for TTOI and a matching information-theoretic lower bound. We also illustrate the merits of TTOI through simulation studies and a real data example on New York City taxi trips.

In addition to high-order Markov processes, the proposed TTOI can also be applied to Markov random field (MRF) estimation. We give a brief description of MRFs below. Consider an undirected graph $G = (V, E)$, where $V = \{1, \ldots, d\}$ is a set of vertices and $E \subseteq V \times V$ is a collection of edges. Each vertex $i \in V$ is associated with a random variable $X_i$ taking values in $\{s_1, \ldots, s_p\}$. In an MRF model, the distribution of $(X_1, \ldots, X_d)$ can be factorized as

$$\mathbb{P}(X_1, \ldots, X_d) = \frac{1}{Z}\prod_{C \in \mathcal{C}}\psi_C(X_C),$$

where $\mathcal{C}$ is a collection of subgraphs of $G$, $X_C = (X_v, v \in C)$ denotes the random vector corresponding to the vertices in $C$, and $Z$ is a normalizing constant. The joint probability function $\mathbb{P}(\cdot)$ can be written as a tensor $\mathcal{P} \in \mathbb{R}^{p \times \cdots \times p}$, where $\mathcal{P}_{i_1,\ldots,i_d} = \mathbb{P}(X_1 = s_{i_1}, \ldots, X_d = s_{i_d})$. MRFs have a wide range of applications, including image analysis [84], [85], genomic studies [86], and natural language processing [87]. The reader is referred to, e.g., [88] for an introduction to MRFs.

A central problem for MRFs is estimating the population distribution P based on a limited number of samples $\{X_1^{(i)},\ldots,X_d^{(i)}\}_{i=1}^{n}$. It is straightforward to estimate P via the empirical probability tensor $\hat P^{\mathrm{emp}}$:

\hat P^{\mathrm{emp}}_{i_1,\ldots,i_d}=\frac{1}{n}\sum_{i=1}^{n}\prod_{k=1}^{d}\mathbf{1}\{X_k^{(i)}=s_{i_k}\}.

We can show that $\hat P^{\mathrm{emp}}$ is unbiased for P. Recently, [17] pointed out that P often has approximately low tensor-train rank in practice. To further exploit this structure, we can conduct TTOI on $\hat P^{\mathrm{emp}}$. Under regularity conditions ensuring that the entries of the noise $Z=\hat P^{\mathrm{emp}}-P$ are bounded and weakly dependent, Corollary IV.1 suggests the following estimation error rate for the TTOI estimator: $\|\hat P-P\|_{\mathrm F}^2\le C\sum_{i=1}^{d}r_ir_{i-1}/(np^{2d-1})$, which can be significantly smaller than the estimation error of the original empirical estimator $\hat P^{\mathrm{emp}}$.
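
A minimal sketch of the empirical estimator is given below; the sample array and the downstream TTOI step (omitted) are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(2)
p, d, n = 4, 3, 5000
samples = rng.integers(0, p, size=(n, d))   # placeholder for n observed configurations (X_1, ..., X_d)

P_emp = np.zeros((p,) * d)
for row in samples:
    P_emp[tuple(row)] += 1.0 / n            # empirical frequency of each configuration
# TTOI (not shown here) would then be run on P_emp to exploit the
# approximately low TT-rank structure of the population tensor P.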

Moreover, the proposed framework can also be applied to high-order Markov decision processes (high-order MDPs). MDPs have been commonly used as a baseline model in control theory and reinforcement learning [89], [90], [91], [92]. Despite the wide applications of MDPs, most of the existing work focuses on first-order Markov processes. However, high-order effects often appear, i.e., the transition probability at the current time depends not only on the current state and action, but also on the past (d − 1) states and actions. See Figure 13 for an example. Since the number of free parameters in such MDPs can be huge, sufficient dimension reduction for the state and action spaces can be a crucial first step. Similarly to the example of high-order Markov processes in Section V, TTOI can be applied to achieve better dimension reduction and state aggregation for high-order Markov decision processes.

Fig. 13. Illustration of a high-order state-aggregatable Markov decision process.

Fig. 3. A pictorial illustration of the TT-Backward update (Algorithm 1(b), d = 3).

Acknowledgments

The research of Yuchen Zhou and Anru R. Zhang was supported in part by NSF under Grants CAREER-1944904 and DMS-1811868 and by NIH under Grant R01 GM131399; the research of Yazhen Wang was supported in part by NSF under Grants DMS-1707605 and DMS-1913149. This work was done while Yuchen Zhou, Anru R. Zhang, and Lili Zheng were at the University of Wisconsin-Madison.

Biographies

Yuchen Zhou is a postdoctoral researcher in the Department of Statistics and Data Science, The Wharton School, University of Pennsylvania. He received the B.E. degree from Peking University in 2016 and the Ph.D. degree in statistics from the University of Wisconsin-Madison in 2021. His research interests include high-dimensional statistical inference, tensor data analysis, reinforcement learning and statistical learning theory.

Anru R. Zhang is the Eugene Anson Stead, Jr. M.D. Associate Professor in the Department of Biostatistics & Bioinformatics and Associate Professor in the Departments of Computer Science, Mathematics, and Statistical Science at Duke University. He was an assistant professor of statistics at the University of Wisconsin-Madison in 2015–2021. He obtained his bachelor's degree from Peking University in 2010 and his Ph.D. from the University of Pennsylvania in 2015. His work focuses on high-dimensional statistical inference, non-convex optimization, statistical tensor analysis, computational complexity, and applications in genomics, microbiome, electronic health records, and computational imaging. He received the ASA Gottfried E. Noether Junior Award (2021), a Bernoulli Society New Researcher Award (2021), an ICSA Outstanding Young Researcher Award (2021), and an NSF CAREER Award (2020).

Lili Zheng is a postdoctoral researcher in the Department of Electrical and Computer Engineering at Rice University. She received her bachelor’s degree from University of Science and Technology of China (USTC) in 2016 and her Ph.D. degree in statistics from University of Wisconsin - Madison in 2021. Her research interests span dependent data, high-dimensional statistics, network analysis, tensor modeling, stochastic algorithms, and non-convex optimization.

Yazhen Wang is Chair and Professor of Statistics at the University of Wisconsin-Madison. He obtained his Ph.D. in statistics from the University of California, Berkeley in 1992. He is a fellow of the Institute of Mathematical Statistics (IMS) and the American Statistical Association (ASA). He has served on numerous professional committees of the ASA, IMS, and ICSA, as an NSF program director, as an editor of Statistica Sinica and Statistics and Its Interface, and as an associate editor of various journals including the Annals of Statistics, the Annals of Applied Statistics, the Journal of the American Statistical Association, the Journal of Business and Economic Statistics, and Statistica Sinica. His research interests are financial econometrics, quantum computation, machine learning, high-dimensional statistics, nonparametric curve estimation, wavelets, change points, long-memory processes, and order-restricted inference.

Appendix A

Proofs

We collect all technical proofs of this paper in this section.

A. Proof of Theorem III.1

For convenience, let $\hat U_i$, $\hat V_i$, $R_i$, and $\tilde R_i$ denote $\hat U_i^{(0)}$, $\hat V_i^{(1)}$, $R_i^{(0)}$, and $\tilde R_i^{(0)}$, respectively. By Lemma III.1 and

Ip2pdP(V^dIp2pd1)(V^3Ip2)V^2=P(V^dIp2pd1)(V^3Ip2)V^2+P(V^dIp2pd1)(V^4Ip2p3)(V^3Ip2)++PV^dIp2pd1,

we have

X^(1)XF2=[[Y]1(V^dIp2pd1)(V^3Ip2)V^2]V^2(V^3Ip2)(V^dIp2pd1)[X]1F2=[Z]1P(V^dIp2pd1)(V^3Ip2)V^2+[X]1P(V^dIp2pd1)(V^3Ip2)V^2[X]1F2C([Z]1P(V^dIp2pd1)(V^3Ip2)V^2F2+[X]1P(V^dIp2pd1)(V^3Ip2)V^2F2+[X]1P(V^dIp2pd1)(V^4Ip2p3)(V^3Ip2)F2++[X]1PV^dIp2pd1F2)C([Z]1(V^dIp2pd1)(V^3Ip2)V^2F2+[X]1(V^dIp2pd1)(V^3Ip2)V^2F2+[X]1(V^dIp2pd1)(V^4Ip2p3)(V^3Ip2)F2++[X]1(V^dIp2pd1)F2). (24)

To prove (9), we only need to show that for all 2 ≤ k ≤ d,

[X]1(V^dIp2pd1)(V^k+1Ip2pk)(V^kIp2pk1)FCU^k1(Ipk1U^k2)(Ip2pk1U^1)[Z]k1(V^dIpkpd1)(V^k+1Ipk)F, (25)

where

[X]1(V^dIp2pd1)(V^k+1Ip2pk)(V^kIp2pk1)=[X]1(V^dIp2pd1)(V^3Ip2)V^2

if k = 2 and

[X]1(V^dIp2pd1)(V^k+1Ip2pk)(V^kIp2pk1)=[X]1(V^dIp2pd1)

if k = d.

By Lemma III.2, we have

[X]1(V^dIp2pd1)(V^k+1Ip2pk)(V^kIp2pk1)F=[A(p2pk1,p1)]([X]k1Ip2pk1)(V^dIp2pd1)(V^k+1Ip2pk)(V^kIp2pk1)F=[A(p2pk1,p1)](([X]k1(V^dIpkpd1)(V^k+1Ipk)V^k)Ip2pk1)F=[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF. (26)

The third equation holds since the realignment does not change the Frobenius norm.

Moreover, recall that U1p1×r1 is the left singular space of [X]1, and U˜jpjrj1×rj is the left singular space of (IpjU^j1)(Ipj1pjU^j2)(Ip2pjU^1)[X]j for 2 ≤ jd − 1, by Lemma III.2, for any 2 ≤ kd − 1,

[X]k=(Ip2pk[X]1)A(p2pk,pk+1pd)=(Ip2pkPU1[X]1)A(p2pk,pk+1pd)=(Ip2pkPU1)(Ip2pk[X]1)A(p2pk,pk+1pd)=(Ip2pkPU1)[X]k, (27)

and for any 2 ≤ j < k,

(IpjpkU^j1)(Ipj1pkU^j2)(Ip2pkU^1)[X]k=(IpjpkU^j1)(Ipj1pkU^j2)(Ip2pkU^1)(Ipj+1pk[X]j)A(pj+1pk,pk+1pd)=(Ipj+1pk[(IpjU^j1)(Ipj1pjU^j2)(Ip2pjU^1)[X]j])A(pj+1pk,pk+1pd)=(Ipj+1pk[PU˜j(IpjU^j1)(Ipj1pjU^j2)(Ip2pjU^1)[X]j])A(pj+1pk,pk+1pd)=(Ipj+1pkPU˜j)(IpjpkU^j1)(Ipj1pkU^j2)(Ip2pkU^1)(Ipj+1pk[X]j)A(pj+1pk,pk+1pd)=(Ipj+1pkPU˜j)(IpjpkU^j1)(Ipj1pkU^j2)(Ip2pkU^1)[X]k, (28)

where A(i,j) is defined in (5) for any i, j > 0. Therefore, by (27),

[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF=(Ip2pk1PU1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF=(Ip2pk1U1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF(Ip2pk1U^1)(Ip2pk1U1)(Ip2pk1U1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kFsmin1((Ip2pk1U^1)(Ip2pk1U1))=(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kFsmin1(U^1U1). (29)

The inequality holds since BFABFsmin1(A) for any invertible matrix Am1×m1 and Bm1×m2; in the last step, we used (Ip2pk1U1)(Ip2pk1U1)[X]k1=(Ip2pk1PU1)[X]k1=[X]k1. Similarly to (29), by (28), for 1 ≤ jk − 2,

(Ipj+1pk1U^j)(Ip2pk1U^1)[X]k1·(V^dIpkpd1)(V^k+1Ipk)V^kF=(Ipj+2pk1PU˜j+1)(Ipj+1pk1U^j)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF=(Ipj+2pk1U˜j+1)(Ipj+1pk1U^j)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF(Ipj+2pk1U^j+1)(Ipj+1pk1U^j)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF·smin1(U^j+1U˜j+1). (30)

By (29) and (30),

[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kFsmin1(U^1U1)(Ip3pk1U^2)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kFsmin1(U1U^1)smin1(U˜2U^2)U^k1(Ipk1U^k2)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kFsmin1(U1U^1)smin1(U˜2U^2)smin1(U˜k1U^k1)U^k1(Ipk1U^k2)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF(11c02)k1CU^k1(Ipk1U^k2)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF. (31)

By the definition of V^k(pkrk)×rk1 and Lemma III.3, we know that V^k is the right singular space of

U^k1(Ipk1U^k2)(Ip2pk1U^1)[Y]k1(V^dIpkpd1)(V^k+1Ipk)=U^k1(Ipk1U^k2)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)+U^k1(Ipk1U^k2)(Ip2pk1U^1)[Z]k1(V^dIpkpd1)(V^k+1Ipk),

Lemma A.3 shows that

U^k1(Ipk1U^k2)(Ip2pk1U^1)[X]k1(V^dIpkpd1)(V^k+1Ipk)V^kF2U^k1(Ipk1U^k2)(Ip2pk1U^1)[Z]k1·(V^dIpkpd1)(V^k+1Ipk)F. (32)

Combining (26), (31), and (32), we know that (25) holds for all 2 ≤ k ≤ d, which finishes the proof of Theorem III.1.

B. Proof of Theorem III.2

For i ≥ 1, by the definition of X(2i) and Lemma III.1, we have

YX^(2i)F2=(Ip1pd1P(Ip2pd1U^1(2i))(Ipd1U^d2(2i))U^d1(2i))·[Y]d1F2=[Y]d1F2P(Ip2pd1U^1(2i))(Ipd1U^d2(2i))U^d1(2i)[Y]d1F2=YF2X^(2i)F2.

Similarly, we have

\|Y-\hat{X}^{(2i-1)}\|_{\mathrm{F}}^2=\|Y\|_{\mathrm{F}}^2-\|\hat{X}^{(2i-1)}\|_{\mathrm{F}}^2.
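
Both identities above rely on the Pythagorean relation for orthogonal projections: if $\hat X$ is an orthogonal projection of $Y$, then $\|Y-\hat X\|_{\mathrm F}^2=\|Y\|_{\mathrm F}^2-\|\hat X\|_{\mathrm F}^2$. A quick numerical sanity check, with random inputs chosen only for illustration, is:

import numpy as np

rng = np.random.default_rng(3)
m, n, r = 30, 20, 5
U, _ = np.linalg.qr(rng.normal(size=(m, r)))   # orthonormal basis of an r-dimensional subspace
Y = rng.normal(size=(m, n))
X_hat = U @ (U.T @ Y)                          # orthogonal projection of Y onto that subspace

lhs = np.linalg.norm(Y - X_hat) ** 2
rhs = np.linalg.norm(Y) ** 2 - np.linalg.norm(X_hat) ** 2
print(np.isclose(lhs, rhs))                    # True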

In addition, we have

YX^(2i)F2=[Y]d1F2P(Ip2pd1U^1(2i))(Ipd1U^d2(2i))U^d1(2i)[Y]d1F2=[Y]d1F2U^d1(2i)(Ipd1U^d2(2i))(Ip2pd1U^1(2i))[Y]d1F2=[Y]1F2U^d1(2i)(Ipd1U^d2(2i))(Ip2pd1U^1(2i))[Y]d1V^d(2i1)F2U^d1(2i)(Ipd1U^d2(2i))(Ip2pd1U^1(2i))·[Y]d1V^d(2i1)F2[Y]1F2U^d1(2i)(Ipd1U^d2(2i))(Ip2pd1U^1(2i))[Y]d1V^d(2i1)F2=[Y]1F2(Ipd1U^d2(2i))(Ipd2pd1U^d3(2i))(Ip2pd1U^1(2i))[Y]d1V^d(2i1)F2.

The last equation holds since $\hat U_{d-1}^{(2i)}$ is the left singular space of (Ipd1U^d2(2i))(Ipd2pd1U^d3(2i))(Ip2pd1U^1(2i))[Y]d1V^d(2i1). For any $B\in\mathbb{R}^{n\times r}$ and 1 ≤ l ≤ r, we can check that the l-th columns of $A^{(m,n)}B$ and $(I_m\otimes B\otimes I_m)A^{(m,r)}$ are equal:

(A(m,n)B)[:,l]=j=1nBj,lk=1me(k1)mn+(j1)m+k(m2n)=((ImBIm)A(m,r))[:,l]

where $e^{(m^2n)}_{(k-1)mn+(j-1)m+k}$ is the ((k−1)mn+(j−1)m+k)-th canonical basis vector of $\mathbb{R}^{m^2n}$ and $A^{(i,j)}$ is defined in (5). Therefore,

A^{(m,n)}B=(I_m\otimes B\otimes I_m)A^{(m,r)}.

By the last equation and Lemma III.2, we have

(Ipd1U^d2(2i))(Ipd2pd1U^d3(2i))(Ip2pd1U^1(2i))[Y]d1V^d(2i1)=(Ipd1U^d2(2i))(Ipd2pd1U^d3(2i))(Ip2pd1U^1(2i))(Ipd1[Y]d2)A(pd1,pd)V^d(2i1)=(Ipd1(U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2))(Ipd1(V^d(2i1)Ipd1))A(pd1,rd1)=(Ipd1(U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)))A(pd1,rd1)=Reshape(U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1),rd2pd1,rd1).

Since the realignment does not change the Frobenius norm, we have

YX^(2i)F2[Y]1F2U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)F2. (33)

By a proof similar to that of (33), we have

YX^(2i)F2[Y]1F2     U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)F2=[Y]1F2U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)V^d1(2i1)F2U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)V^d1(2i1)F2[Y]1F2U^d2(2i)(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)V^d1(2i1)F2=[Y]1F2(Ipd2U^d3(2i))(Ip2pd2U^1(2i))[Y]d2(V^d(2i1)Ipd1)V^d1(2i1)F2[Y]1F2[Y]1(V^d(2i1)Ip2pd1)(V^3(2i1)Ip2)V^2(2i1)F2=[Y]1(Ip2pdP(V^d(2i1)Ip2pd1)(V^3(2i1)Ip2)V^2(2i1))F2=YX^(2i1)F2.

Similarly, we can prove (11) holds for k = 2i, i ≥ 0.

C. Proof of Theorem IV.1

Without loss of generality, we assume σ² = 1. We still let $\hat U_i$, $\hat V_i$, $R_i$, and $\tilde R_i$ denote $\hat U_i^{(0)}$, $\hat V_i^{(1)}$, $R_i^{(0)}$, and $\tilde R_i^{(0)}$, respectively.

Lemma A.2 Part 4 immediately shows that (15) holds with probability at least $1-Ce^{-cp}$. Next, we show that with probability at least $1-Ce^{-cp}$,

\left\|\sin\Theta(\hat U_k,\tilde U_k)\right\|\le \frac{C\sqrt{\sum_{i=1}^{k-1}p_ir_{i-1}r_i+p_kr_{k-1}+p_{k+1}\cdots p_d}}{\lambda_k}\wedge\frac{1}{2},\qquad 1\le k\le d-1. \quad (34)

Recall that

\hat U_1=\mathrm{SVD}_{r_1}^{L}([Y]_1),\qquad [Y]_1=[X]_1+[Z]_1,

where [X]1p1×p1 satisfying rank([X]1)=r1,[Z]1p1×p1, by Lemmas A.3 and A.2, with probability 1−Cecp, we have

U^1[X]12[Z]1C(p11/2+(p2pd)1/2).

Therefore, with probability at least 1 − Cecp,

sinΘ(U^1,U1)U^1U1U1[X]1sr1(U1[X]1)=U^1[X]1sr1([X]1)Cp1+p2pdλ1.

For 2 ≤ i ≤ j ≤ d − 1, by the definition of $\tilde U_i$ and Lemma III.2, we have

[X]j=(Ip2pj[X]1)A(p2pj,pj+1pd)=(Ip2pj(PU1[X]1))A(p2pj,pj+1pd)=(Ip2pjPU1)(Ip2pj[X]1)A(p2pj,pj+1pd)=(Ip2pjU1)(Ip2pjU1)[X]j (35)

and

(IpipjU^i1)(Ip2pjU^1)[X]j=(Ipt+1pj(IptU^i1))(Ipt+1pj(Ip2ptU^1))(Ipi+1pj[X]i)A(pt+1pj,pj+1pd)=(Ipi+1pj((IpiU^i1)(Ip2piU^1)[X]i))A(pi+1pj,pj+1pd)=(Ipi+1pj(PU˜i(IpiU^i1)(Ip2piU^1)[X]i))A(pi+1pj,pj+1pd)=(Ipi+1pjPU˜i)(Ipi+1pj((IpiU^i1)(Ip2piU^1)[X]i))A(pi+1pj,pj+1pd)=(Ipi+1pjU˜i)(Ipi+1pjU˜i)(IpipjU^i1)(Ip2pjU^1)[X]j, (36)

where $I_{p_{i+1}\cdots p_j}=1$ if i = j. Let

Lk=sinΘ(U˜k,U^k),     2kd1.

For k = 2, by (35) and Lemma A.1, with probability at least $1-Ce^{-cp}$,

sr2((Ip2U^1)[X]2)smin((Ip2U^1)(Ip2U1))sr2([X]2)=smin(U^1U1)λ2=1sinΘ(U^1,U1)2λ234λ2.

Since U^2=SVDr2L((Ip2U^1)[Y]2), and (Ip2U^1)[Y]2=(Ip2U^1)[X]2+(Ip2U^1)[Z]2, by Lemma A.3 and Lemma A.1, we know that with probability at least 1 − Cecpr,

U^2(Ip2U^1)[X]22(Ip2U^1)[Z]2C(p2r1+(p3pd)1/2+p1r1).

Combining the two previous inequalities and recalling that $\tilde U_2$ is the left singular space of $(I_{p_2}\otimes\hat U_1)^\top[X]_2$, we have

sinΘ(U^2,U˜2)U^2U˜2U˜2(Ip2U^1)[X]2sr2(U˜2(Ip2U^1)[X]2)=U^2(Ip2U^1)[X]2sr2((Ip2U^1)[X]2)Cp1r1+p2r1+(p3pd)1/2λ2

with probability at least $1-Ce^{-cp}$.

Assume that (34) holds for k ≤ j − 1 with probability $1-Ce^{-cp}$. For k = j, by Lemma A.1 and (36), with probability at least $1-Ce^{-cp}$, we have

srj((IpjU^j1)(Ipj1pjU^j2)(Ip2pj1pjU^1)[X]j)smin((IpjU^j1)(IpjU˜j1))srj((Ipj1pjU^j2)(Ip2pj1pjU^1)[X]j)=smin(U^j1U˜j1)srj((Ipj1pjU^j2)(Ip2pj1pjU^1)[X]j)smin(U^j1U˜j1)smin((Ipj1pjU^j2)(Ipj1pjU˜j2))srj((Ipj2pj1pjU^j3)(Ip2pj1pjU^1)[X]j)smin(U^j1U˜j1)smin(U^1U˜1)srj([X]j)=1Lj121L12λj(3/4)j1λjcλj. (37)

In the last inequality, we used the fact that d is a fixed number and $(3/4)^{j-1}\ge(3/4)^{d-1}\ge c$.

By the definition of U^j and Lemma III.3, we have

U^j=SVDrjL((IpjU^j1)(Ipj1pjU^j2)(Ip2pj1pjU^1)[Y]j).

Note that

(IpjU^j1)(Ipj1pjU^j2)(Ip2pjU^1)[Y]j=(IpjU^j1)(Ipj1pjU^j2)(Ip2pjU^1)[X]j+(IpjU^j1)(Ipj1pjU^j2)(Ip2pjU^1)[Z]j,

by Lemma A.3, with probability at least 1ecpr2,

U^j(IpjU^j1)(Ipj1pjU^j2)(Ip2pj1pjU^1)[X]j2(IpjU^j1)(Ipj1pjU^j2)(Ip2pj1pjU^1)[Z]jC[(i=1j1piri1ri)1/2+(pjrj1)1/2+(pj+1pd)1/2].

Therefore, with probability at least $1-Ce^{-cp}$,

sinΘ(U^j,U˜j)U^jU˜jU˜j(IpjU^j1)(Ip2pjU^1)[X]jsrj(U˜j(IpjU^j1)(Ip2pjU^1)[X]j)=U^j(IpjU^j1)(Ip2pjU^1)[X]jsrj((IpjU^j1)(Ip2pjU^1)[X]j)C(i=1j1piri1ri)1/2+(pjrj1)1/2+(pj+1pd)1/2λj.

Therefore, (13) holds with probability $1-Ce^{-cp}$.

Finally, we consider (14). Let E0={(13) and (15) hold}. Without loss of generality, we only show that under E0,

\left\|\sin\Theta(\hat V_k,\tilde V_k)\right\|\le \frac{C\sqrt{\sum_{i=1}^{d}p_ir_{i-1}r_i}}{\lambda_{k-1}}\wedge\frac{1}{2},\qquad 2\le k\le d. \quad (38)

In fact, (38) can be proved by induction. Let Vdpd×rd1 be the right singular space of [X]d1. Then there exists an orthogonal matrix Q˜d1Ord1 such that

VdQ˜d1=SVDR(U^d1(Ipd1U^d2)(Ipd1p2U^1)[X]d1).

Similarly to (37), under E0,

srd1(U^d1(Ipd1U^d2)(Ipd1p2U^1)[X]d1)(3/4)d1λd1cλd1.

Therefore, by Lemma A.3, under E0,

sinΘ(V^d,Vd)=sinΘ(V^d,VdQ˜d1)U^d1(Ipd1U^d2)(Ipd1p2U^1)[X]d1V^dsrd1(U^d1(Ipd1U^d2)(Ipd1p2U^1)[X]d1)2U^d1(Ipd1U^d2)(Ipd1p2U^1)[Z]d1srd1(U^d1(Ipd1U^d2)(Ipd1p2U^1)[X]d1)Ci=1dpiri1riλd1.

Suppose (38) holds for j + 1 ≤ k ≤ d. For k = j, since V^j is the right singular space of [X]j1(V^dIpjpd1)(V^j+1Ipj), there exists $\tilde Q_{j-1}\in\mathbb{O}_{r_{j-1}}$ such that

V˜jQ˜j1=SVDR(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+1Ipj)).

By Lemma A.1, (35), (36) and (37), under E0

srj1(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+1Ipj))srj1(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+2Ipjpj+1)(V˜j+1Ipj))smin((V˜j+1Ipj)(V^j+1Ipj))=srj1(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+2Ipjpj+1))smin(V˜j+1V^j+1)srj1(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1)smin(V˜dV^d)smin(V˜j+1V^j+1)smin(U^j1U˜j1)srj1((Ipj1U^j2)(Ip2pj1U^1)[X]j1)smin(V˜dV^d)smin(V˜j+1V^j+1)(34)j1λj1(34)djcλj1.

Note that V^jOpjrj,rj1 is the right singular space of U^j1(Ipj1U^j2)(Ip2pj1U^1)[Y]j1(V^dIpjpd1)(V^j+1Ipj) and

U^j1(Ipj1U^j2)(Ip2pj1U^1)[Y]j1(V^dIpjpd1)(V^j+1Ipj)=U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+1Ipj)+U^j1(Ipj1U^j2)(Ip2pj1U^1)[Z]j1(V^dIpjpd1)(V^j+1Ipj),

By Lemma A.3, under E0,

sinΘ(V^j,V˜j)=sinΘ(V^j,V˜jQ˜j1)U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+1Ipj)V^j/srj1(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+1Ipj))2U^j1(Ipj1U^j2)(Ip2pj1U^1)[Z]j1(V^dIpjpd1)(V^j+1Ipj)/srj1(U^j1(Ipj1U^j2)(Ip2pj1U^1)[X]j1(V^dIpjpd1)(V^j+1Ipj))C(i=1dpiriri1)1/2λj1.

Therefore, under E0, (38) holds.

Thus, we have finished the proof of Theorem IV.1.

D. Proof of Corollary IV.1

Let $Q=\{(15)\text{ and }(34)\text{ hold}\}$; then $\mathbb{P}(Q^c)\le C\exp(-cp)$ and

\|\hat X^{(t)}-X\|_{\mathrm F}^2\le C\sum_{i=1}^{d}p_ir_ir_{i-1}\qquad\text{under }Q.

Under Qc, due to the property of projection matrices, we know that

\|\hat X^{(t)}\|_{\mathrm F}\le\|Y\|_{\mathrm F}\le\|X\|_{\mathrm F}+\|Z\|_{\mathrm F}.

Moreover,

\mathbb{E}\|\hat X^{(t)}-X\|_{\mathrm F}^4\le C\big(\mathbb{E}\|\hat X^{(t)}\|_{\mathrm F}^4+\|X\|_{\mathrm F}^4\big)\le C\|X\|_{\mathrm F}^4+C\mathbb{E}\|Z\|_{\mathrm F}^4\le C\exp(4c_0p)+C\mathbb{E}\big(\chi^2_{p_1\cdots p_d}\big)^2\le C\exp(4c_0p)+C(p_1\cdots p_d)^2\le C\exp(4c_0p)+C\exp(2c_0p)\le C\exp(4c_0p).

Therefore, we have the following upper bound for the Frobenius norm risk of X^:

\mathbb{E}\|\hat X^{(t)}-X\|_{\mathrm F}^2=\mathbb{E}\|\hat X^{(t)}-X\|_{\mathrm F}^2 1_{Q}+\mathbb{E}\|\hat X^{(t)}-X\|_{\mathrm F}^2 1_{Q^c}\le C\sum_{i=1}^{d}p_ir_ir_{i-1}+\sqrt{\mathbb{E}\|\hat X^{(t)}-X\|_{\mathrm F}^4\,\mathbb{P}(Q^c)}\le C\sum_{i=1}^{d}p_ir_ir_{i-1}+C\exp((4c_0-c)p/2).

By selecting c0 < c/4, we have

\mathbb{E}\|\hat X^{(t)}-X\|_{\mathrm F}^2\le C\sum_{i=1}^{d}p_ir_ir_{i-1}.

Therefore, we have finished the proof of Corollary IV.1.

E. Proof of Theorem IV.2

Since the i.i.d. Gaussian distribution, Z~N(0,σ2), is a special case of D and

infX^supXFp,r(λ),DDEZ~DX^XF2infX^supXFp,r(λ),Z~i.i.d.N(0,σ2)EZ~DX^XF2,

we only need to focus on the setting where $Z\overset{\mathrm{i.i.d.}}{\sim}N(0,\sigma^2)$ while developing the lower bound result.

Without loss of generality, assume σ² = 1. Since d is a fixed number, we only need to show that for any 1 ≤ i ≤ d,

\inf_{\hat X}\sup_{X\in\mathcal F_{p,r}(\lambda)}\mathbb{E}\|\hat X-X\|_{\mathrm F}^2\ge cp_ir_ir_{i-1}. \quad (39)

Suppose X can be written as (1), where $U_j\in\mathbb{R}^{(p_jr_{j-1})\times r_j}$ and $V_j\in\mathbb{R}^{(p_jr_j)\times r_{j-1}}$ are reshaped from $G_j\in\mathbb{R}^{r_{j-1}\times p_j\times r_j}$, $G_1=U_1$, and $G_d=V_d$. For any 1 ≤ i ≤ d − 1, by Lemma III.1, we have

[X]i=(Ip2piU1)(IpiUi1)UiVi+1(Vi+2Ipi+1)(VdIpi+1pd1). (40)

For all j ≠ i, 1 ≤ j ≤ d − 1, let $U_j\overset{\mathrm{i.i.d.}}{\sim}N(0,1)$ and $V_d\overset{\mathrm{i.i.d.}}{\sim}N(0,1)$, and let U_1, …, U_{i−1}, U_{i+1}, …, U_{d−1}, V_d be independent. By Lemma A.1, for any 1 ≤ j ≤ d − 1, we have

srj((Ip2pjU1)(IpjUj1)Uj)smin(Ip2pjU1)smin(Uj)=sr1(U1)srj(Uj).

Similarly,

srj(Vj+1(Vj+2Ipj+1)(VdIpj+1pd1))srj(Vj+1)srd1(Vd).

Moreover, Lemma A.1 Part 1 tells us

srj((Ip2pjU1)(IpjUj1)UjVj+1(Vj+2Ipj+1)(VdIpj+1pd1))srj((Ip2pjU1)(IpjUj1)Uj)srj(Vj+1(Vj+2Ipj+1)(VdIpj+1pd1))sr1(U1)srj(Uj)srj(Vj+1)srd1(Vd). (41)

Recall that $V_j$ is reshaped from $U_j$ for all 1 ≤ j ≤ d − 1. By [93, Corollary 5.35], we know that with probability at least $1-Ce^{-cp}$, for all 1 ≤ j ≤ d − 1, j ≠ i,

pjrj14pjrj1rjpjrj125srj(Uj)s1(Uj)pjrj1+rj+pjrj1252pjrj1,pjrj4srj1(Vj)s1(Vj)2pjrj, and      pd4srd1(Vd)sr1(Vd)2pd. (42)

For a fixed U0Opiri1,ri, define the following ball with radius ε > 0,

B(U0,ε)={UOpiri1,ri:sinΘ(U,U0)Fε}.

By Lemma 1 in [94], for 0 < α < 1 and 0 < ε ≤ 1, there exist U˜i(1),,U˜i(m)B(U0,ε) such that

m(c0α)ri(piri1ri),min1jkmsinΘ(U˜i(j),U~i(k))Fαε.

By Lemma 1 in [37], one can find a rotation matrix OkOri such that

U0U˜i(k)OkF2sinΘ(U0,U˜i(k))F2ε.

Let U˜i(k)=U˜i(k)Ok, we have

U˜i(k)U0F2ε,sinΘ(U˜i(j),U˜i(k))Fαε,     1j<km.

Let Ui(k)=S+U˜i(k), where S~i.i.d.N(0,τ2). Set τ8/pi, [93][Corollary 5.35] shows that with probability at least 1 − Cecp,

τpiri18τ(piri1ripiri125)1sri(S)s1(U˜i(k))sri(Ui(k))s1(Ui(k))s1(S)+s1(U˜i(k))τ(piri1+ri+piri125)+12τpiri1. (43)

If 2 ≤ id − 1, since Vi(k) is reshaped from Ui(k), we know that Vi(k)=T+V˜i(k), where T~i.i.d.N(0,τ2), and V˜i(k) is realigned from U˜i(k). Notice that

s1(V˜i(k))=V˜i(k)V˜i(k)F=U˜i(k)F=ri,

Since τ8/pi, by [93][Corollary 5.35], with probability at least 1Cecpiri,

τpiri8τ(piriri1piri25)risri(T)s1(V˜i(k))sri(Vi(k))s1(Vi(k))s1(T)+s1(V˜i(k))τ(piri+ri1+piri25)+ri2τpiri. (44)

Choose fixed U1, …, Ui−1, Vi+1, ⋯, Vd, S such that (42), (43) and (44) hold. Let

[X(k)]i=(Ip2piU1)(IpiUi1)Ui(k)Vi+1(Vi+2Ipi+1)(VdIpi+1pd1) (45)

and $X^{(k)}\in\mathbb{R}^{p_1\times\cdots\times p_d}$ is the corresponding tensor. Then (41), (42), (43), and (44) together show that

σrj([X(k)]j)τk=1jpkrk18k=j+1dpkrk8=τp1pdr1rd1Crj (46)

By setting τ=C max1id1λimax1jd1rjp1pdr1rd1  8 max1id11/pi, we have

σrj([X(k)]j)λj,     1jd1

For 1 ≤ k < j ≤ m,

X(k)X(j)F2=(Ip2piU1)(IpiUi1)(Ui(k)Ui(j))Vi+1(Vi+2Ipi+1)(VdIpi+1pd1)F2smin2((Ip2piU1)(IpiUi1))·(Ui(k)Ui(j))Vi+1(Vi+2Ipi+1)(VdIpi+1pd1)F2=sri12((Ip2pi1U1)Ui1)sri2(Vi+1(Vi+2Ipi+1)(VdIpi+1pd1))·Ui(k)Ui(j)F2=sri12((Ip2pi1U1)Ui1)sri2(Vi+1(Vi+2Ipi+1)(VdIpi+1pd1))·U˜i(k)U˜i(j)F2sr12(U1)sri12(Ui1)sri2(Vi+1)srd12(Vd)minOOriU˜i(k)U˜i(j)OF2h=1i1phrh116l=i+1dplrl16minOOriU˜i(k)U˜i(j)OF2h=1i1phrh116l=i+1dplrl16sin Θ(U˜i(k),U˜i(j))F2c(h=1i1phTh1l=i+1dplrl)α2ε2.

In addition, let $Y^{(k)}=X^{(k)}+Z^{(k)}$, where $Z^{(k)}\overset{\mathrm{i.i.d.}}{\sim}N(0,1)$. The KL divergence between the distributions of $Y^{(k)}$ and $Y^{(j)}$ is

DKL(Y(k)Y(j))=12X(k)X(j)F2=12(Ip2ptU1)(IpiUi1)(Ui(k)Ui(j))Vi+1(Vi+2Ipi+1)(VdIpi+1pd1)F212(Ip2ptU1)(IptUi1)2Vi+1(Vi+2Ipi+1)(VdIpi+1pd1)2Ui(k)Ui(j)F212s12(U1)s12(Ui1)s12(Vi+1)s12(Vd)Ui(k)Ui(j)F212h=1i1(4phrh1)l=i+1d(4plrl)(Ui(k)U0F+Ui(k)U0F)2C(h=1i1(phrh1)l=i+1d(plrl))ε2.

By generalized Fano’s Lemma,

infX^supX{X(k)}k=1mEX^XFch=1i1phrh1l=i+1dplrlαε(1C(h=1i1(phrh1)l=i+1d(plrl))ε2+log2ri(piri1ri)log(c0/α)).

By setting ε=cri(piri1ri)Ch=1i1(phrh1)l=i+1d(plrl)12, α = (c_0 ∧ 1)/8, we know that for any 1 ≤ i ≤ d − 1,

infX^supXFp,r(λ)EX^XF2(infX^supX{X(k)}k=1mEX^XF)2c1ripiri1.

For i = d, similarly to the case i = 1, we have

infX^supXFp,r(λ)EX^XF2c1pdrd1.

Therefore, we have proved Theorem IV.2.

F. Proof of Proposition V.1

Define G˜1p×r1, G˜krk1×p×rk, G˜dp×rd1 such that

G˜1,[i,l]=(G1(i))l,     i[p],l[r1],G˜k,[j,i,l]=(Gk(i,ej(rk1)))l,     i[p],j[rk1],l[rk],2kd1,G˜d,[i,l]=Gd(i,el(rd1)),i[p],l[rd1]

where ei(k) is the i-th canonical basis of k. Then

P˜1(Xt+1)=G˜1,[Xt+1,:]r1,P˜2(Xt+1,Xt+2)=G2(Xt+2,P˜1(Xt+1))=linear mapj=1r1G2(Xt+2,ej(r1))(P˜1(Xt+1))j=(G˜1,[Xt+1,:] G˜2,[:,Xt+2,:]).

By induction, for any 2 ≤ k ≤ d − 1,

P˜k(Xt+1,,Xt+k)=Gk(Xt+k,P˜k1(Xt+1,,Xt+k1)) =linear mapj=1rk1Gk(Xt+k,ej(rk1))(P˜k1(Xt+1,,Xt+k1))j=G˜k,[:,Xt+k,:]P˜k1(Xt+1,,Xt+k1)=(G˜1,[Xt+1,:] G˜2,[:,Xt+2,:]G˜k,[:,Xt+k,:])

and

(Xt+dXt+1,,Xt+d1)=Gd(Xt+d,P˜d1(Xt+1,,Xt+d1))=P˜d1(Xt+1,,Xt+d1)G˜d,[Xt+d,:]=G˜1,[Xt+1,:] G˜2,[:,Xt+2,:]G˜d1,[:,Xt+d1,:]G˜d,[Xt+d,:].

Therefore,

P=[\![\tilde G_1,\tilde G_2,\ldots,\tilde G_{d-1},\tilde G_d]\!]

and has TT-rank (r1, …, rd−1).
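
The factorization above can be evaluated core by core: the conditional probability is the product of the slices $\tilde G_{1,[X_{t+1},:]}$, $\tilde G_{2,[:,X_{t+2},:]}$, …, $\tilde G_{d,[X_{t+d},:]}$. The following sketch, with random placeholder cores and boundary ranks set to one so that all cores can be handled uniformly in the $r_{k-1}\times p\times r_k$ convention above, illustrates this TT evaluation; it is not the authors' code.

import numpy as np

rng = np.random.default_rng(4)
p, ranks = 5, [1, 3, 4, 1]                     # d = 3 cores with TT-ranks (r_0, r_1, r_2, r_3)
cores = [rng.random(size=(ranks[k], p, ranks[k + 1])) for k in range(3)]

def tt_entry(cores, index):
    # Multiply the core slices G_1[x_1, :], G_2[:, x_2, :], ..., G_d[x_d, :] from left to right.
    vec = np.ones((1, 1))
    for G, x in zip(cores, index):
        vec = vec @ G[:, x, :]
    return float(vec[0, 0])

value = tt_entry(cores, (0, 2, 4))             # one entry of the order-3 TT tensor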

G. Proof of Proposition V.2

Let $Z=\hat P^{\mathrm{emp}}-P$; then $\mathbb{E}Z=0$. Let

Ti1,,id(k)=1{X(i1,,id1;k)=id},     1kn;1i1,,idp

and

Zi1,,id(k)=Ti1,,id(k)(idi1,,id1),1kn;1i1,,idp.

Then EZ(k)=0. Moreover, by definition, for any 1 ≤ jd − 1, the rows of [Z(k)]jpj×pdj are independent, and there exists a partition {Ω1(j),,Ωpdj1(j)} of {1, …, pdj} satisfying |Ω1(j)|==|Ωpdj1(j)|p=p, such that ([Z(k)]j)[:,Ω1(j)],,([Z(k)]j)[:,Ωpdj1(j)] are independent and

lΩi(j)([T(k)]j)m,l=1,     1mpj,1kn.

Therefore,

lΩi(j)|([Z(k)]j)m,l|lΩi(j)([T(k)]j)m,l+ElΩi(j)([T(k)]j)m,l=2,     1mpj,1kn.

For any fixed x1pj and x2pdj satisfying ∥x12 = 1 and ∥x2 = 1, we have

|lΩi(j)([Z(k)]j)m,l(x2)l|maxlΩi(j)(x2)llΩi(j)|([Z(k)]j)m,l|2maxlΩi(j)(x2)l2(x2)Ωi(j)2.

By [95, Exercise 2.4], lΩi(j)([Z(k)]j)m,l(x2)l is 2(x2)Ωi(j)2-sub-Gaussian. Therefore,

x1[Z(k)]jx2=m=1pj(x1)mi=1pdj1(lΩi(j)([Z(k)]j)m,l(x2)l)

is (m=1pj(x1)m2i=1pdj14(x2)Ωi(j)22)1/2=2x12x22=2-subGaussian. Notice that Z=1nk=1nZ(k), the Hoeffding bound [95, Proposition 2.5] shows that

(|x1[Z]jx2|t)2 exp(nt28),     t0.

Therefore, for any fixed UOpj,rj, VOpdj,prj+1, xrj, yprj+1 with ∥x2 = 1 and ∥y2 = 1,

(|xU[Z]jVy|t)2 exp(nt28),     t0.

Similarly to the proof of (49), with probability at least $1-Ce^{-cp}$, for all 1 ≤ k ≤ d − 1,

U^k(0)(IpU^k1(0))(Ipk1U^1(0))[Z]k(V^d(1)Ipdk1)(V^k+2(1)Ip)Ci=1dpiriri1n.

Similarly, with probability at least $1-Ce^{-cp}$,

[Z]1(V^d(1)Ipd2)(V^3(1)Ip)V^2(1)Ci=1dpiriri1n.

Noticing that $\|X\|_{\mathrm F}\le\sqrt{r}\|X\|$ if rank(X) = r, by the previous two inequalities and Theorem III.1, we know that with probability at least $1-Ce^{-cp}$,

P^(1)PF2C(max1id1ri)i=1dpiriri1n.

Finally, by the definition of P^, we have

\|\hat P-P\|_{\mathrm F}\le\|\hat P^{(1)}-P\|_{\mathrm F}+\|\hat P^{(1)}-\hat P\|_{\mathrm F}\le 2\|\hat P^{(1)}-P\|_{\mathrm F},

which finishes the proof of Proposition V.2.

H. Proof of Lemma III.3

By symmetry, we only need to prove (6). By definition, (6) holds for k = 1. Suppose it holds for k = j. For k = j + 1, since $S_{j+1}\in\mathbb{R}^{(r_jp_{j+1})\times(p_{j+2}\cdots p_d)}$ is realigned from $\tilde S_j=M_jS_j\in\mathbb{R}^{r_j\times(p_{j+1}\cdots p_d)}$, Lemma III.2 implies that $S_{j+1}=(I_{p_{j+1}}\otimes\tilde S_j)A^{(p_{j+1},p_{j+2}\cdots p_d)}$, where the realignment matrix $A^{(i,j)}$ is defined in (5). Therefore,

Sj+1=(Ipj+1 S˜j)A(pj+1,pj+2pd)=(Ipj+1MjSj)A(pj+1,pj+2pd)=(Ipj+1Mj)(Ipj+1Sj)A(pj+1,pj+2pd)=(Ipj+1Mj)(Ipj+1((IpjMj1)(Ip2pjM1)[T]j))A(pj+1,pj+2pd)=(Ipj+1Mj)(Ipj+1(IpjMj1))(Ipj+1(Ip2pjM1))(Ipj+1[T]j)A(pj+1,pj+2pd)=(Ipj+1Mj)(Ipjpj+1Mj1)(Ip2pj+1M1)[T]j+1.

The third equation and the fifth equation hold since (AB)(CD) = (AC) ⊗ (BD); the last equation holds since Yj+1=(Ipj+1Yj)A(pj+1,pj+2pd) and A ⊗ (BC) = (AB) ⊗ C.

Also, noticing that $\tilde S_k=M_kS_k$, we have finished the proof of (6).
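
The mixed-product property (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD) invoked above can be checked numerically; the matrix sizes below are arbitrary and chosen only for illustration.

import numpy as np

rng = np.random.default_rng(5)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(2, 5))
C, D = rng.normal(size=(4, 6)), rng.normal(size=(5, 2))

lhs = np.kron(A, B) @ np.kron(C, D)
rhs = np.kron(A @ C, B @ D)
print(np.allclose(lhs, rhs))                   # True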

I. Technical Lemmas

We collect the additional technical lemmas in this section.

Lemma A.1.

  1. Suppose $A\in\mathbb{R}^{m_1\times m_2}$, $B\in\mathbb{R}^{m_2\times m_3}$, where $m_1\ge m_2$. Then
    $s_{\min\{m_2,m_3\}}(AB)\ge s_{m_2}(A)\,s_{\min\{m_2,m_3\}}(B).$
  2. Suppose Am×p1, Bn×p2, Xp1×p2, rank(X) = r, p1m, p2n. If X=U1MV1, where U1Op1,m, and V1Op2,n, then
    σr(AXB)smin(AU1)σr(X)smin(V1B).

Proof of Lemma A.1. (1) Consider the SVD decomposition A=UAΣAVA, B=UBΣBVB, where UAOm1,m2, VAOm2, UBOm2,min{m2,m3}, VBOmin{m2,m3},m3, ΣA=diag(σ1(A),,sm2(A)) and ΣB=diag(s1(B),,smin{m2,m3}(B)) are diagonal matrices with nonnegative diagonal entries. Then

smin{m2,m3}(AB)=smin{m2,m3}(UAΣAVAUBΣBVB)=smin{m2,m3}(ΣAVAUBΣB).

For any xmin{m2,m3} satisfying ∥x2 = 1, we have

ΣAVAUBΣBx2sm2(A)VAUBΣBx2=sm2(A)ΣBx2sm2(A)smin{m2,m3}(B).

Therefore

smin{m2,m3}(AB)=smin{m2,m3}(ΣAVAUBΣB)sm2(A)smin{m2,m3}(B).

(2) Consider the SVD decomposition X = UΣV, where UOp1,r, VOp2,r and Σ is a diagonal matrix. Then we know that there exist two matrices Lm×r and Rn×r satisfying U = U1L and V = V1R. Moreover,

LL=LU1U1L=UU=Ir,RR=RV1V1R=VV=Ir.

Therefore,

\sigma_r(AXB)=\sigma_r(AU_1L\Sigma R^\top V_1^\top B)\ge s_{\min}(AU_1)\,\sigma_r(L\Sigma R^\top)\,s_{\min}(V_1^\top B)=s_{\min}(AU_1)\,\sigma_r(X)\,s_{\min}(V_1^\top B).
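
Part 1 of Lemma A.1 can also be checked numerically on random instances; the Gaussian test matrices below are an illustrative assumption and the check is not a substitute for the proof above.

import numpy as np

rng = np.random.default_rng(6)
m1, m2, m3 = 8, 5, 6
for _ in range(100):
    A = rng.normal(size=(m1, m2))              # m1 >= m2
    B = rng.normal(size=(m2, m3))
    s_AB = np.linalg.svd(A @ B, compute_uv=False)
    s_A = np.linalg.svd(A, compute_uv=False)
    s_B = np.linalg.svd(B, compute_uv=False)
    k = min(m2, m3)
    # s_min{m2,m3}(AB) >= s_{m2}(A) * s_min{m2,m3}(B), up to floating-point tolerance
    assert s_AB[k - 1] >= s_A[m2 - 1] * s_B[k - 1] - 1e-10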

Lemma A.2. Suppose Z is a matrix with independent zero-mean σ-sub-Gaussian entries, d is a fixed number, r0 = rd = 1.

  1. Suppose $Z\in\mathbb{R}^{p\times q}$, $A\in\mathbb{R}^{m\times p}$, $B\in\mathbb{R}^{q\times n}$ satisfy ∥A∥, ∥B∥ ≤ 1, m ≤ p, n ≤ q. Then
    (AZB2σm+t)25nexp[c min(t2m,t)]. (47)
    (AZBFσmn+t)2 exp[c min(t2mn,t)]. (48)
  2. Suppose $Z\in\mathbb{R}^{(p_1\cdots p_k)\times m}$, 2 ≤ k ≤ d − 1. Then
    maxUi(piri1)×riUi1(IpkUk1)(Ip2pkU1)ZCσi=1k1piri1ri+pkrk1+m. (49)
    with probability at least $1-C\exp(-c(\sum_{i=1}^{k-1}p_ir_{i-1}r_i+p_kr_{k-1}+m))$.
  3. Suppose $Z\in\mathbb{R}^{(p_1\cdots p_k)\times(p_{k+1}\cdots p_d)}$, 2 ≤ k ≤ d − 2. Then
    max(U1,,Vd)AUk(IpkUk1)(Ip2pkU1)Z(VdIpk+1pd1)(Vk+2Ipk+1)Cσi=1dpiri1ri (50)
    with probability at least $1-C\exp(-c\sum_{i=1}^{d}p_ir_{i-1}r_i)$. Here,
    A={(U1,,Uk,Vk+2,,Vd):Ui(piri1)×ri,Ui1,Vj(piri)×ri1,Vj1}. (51)
  4. Suppose $Z\in\mathbb{R}^{(p_1\cdots p_{d-1})\times p_d}$. Then with probability at least $1-C\exp(-c\sum_{i=1}^{d}p_ir_{i-1}r_i)$,
    maxUi(piri1)×ri,Ui1Ud1(Ipd1Ud2)(Ip2pd1U1)ZFCσi=1dpiri1ri. (52)
  5. Suppose $Z\in\mathbb{R}^{(p_1\cdots p_k)\times(p_{k+1}\cdots p_d)}$, 2 ≤ k ≤ d − 2. Then
    max(U1,,Vd)AUk(IpkUk1)(Ip2pkU1)Z·(VdIpk+1pd1)(Vk+2Ipk+1)FCσi=1dpiri1ri (53)
    with probability at least $1-C\exp(-c\sum_{i=1}^{d}p_ir_{i-1}r_i)$. Here, A is defined in (51).

Proof of Lemma A.2. Without loss of generality, assume σ = 1.

  1. For fixed xn satisfying ∥x2 = 1, we have AZBx = (xBA)vec(Z). Since Zij is 1-sub-Gaussian, we know that Var(Zij) ≤ 1. In addition,
    E(xBA)vec(Z)22=E[tr(vec(Z)(xBA)(xBA)vec(Z))]=tr[E((xBA)(xBA)vec(Z)vec(Z))]=tr[(xBA)(xBA)E(vec(Z)vec(Z))]tr((xBA)(xBA))=xBAF2=Bx22AF2x22AF2m. (54)
    The first inequality holds since E(vec(Z)vec(Z)) is a diagonal matrix with diagonal entries Var(Zij) ≤ 1; the last inequality is due to ∥AF ≤ min{m, p}∥A2m. By Hanson-Wright inequality, we have
    (AZBx22mt)2 exp[c min(t2(BxxB)(AA)F2,t(BxxB)(AA))].
    Since ∥x2 = 1 and ∥A∥, ∥B∥ ≤ 1,
    (BxxB)(AA)F2=BxxBF2AAF2=(xBBx)2AAF2(xx)2AAF2=i=1min{m,p}σi4(A)m, (BxxB)(AA)BxxBAAxxAA1. 
    Thus, for fixed x satisfying ∥x2 = 1, we have
    (AZBx22m+t)2 exp[c min(t2m,t)]. (55)
    By [93][Lemma 5.2], there exists N1/2, a 1/2-net of {xn:x2=1}, such that |N1/2|5n. The union bound, [93][Lemma 5.2] and (55) together imply that
    (AZB2m+t)(maxxN1/2AZBx2m+t)25nexp[c min(t2m,t)].
    For ∥AZBF, note that AZB = (BA)vec(Z), Similarly to (54), we have
    E(BA)vec(Z)22=E[vec(Z)(BA)(BA)vec(Z)]=E{tr[vec(Z)(BA)(BA)vec(Z)]}=tr{E[(BA)(BA)vec(Z)vec(Z)]}=tr[(BA)(BA)E(vec(Z)vec(Z))]tr[(BA)(BA)]=BAF2=BF2AF2mn.
    By Hanson-Wright inequality, we have
    (AZBF2mnt)2 exp[c min(t2(BB)(AA)F2,t(BB)(AA))].
    Since ∥A∥, ∥B∥ ≤ 1, we have
    (BB)(AA)F=AAF2BBF2=i=1min{m,p}σi4(A)i=1min{q,n}σi4(B)mn,(BB)(AA)1.
    Therefore,
    (AZBF2mn+t)2 exp[c min(t2mn,t)].
  2. For fixed xm and A(pkrk1)×(p1pk) satisfying ∥x2 = 1 and ∥A∥ ≤ 1, by (47) with B = Im, we have
    (AZ2pkrk1+t)25mexp[c min(t2pkrk1,t)]. (56)
    By [48][Lemma 7], for 1 ≤ ik − 1, that exist ε-nets: Ui(1),,Ui(Ni)(piri1)×ri (here r0 = 1), Ni((2+ε)/ε)(piri1)×ri, such that
    U(piri1)×ri satisfying U1,1jNi s.t. Ui(j)Uε.
    Therefore,
    (maxi1,,ik1(IpkUk1(ik1))(Ip2pkU1(i1))Z2pkrk1+t)2((2+ε)/ε)i=1k1piri1ri5mexp[c min(t2pkrk1,t)]. (57)
    Let
    U1*,,Uk1* arg maxUi(piri1)×ri,Ui1,     1ik1(IpkUk1)(Ip2pkU1)Z,M=maxUi(piri1)×ri,Ui1,    1ik1(IpkUk1)(Ip2pkU1)Z.
    Then for any 1 ≤ ik − 1, there exists 1 ≤ jiNi, such that Ui(ji)Ui*ε. Then
    M=(IpkUk1*)(Ip2pkU1*)Z(IpkUk1(jk1))(Ip2pkU1(j1))Z+(Ipk(Uk1*Uk1(jk1)))(Ipk1pkUk2(jk2))(Ip2pkU1(j1))Z++(IpkUk1*)(Ip3pkU2*)(Ip2pk(U1*U1(j1)))Z(IpkUk1(jk1))(Ip2pkU1(j1))Z+ε(k1)M. (58)
    Combining (57) and the previous inequality, we have
    (M2pkrk1+t1(k1)ε)2((2+ε)/ε)i=1k1piri1ri5mexp[c min(t2pkrk1,t)]. (59)
    By setting ε=12(k1) and t=Ci=1k1piri1ri+pkrk1+m, we have proved (49).
  3. For fixed Ark×(p1pk), B(pk+1pd)×(pk+1rk+1) satisfying ∥A∥ ≤ 1, ∥B∥ ≤ 1, by (47), we have
    (AZB2rk+t)25pk+1rk+1exp[c min(t2rk,t)].
    Let
    M=max(U1,,Vd)AUk(IpkUk1)(Ip2pkU1)Z(VdIpk+1pd1)(Vk+2Ipk+1),
    By similar arguments as (59), one has
    (M2rk+t1(d1)ε)2((2+ε)/ε)1id,ik+1piri1ri5pk+1rk+1exp[c min(t2rk,t)]
    for any 0<ϵ<1d. By setting ε=12(d1) and t=Ci=1dpiri1ri, we have proved the third part of Lemma A.2.
  4. For fixed U1, …, Ud−1 satisfying ∥Ui∥ ≤ 1, let A=Ud1(Ipd1Ud2)(Ip2pd1U1)rd1×(p1pd1), then ∥A∥ ≤ 1. By (48) with B=Ipd, we have
    (AZF2pdrd1+t)2 exp[c min(t2pdrd1,t)].
    Let
    M=maxUi(piri1)×ri,Ui1Ud1(Ipd1Ud2)(Ip2pd1U1)ZF.
    The similar proof of (59) leads us to
    (M2rd1pd+t(1ε(d1))2)2((2+ε)/ε)k=1d1pkrk1rkexp[c min(t2pdrd1,t)]. (60)
    for 0<ε<1d1. By setting ε=12(d1) and t=Ck=1dpkrk1rk, we have arrived at (52).
  5. For fixed Ark×(p1pk), B(pk+1pd)×(pk+1rk+1), ∥A∥ ≤ 1, ∥B∥ ≤ 1, by (48), we have
    (AZBF2pk+1rk+1rk+t)2 exp[c min(t2pk+1rk+1rk,t)].
    Let
    M=max(U1,,Vd)AUk(IpkUk1)(Ip2pkU1)Z(VdIpk+1pd1)(Vk+2Ipk+1)F.
    Similarly to (59), for any 0<ε<1d1, we have
    (Mpk+1rk+1rk+t1(d1)ε)2((2+ε)/ε)1id,ik+1piri1riexp[c min(t2pk+1rk+1rk,t)]. (61)
    By setting ε=12(d1) and t=Ci=1dpiri1ri, we have proved (53). □
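
As a small illustration of the scaling in Part 1 of Lemma A.2 (not a proof), the following simulation draws a Gaussian Z and orthonormal A, B with ‖A‖ = ‖B‖ = 1 and compares the typical value of ‖AZB‖ with a 2√m benchmark suggested by (47); the matrix sizes and the Gaussian choice of Z are assumptions made only for this check.

import numpy as np

rng = np.random.default_rng(7)
m, p, q, n = 20, 200, 150, 10
A = np.linalg.qr(rng.normal(size=(p, m)))[0].T   # m x p with orthonormal rows, so ||A|| = 1
B = np.linalg.qr(rng.normal(size=(q, n)))[0]     # q x n with orthonormal columns, so ||B|| = 1

norms = [np.linalg.norm(A @ rng.normal(size=(p, q)) @ B, 2) for _ in range(200)]
print(np.mean(norms), 2 * np.sqrt(m))            # typical value of ||AZB|| vs. the 2*sqrt(m) scale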

Lemma A.3. Suppose $X,Z\in\mathbb{R}^{p_1\times p_2}$, rank(X) = r. Let $Y=X+Z$, $\hat U=\mathrm{SVD}_r^L(Y)$, $\hat V=\mathrm{SVD}_r^R(Y)$. Then we have

\max\{\|\hat U_\perp^\top X\|,\|X\hat V_\perp\|\}\le 2\|Z\|,\qquad \max\{\|\hat U_\perp^\top X\|_{\mathrm F},\|X\hat V_\perp\|_{\mathrm F}\}\le 2\min\{\|Z\|_{\mathrm F},\sqrt{r}\|Z\|\}.

Proof of Lemma A.3. See [48, Lemma 6] and [96, Theorem 1]. □
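
Lemma A.3 can likewise be checked numerically; the random rank-r signal and Gaussian noise below are illustrative assumptions used only for this sanity check.

import numpy as np

rng = np.random.default_rng(8)
p1, p2, r = 40, 30, 3
for _ in range(50):
    X = rng.normal(size=(p1, r)) @ rng.normal(size=(r, p2))   # rank-r signal
    Z = 0.1 * rng.normal(size=(p1, p2))
    U, _, Vt = np.linalg.svd(X + Z)
    U_perp, V_perp = U[:, r:], Vt[r:, :].T                    # complements of the top-r singular subspaces of Y
    assert np.linalg.norm(U_perp.T @ X, 2) <= 2 * np.linalg.norm(Z, 2) + 1e-10
    assert np.linalg.norm(X @ V_perp, 2) <= 2 * np.linalg.norm(Z, 2) + 1e-10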


Contributor Information

Yuchen Zhou, Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA.

Anru R. Zhang, Departments of Biostatistics & Bioinformatics, Computer Science, Mathematics, and Statistical Science, Duke University, Durham, NC 27710, USA

Lili Zheng, Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA.

Yazhen Wang, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA.

REFERENCES

  • [1].Oseledets IV, “Tensor-train decomposition,” SIAM Journal on Scientific Computing, vol. 33, no. 5, pp. 2295–2317, 2011. [Google Scholar]
  • [2].Bi X, Qu A, and Shen X, “Multilayer tensor factorization with applications to recommender systems,” The Annals of Statistics, vol. 46, no. 6B, pp. 3308–3333, 2018. [Google Scholar]
  • [3].Nasiri M, Rezghi M, and Minaei B, “Fuzzy dynamic tensor decomposition algorithm for recommender system,” UCT Journal of Research in Science, Engineering and Technology, vol. 2, no. 2, pp. 52–55, 2014. [Google Scholar]
  • [4].Wozniak JR, Krach L, Ward E, Mueller BA, Muetzel R, Schnoebelen S, Kiragu A, and Lim KO, “Neurocognitive and neuroimaging correlates of pediatric traumatic brain injury: a diffusion tensor imaging (dti) study,” Archives of Clinical Neuropsychology, vol. 22, no. 5, pp. 555–568, 2007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Zhou H, Li L, and Zhu H, “Tensor regression with applications in neuroimaging data analysis,” Journal of the American Statistical Association, vol. 108, no. 502, pp. 540–552, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Anandkumar A, Ge R, Hsu D, Kakade SM, and Telgarsky M, “Tensor decompositions for learning latent variable models,” Journal of Machine Learning Research, vol. 15, pp. 2773–2832, 2014. [Google Scholar]
  • [7].Oseledets IV and Tyrtyshnikov EE, “Breaking the curse of dimensionality, or how to use svd in many dimensions,” SIAM Journal on Scientific Computing, vol. 31, no. 5, pp. 3744–3759, 2009. [Google Scholar]
  • [8].Cichocki A, Mandic D, De Lathauwer L, Zhou G, Zhao Q, Caiafa C, and Phan HA, “Tensor decompositions for signal processing applications: From two-way to multiway component analysis,” IEEE signal processing magazine, vol. 32, no. 2, pp. 145–163, 2015. [Google Scholar]
  • [9].Mondelli M and Montanari A, “On the connection between learning two-layer neural networks and tensor decomposition,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 1051–1060. [Google Scholar]
  • [10].Zhong K, Song Z, and Dhillon IS, “Learning non-overlapping convolutional neural networks with multiple kernels,” arXiv preprint arXiv:1711.03440, 2017. [Google Scholar]
  • [11].Li N and Li B, “Tensor completion for on-board compression of hyperspectral images,” in 2010 IEEE International Conference on Image Processing. IEEE, 2010, pp. 517–520. [Google Scholar]
  • [12].Zhang C, Han R, Zhang AR, and Voyles PM, “Denoising atomic resolution 4d scanning transmission electron microscopy data with tensor singular value decomposition,” Ultramicroscopy, vol. 219, p. 113123, 2020. [DOI] [PubMed] [Google Scholar]
  • [13].Bhattacharya A and Dunson DB, “Simplex factor models for multi-variate unordered categorical data,” Journal of the American Statistical Association, vol. 107, no. 497, pp. 362–377, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [14].Dunson DB and Xing C, “Nonparametric bayes modeling of multi-variate categorical data,” Journal of the American Statistical Association, vol. 104, no. 487, pp. 1042–1051, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Calvi GG, Moniri A, Mahfouz M, Yu Z, Zhao Q, and Mandic DP, “Tucker tensor layer in fully connected neural networks,” arXiv preprint arXiv:1903.06133, 2019. [Google Scholar]
  • [16].Novikov A, Podoprikhin D, Osokin A, and Vetrov DP, “Tensorizing neural networks,” in Advances in neural information processing systems, 2015, pp. 442–450. [Google Scholar]
  • [17].Novikov A, Rodomanov A, Osokin A, and Vetrov D, “Putting mrfs on a tensor train,” in International Conference on Machine Learning, 2014, pp. 811–819. [Google Scholar]
  • [18].Fannes M, Nachtergaele B, and Werner RF, “Finitely correlated states on quantum spin chains,” Communications in mathematical physics, vol. 144, no. 3, pp. 443–490, 1992. [Google Scholar]
  • [19].Oseledets I, “A new tensor decomposition,” in Doklady Mathematics, vol. 80, no. 1. Pleiades Publishing, Ltd., 2009, pp. 495–496. [Google Scholar]
  • [20].Oseledets I and Tyrtyshnikov E, “Recursive decomposition of multidimensional tensors,” in Doklady Mathematics, vol. 80, no. 1. Springer, 2009, pp. 460–462. [Google Scholar]
  • [21].Orús R, “Tensor networks for complex quantum systems,” Nature Reviews Physics, vol. 1, no. 9, pp. 538–550, 2019. [Google Scholar]
  • [22].Bravyi S, Gosset D, and Movassagh R, “Classical algorithms for quantum mean values,” Nature Physics, vol. 17, no. 3, pp. 337–341, 2021. [Google Scholar]
  • [23].Rakhuba M and Oseledets I, “Calculating vibrational spectra of molecules using tensor train decomposition,” The Journal of Chemical Physics, vol. 145, no. 12, p. 124101, 2016. [DOI] [PubMed] [Google Scholar]
  • [24].Schollwöck U, “The density-matrix renormalization group in the age of matrix product states,” Annals of physics, vol. 326, no. 1, pp. 96–192, 2011. [Google Scholar]
  • [25].Stoudenmire E and Schwab DJ, “Supervised learning with tensor networks,” in Advances in Neural Information Processing Systems, 2016, pp. 4799–4807. [Google Scholar]
  • [26].Bigoni D, Engsig-Karup AP, and Marzouk YM, “Spectral tensor-train decomposition,” SIAM Journal on Scientific Computing, vol. 38, no. 4, pp. A2405–A2439, 2016. [Google Scholar]
  • [27].Oseledets I and Tyrtyshnikov E, “Tt-cross approximation for multidimensional arrays,” Linear Algebra and its Applications, vol. 432, no. 1, pp. 70–88, 2010. [Google Scholar]
  • [28].Hillar CJ and Lim L-H, “Most tensor problems are np-hard,” Journal of the ACM (JACM), vol. 60, no. 6, pp. 1–39, 2013. [Google Scholar]
  • [29].Dolgov SV and Savostyanov DV, “Alternating minimal energy methods for linear systems in higher dimensions,” SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. A2248–A2271, 2014. [Google Scholar]
  • [30].Song Z, Woodruff DP, and Zhong P, “Relative error tensor low rank approximation,” in Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2019, pp. 2772–2789. [Google Scholar]
  • [31].Li L, Yu W, and Batselier K, “Faster tensor train decomposition for sparse data,” Journal of Computational and Applied Mathematics, vol. 405, p. 113972, 2022. [Google Scholar]
  • [32].Lubich C, Rohwedder T, Schneider R, and Vandereycken B, “Dynamical approximation by hierarchical tucker and tensor-train tensors,” SIAM Journal on Matrix Analysis and Applications, vol. 34, no. 2, pp. 470–494, 2013. [Google Scholar]
  • [33].Grasedyck L, Kluge M, and Kramer S, “Variants of alternating least squares tensor completion in the tensor train format,” SIAM Journal on Scientific Computing, vol. 37, no. 5, pp. A2424–A2450, 2015. [Google Scholar]
  • [34].Bengua JA, Phien HN, Tuan HD, and Do MN, “Efficient tensor completion for color image and video recovery: Low-rank tensor train,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2466–2479, 2017. [DOI] [PubMed] [Google Scholar]
  • [35].Steinlechner MM, “Riemannian optimization for solving highd-imensional problems with low-rank tensor structure,” EPFL, Tech. Rep, 2016. [Google Scholar]
  • [36].Novikov A, Izmailov P, Khrulkov V, Figurnov M, and Oseledets IV, “Tensor train decomposition on tensorflow (t3f),” Journal of Machine Learning Research, vol. 21, no. 30, pp. 1–7, 2020. [Google Scholar]
  • [37].Cai TT and Zhang A, “Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics,” The Annals of Statistics, vol. 46, no. 1, pp. 60–89, 2018. [Google Scholar]
  • [38].Candes EJ, Sing-Long CA, and Trzasko JD, “Unbiased risk estimates for singular value thresholding and spectral estimators,” IEEE transactions on signal processing, vol. 61, no. 19, pp. 4643–4657, 2013. [Google Scholar]
  • [39].Donoho D and Gavish M, “Minimax risk of matrix denoising by singular value thresholding,” The Annals of Statistics, vol. 42, no. 6, pp. 2413–2440, 2014. [Google Scholar]
  • [40].Cai J-F, Candès EJ, and Shen Z, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on optimization, vol. 20, no. 4, pp. 1956–1982, 2010. [Google Scholar]
  • [41].Chatterjee S, “Matrix estimation by universal singular value thresholding,” The Annals of Statistics, vol. 43, no. 1, pp. 177–214, 2015. [Google Scholar]
  • [42].Klopp O, “Matrix completion by singular value thresholding: sharp bounds,” Electronic journal of statistics, vol. 9, no. 2, pp. 2348–2369, 2015. [Google Scholar]
  • [43].Zhang H, Cheng L, and Zhu W, “A lower bound guaranteeing exact matrix completion via singular value thresholding algorithm,” Applied and Computational Harmonic Analysis, vol. 31, no. 3, pp. 454–459, 2011. [Google Scholar]
  • [44].Nadler B, “Finite sample approximation results for principal component analysis: A matrix perturbation approach,” The Annals of Statistics, vol. 36, no. 6, pp. 2791–2817, 2008. [Google Scholar]
  • [45].Zhang A and Wang M, “Spectral state compression of markov processes,” IEEE Transactions on Information Theory, vol. 66, no. 5, pp. 3202–3231, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [46].De Lathauwer L, De Moor B, and Vandewalle J, “A multilinear singular value decomposition,” SIAM journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253–1278, 2000. [Google Scholar]
  • [47].——, “On the best rank-1 and rank-(r 1, r 2,…, rn) approximation of higher-order tensors,” SIAM journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1324–1342, 2000. [Google Scholar]
  • [48].Zhang A and Xia D, “Tensor SVD: Statistical and computational limits,” IEEE Transactions on Information Theory, vol. 64, no. 11, pp. 7311–7338, 2018. [Google Scholar]
  • [49].Vannieuwenhoven N, Vandebril R, and Meerbergen K, “A new truncation strategy for the higher-order singular value decomposition,” SIAM Journal on Scientific Computing, vol. 34, no. 2, pp. A1027–A1052, 2012. [Google Scholar]
  • [50].Zhang A and Han R, “Optimal sparse singular value decomposition for high-dimensional high-order data,” Journal of the American Statistical Association, pp. 1–34, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [51].Kolda TG and Bader BW, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009. [Google Scholar]
  • [52].Sharan V and Valiant G, “Orthogonalized als: A theoretically principled tensor decomposition algorithm for practical use,” in International Conference on Machine Learning, 2017, pp. 3095–3104. [Google Scholar]
  • [53].Leurgans SE, Ross RT, and Abel RB, “A decomposition for three-way arrays,” SIAM Journal on Matrix Analysis and Applications, vol. 14, no. 4, pp. 1064–1083, 1993. [Google Scholar]
  • [54].Rajih M, Comon P, and Harshman RA, “Enhanced line search: A novel method to accelerate parafac,” SIAM journal on matrix analysis and applications, vol. 30, no. 3, pp. 1128–1147, 2008. [Google Scholar]
  • [55].Colombo N and Vlassis N, “Tensor decomposition via joint matrix schur decomposition,” in International Conference on Machine Learning, 2016, pp. 2820–2828. [Google Scholar]
  • [56].Anandkumar A, Deng Y, Ge R, and Mobahi H, “Homotopy analysis for tensor pca,” in Conference on Learning Theory. PMLR, 2017, pp. 79–104. [Google Scholar]
  • [57].Arous GB, Mei S, Montanari A, and Nica M, “The landscape of the spiked tensor model,” Communications on Pure and Applied Mathematics, vol. 72, no. 11, pp. 2282–2330, 2019. [Google Scholar]
  • [58].Hopkins SB, Shi J, and Steurer D, “Tensor principal component analysis via sum-of-square proofs,” in Conference on Learning Theory, 2015, pp. 956–1006. [Google Scholar]
  • [59].Luo Y and Zhang AR, “Tensor clustering with planted structures: Statistical optimality and computational limits,” arXiv preprint arXiv:2005.10743, 2020. [Google Scholar]
  • [60].Perry A, Wein AS, and Bandeira AS, “Statistical limits of spiked tensor models,” in Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 56, no. 1. Institut Henri Poincaré, 2020, pp. 230–264. [Google Scholar]
  • [61].Richard E and Montanari A, “A statistical model for tensor pca,” in Advances in Neural Information Processing Systems, 2014, pp. 2897–2905. [Google Scholar]
  • [62].Lesieur T, Miolane L, Lelarge M, Krzakala F, and Zdeborová L, “Statistical and computational phase transitions in spiked tensor estimation,” in 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2017, pp. 511–515. [Google Scholar]
  • [63].Allen G, “Sparse higher-order principal components analysis,” in Artificial Intelligence and Statistics, 2012, pp. 27–36. [Google Scholar]
  • [64].Allen GI, “Regularized tensor factorizations and higher-order principal components analysis,” arXiv preprint arXiv:1202.2476, 2012. [Google Scholar]
  • [65].Liu Y, Chen L, and Zhu C, “Improved robust tensor principal component analysis via low-rank core matrix,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 6, pp. 1378–1389, 2018. [Google Scholar]
  • [66].Lu C, Feng J, Chen Y, Liu W, Lin Z, and Yan S, “Tensor robust principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5249–5257. [Google Scholar]
  • [67].——, “Tensor robust principal component analysis with a new tensor nuclear norm,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 4, pp. 925–938, 2019. [DOI] [PubMed] [Google Scholar]
  • [68].Zhou P and Feng J, “Outlier-robust tensor pca,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2263–2271. [Google Scholar]
  • [69].Luo Y, Raskutti G, Yuan M, and Zhang AR, “A sharp blockwise tensor perturbation bound for orthogonal iteration,” Journal of machine learning research, vol. 22, no. 179, pp. 1–48, 2021. [Google Scholar]
  • [70].Wein AS, El Alaoui A, and Moore C, “The kikuchi hierarchy and tensor pca,” in 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2019, pp. 1446–1468. [Google Scholar]
  • [71].Benson AR, Gleich DF, and Lim L-H, “The spacey random walk: A stochastic process for higher-order data,” SIAM Review, vol. 59, no. 2, pp. 321–345, 2017. [Google Scholar]
  • [72].Raftery AE, “A model for high-order markov chains,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 47, no. 3, pp. 528–539, 1985. [Google Scholar]
  • [73].Tsay RS, Analysis of financial time series. John wiley & sons, 2005, vol. 543. [Google Scholar]
  • [74].Zhao J and Sun S, “High-order gaussian process dynamical models for traffic flow prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 7, pp. 2014–2019, 2016. [Google Scholar]
  • [75].Berchtold A and Raftery AE, “The mixture transition distribution model for high-order markov chains and non-gaussian time series,” Statistical Science, pp. 328–356, 2002. [Google Scholar]
  • [76].Ganguly A, Petrov T, and Koeppl H, “Markov chain aggregation and its applications to combinatorial reaction networks,” Journal of mathematical biology, vol. 69, no. 3, pp. 767–797, 2014. [DOI] [PubMed] [Google Scholar]
  • [77].Du Z, Ozay N, and Balzano L, “Mode clustering for markov jump systems,” in 2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). IEEE, 2019, pp. 126–130. [Google Scholar]
  • [78].Sanders J, Proutière A, and Yun S-Y, “Clustering in block markov chains,” The Annals of Statistics, vol. to appear, 2020. [Google Scholar]
  • [79].Zhu Z, Li X, Wang M, and Zhang A, “Learning Markov models via low-rank optimization,” arXiv preprint arXiv:1907.00113, 2019. [Google Scholar]
  • [80].Kearns MJ and Singh SP, “Finite-sample convergence rates for q-learning and indirect algorithms,” in Advances in neural information processing systems, 1999, pp. 996–1002. [Google Scholar]
  • [81].Duchi J, Shalev-Shwartz S, Singer Y, and Chandra T, “Efficient projections onto the l 1-ball for learning in high dimensions,” in Proceedings of the 25th international conference on Machine learning, 2008, pp. 272–279. [Google Scholar]
  • [82].Han R, Luo Y, Wang M, and Zhang AR, “Exact clustering in tensor block model: Statistical optimality and computational limit,” arXiv preprint arXiv:2012.09996, 2020. [Google Scholar]
  • [83].Liu Y, Wang F, Xiao Y, and Gao S, “Urban land uses and traffic source-sink areas: Evidence from gps-enabled taxi data in shanghai,” Landscape and Urban Planning, vol. 106, no. 1, pp. 73–87, 2012. [Google Scholar]
  • [84].Li SZ, Markov random field modeling in image analysis. Springer Science & Business Media, 2009. [Google Scholar]
  • [85].Zhang Y, Brady M, and Smith S, “Segmentation of brain mr images through a hidden markov random field model and the expectation-maximization algorithm,” IEEE transactions on medical imaging, vol. 20, no. 1, pp. 45–57, 2001. [DOI] [PubMed] [Google Scholar]
  • [86].Wei Z and Li H, “A markov random field model for network-based analysis of genomic data,” Bioinformatics, vol. 23, no. 12, pp. 1537–1544, 2007. [DOI] [PubMed] [Google Scholar]
  • [87].Chaplot DS, Bhattacharyya P, and Paranjape A, “Unsupervised word sense disambiguation using markov random field and dependency parser,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. [Google Scholar]
  • [88].Wainwright MJ and Jordan MI, “Graphical models, exponential families, and variational inference,” Foundations and Trends® in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008. [Google Scholar]
  • [89].Duan Y, Wang M, Wen Z, and Yuan Y, “Adaptive low-nonnegative-rank approximation for state aggregation of markov chains,” SIAM Journal on Matrix Analysis and Applications, vol. 41, no. 1, pp. 244–278, 2020. [Google Scholar]
  • [90].Puterman ML, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. [Google Scholar]
  • [91].Singh SP, Jaakkola T, and Jordan MI, “Reinforcement learning with soft state aggregation,” in Advances in neural information processing systems, 1995, pp. 361–368. [Google Scholar]
  • [92].Sutton RS and Barto AG, Introduction to reinforcement learning. MIT press Cambridge, 1998, vol. 135. [Google Scholar]
  • [93].Vershynin R, “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027, 2010. [Google Scholar]
  • [94].Cai TT, Ma Z, and Wu Y, “Sparse pca: Optimal rates and adaptive estimation,” The Annals of Statistics, vol. 41, no. 6, pp. 3074–3110, 2013. [Google Scholar]
  • [95].Wainwright MJ, High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019, vol. 48. [Google Scholar]
  • [96].Luo Y, Han R, and Zhang AR, “A schatten-q low-rank matrix perturbation analysis via perturbation projection error bound,” Linear Algebra and its Applications, vol. 630, pp. 225–240, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0024379521002962 [Google Scholar]
