Published in final edited form as: IEEE Trans Inf Theory. 2020 Mar 23;66(9):5927–5964. doi: 10.1109/tit.2020.2982499

Sparse and Low-rank Tensor Estimation via Cubic Sketchings

Botao Hao 1, Anru Zhang 2, Guang Cheng 3

Abstract

In this paper, we propose a general framework for sparse and low-rank tensor estimation from cubic sketchings. A two-stage non-convex implementation is developed based on sparse tensor decomposition and thresholded gradient descent, which ensures exact recovery in the noiseless case and stable recovery in the noisy case with high probability. The non-asymptotic analysis sheds light on an interplay between optimization error and statistical error. The proposed procedure is shown to be rate-optimal under certain conditions. As a technical by-product, novel high-order concentration inequalities are derived for studying high-moment sub-Gaussian tensors. An interesting tensor formulation illustrates the potential application to high-order interaction pursuit in high-dimensional linear regression.

I. Introduction

The rapid advance in modern scientific technology gives rise to a wide range of high-dimensional tensor data [1, 2]. Accurate estimation and fast communication/processing of tensor-valued parameters are crucially important in practice. For example, a tensor-valued predictor which characterizes the association between brain diseases and scientific measurements becomes the point of interest [3, 4, 5]. Another example is the tensor-valued image acquisition algorithm that can considerably reduce the number of required samples by exploiting the compressibility property of signals [6, 7].

The following tensor estimation model has been widely considered in the recent literature:

$$ y_i = \langle \mathcal{J}^*, \mathcal{X}_i \rangle + \epsilon_i, \quad i = 1, \ldots, n. \tag{I.1} $$

Here, Xi and ϵi are the measurement tensor and the noise, respectively. The goal is to estimate the unknown tensor J* from measurements {yi,Xi}i=1n. A number of specific settings with varying forms of Xi have been studied, e.g., tensor completion [8, 9, 10, 11, 12, 13, 14, 15], tensor regression [5, 3, 4, 16, 17, 18, 19], multi-task learning [20], etc.

In this paper, we focus on the case that the measurement tensor can be written in a cubic sketching form. For example, Xi=xixixi or Xi=uiviwi, depending on whether J* is symmetric or not. The cubic sketching form of Xi is motivated by a number of applications.

  • Interaction effect estimation: High-dimensional high-order interaction models have been considered under a variety of settings [21, 22, 23, 24]. By writing Xi=xixixi, we find that the interaction model has an interesting tensor representation (see left panel of Figure 1) which allows us to estimate high-order interaction terms using tensor techniques. This is in contrast with the existing literature that mostly focused on pair-wise interactions due to the model complexity and computational difficulties. More detailed discussions will be provided in Section V.

  • High-order imaging/video compression: High-order imaging/video compression is an important task in modern digital imaging with various applications (see right panel of Figure 1), such as hyper-spectral imaging analysis [25] and facial imaging recognition [26]. One could use Gaussian ensembles for compression such that each entry of Xi is i.i.d. randomly generated [3, 16, 17]. In contrast, the non-symmetric cubic sketchings, i.e., Xi=uiviwi, reduce the memory storage from O(np1p2p3) to O(n(p1 +p2 +p3)) (n is the sample size and (p1,p2,p3) is the tensor dimension), but still preserve the optimal statistical rate. More detailed discussions will be provided in Section VI.

In practice, the total number of measurements n is considerably smaller than the number of parameters in the unknown tensor $\mathcal{J}^*$, due to restrictions such as time and storage. Fortunately, a variety of high-dimensional tensor data possess intrinsic structures, such as low-rankness [2] and sparsity [27], which greatly reduce the effective dimension of the parameter and make accurate estimation possible. Please refer to (III.2) and (VI.2) for the low-rankness and sparsity assumptions.
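For concreteness, the following base-R sketch shows how data from model (I.1) with symmetric cubic sketchings and a sparse, low-rank tensor might be simulated. All sizes, seeds, and variable names below are illustrative choices, not the configurations used in the paper.

```r
# Minimal sketch (base R): y_i = <J*, x_i o x_i o x_i> + eps_i with sparse low-rank J*.
set.seed(1)
p <- 20; K <- 2; s <- 5; n <- 500; sigma <- 0.1

# sparse unit-norm factors beta_k and weights eta_k
B <- matrix(0, p, K)
for (k in 1:K) {
  supp <- sample(p, s)
  B[supp, k] <- rnorm(s)
  B[, k] <- B[, k] / sqrt(sum(B[, k]^2))
}
eta <- runif(K, 1, 2)

# J* = sum_k eta_k * beta_k o beta_k o beta_k  (a p x p x p array)
J <- array(0, dim = c(p, p, p))
for (k in 1:K) J <- J + eta[k] * outer(outer(B[, k], B[, k]), B[, k])

# cubic sketchings: <J*, x o x o x> = sum_k eta_k (x' beta_k)^3
X <- matrix(rnorm(n * p), n, p)                 # rows are the sketching vectors x_i
y <- as.vector((X %*% B)^3 %*% eta) + rnorm(n, sd = sigma)
```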

Fig. 1. Illustration for interaction reformulation and tensor image/video compression.

In this paper, we propose a computationally efficient non-convex optimization approach for sparse and low-rank tensor estimation via cubic-sketchings. Our procedure is two-stage:

  1. obtain an initial estimate via the method of tensor moment (motivated by high-order Stein’s identity), and then apply sparse tensor decomposition to the initial estimate to output a warm start;

  2. use a thresholded gradient descent to iteratively refine the warm start in each tensor mode until convergence.

Theoretically, we carefully characterize the optimization and statistical errors at each iteration step. The output estimate is shown to converge at a geometric rate to an estimator that attains the minimax optimal statistical error rate (in terms of the tensor Frobenius norm). In particular, after a logarithmic number of iterations, whenever $n \gtrsim K^2 (s\log(ep/s))^{3/2}$, the proposed estimator $\hat{\mathcal{J}}$ achieves

$$ \|\hat{\mathcal{J}} - \mathcal{J}^*\|_F^2 \leq \frac{C \sigma^2 K\, s \log(p/s)}{n} \tag{I.2} $$

with high probability, where s, K, p, and σ2 are the sparsity, rank, dimension, and noise level, respectively. We further establish the matching minimax lower bound to show that (I.2) is indeed optimal over a large class of sparse low-rank tensors. Our optimality result can be further extended to the non-sparse case (such as tensor regression [3, 17, 28, 29]) – to the best of our knowledge, this is the first statistical rate optimality result in both sparse and non-sparse low-rank tensor regressions.

The above theoretical analyses are non-trivial due to the non-convexity of the empirical risk function, and the need to develop some new high-order sub-Gaussian concentration inequalities. Specifically, the empirical risk function in consideration satisfies neither restricted strong convexity (RSC) condition nor sparse eigenvalue (SE) condition in general. Thus, many previous results, such as the one based on local optima analysis [30, 31, 17], are not directly applicable. Moreover, the structure of cubic-sketching tensor leads to high-order products of sub-Gaussian random variables. Thus, the matrix analysis based on Hoeffding-type or Bernstein-type concentration inequality [32, 33] will lead to sub-optimal statistical rate and sample complexity. This motivates us to develop new high-order concentration inequalities and sparse tensor-spectral-type bound, i.e., Lemmas 1 and 2 in Section IV-C. These new technical results are obtained based on the careful partial truncation of high-order products of sub-Gaussian random variables and the argument of bounded ψα-norm [34], and may be of independent interest.

The literature on low-rank matrix estimation, e.g., the spectral method and nuclear norm minimization [35, 36, 37], is also related to this work. However, our cubic sketching model is by no means a simple extension of matrix estimation problems. In general, many related concepts or methods for matrix data, such as the singular value decomposition, are problematic to apply in the tensor framework [38, 39]. It has also been found that simply unfolding or matricizing tensors may lead to suboptimal results due to the loss of structural information [40]. Technically, the tensor nuclear norm is NP-hard to even approximate [9, 10, 41], and thus the treatment of tensor low-rankness is distinct from the matrix case.

The rest of the paper is organized as follows. Section II provides preliminaries on notation and basic knowledge of tensor. A two-stage method for symmetric tensor estimation is proposed in Section III, with the corresponding theoretical analysis given in Section IV. A concrete application to high-order interaction effect models is described in Section V. The non-symmetric tensor estimation model is introduced and discussed in Section VI. Numerical analysis is provided in Section VII to support the proposed procedure and theoretical results of this paper. Section VIII discusses extensions to higher-order tensors. The proofs of technical results are given in supplementary materials.

II. Preliminary

Throughout the paper, vectors, matrices, and tensors are denoted by boldface lower-case letters (e.g., x, y), boldface upper-case letters (e.g., X, Y), and script letters (e.g., $\mathcal{X}$, $\mathcal{Y}$), respectively. For any set A, let |A| be its cardinality. diag(x) is the diagonal matrix generated by x. For two vectors x and y, x ∘ y is their outer product. Define $\|x\|_q := (|x_1|^q + \cdots + |x_p|^q)^{1/q}$. We also define the $\ell_0$ quasi-norm $\|x\|_0 = \#\{j : x_j \neq 0\}$ and the $\ell_\infty$ norm $\|x\|_\infty = \max_{1\leq j\leq p}|x_j|$. Denote the set {1, 2, …, n} by [n]. Let $e_j$ be the j-th canonical basis vector, whose j-th entry equals 1 and all other entries equal zero. For any two sequences $\{a_n\}_{n=1}^\infty$, $\{b_n\}_{n=1}^\infty$, we say $a_n = O(b_n)$ if there exist a positive constant $C_0$ and a sufficiently large $n_0$ such that $|a_n| \leq C_0 b_n$ for all $n \geq n_0$. We also write $a_n \asymp b_n$ if there exist $C, c > 0$ such that $c a_n \leq b_n \leq C a_n$ for all $n \geq 1$. Additionally, $C_1, C_2, \ldots, c_1, c_2, \ldots$ are generic constants whose actual values may differ from line to line.

We next introduce notation and operations on matrices. For matrices $A = [a_1, \ldots, a_J] \in \mathbb{R}^{I \times J}$ and $B = [b_1, \ldots, b_L] \in \mathbb{R}^{K \times L}$, their Kronecker product is defined as the (IK)-by-(JL) matrix $A \otimes B = [a_1 \otimes B \ \cdots \ a_J \otimes B]$, where $a_j \otimes B = (a_{j1}B, \ldots, a_{jI}B)$. If A and B have the same number of columns, J = L, the Khatri-Rao product is defined as $A \odot B = [a_1 \otimes b_1, a_2 \otimes b_2, \ldots, a_J \otimes b_J] \in \mathbb{R}^{IK \times J}$. If A and B are of the same dimension, the Hadamard product A * B is their element-wise matrix product, i.e., $(A * B)_{ij} = A_{ij} \cdot B_{ij}$. For a matrix $X = [x_1 \cdots x_n] \in \mathbb{R}^{m \times n}$, we also denote the vectorization $\mathrm{vec}(X) = (x_1^\top, \ldots, x_n^\top) \in \mathbb{R}^{1 \times mn}$ and the column-wise $\ell_2$ norms $\mathrm{Norm}(X) = (\|x_1\|_2, \ldots, \|x_n\|_2) \in \mathbb{R}^{1 \times n}$.

Finally, we focus on tensor notation and relevant operations; interested readers are referred to [2] for more details. Suppose $\mathcal{X} \in \mathbb{R}^{p_1 \times p_2 \times p_3}$ is an order-3 tensor. The (i,j,k)-th element of $\mathcal{X}$ is denoted by $[\mathcal{X}]_{ijk}$. The successive tensor multiplication with vectors $u \in \mathbb{R}^{p_2}$, $v \in \mathbb{R}^{p_3}$ is denoted by $\mathcal{X} \times_2 u \times_3 v = \sum_{j \in [p_2], l \in [p_3]} u_j v_l\, \mathcal{X}[:, j, l] \in \mathbb{R}^{p_1}$. We say $\mathcal{X} \in \mathbb{R}^{p_1 \times p_2 \times p_3}$ is rank-one if it can be written as the outer product of three vectors, i.e., $\mathcal{X} = x_1 \circ x_2 \circ x_3$, or $[\mathcal{X}]_{ijk} = x_{1i}x_{2j}x_{3k}$ for all i, j, k. Here "∘" represents the vector outer product. We say $\mathcal{X}$ is symmetric if $[\mathcal{X}]_{ijk} = [\mathcal{X}]_{ikj} = [\mathcal{X}]_{jik} = [\mathcal{X}]_{jki} = [\mathcal{X}]_{kij} = [\mathcal{X}]_{kji}$ for all i, j, k. Then $\mathcal{X}$ is rank-one and symmetric if and only if it can be decomposed as $\mathcal{X} = x \circ x \circ x$ for some vector x.

More generally, we may decompose a tensor as the sum of rank one tensors as follows,

$$ \mathcal{X} = \sum_{k=1}^K \eta_k\, x_{1k} \circ x_{2k} \circ x_{3k}, \tag{II.1} $$

where $\eta_k \in \mathbb{R}$, $x_{1k} \in \mathbb{S}^{p_1-1}$, $x_{2k} \in \mathbb{S}^{p_2-1}$, $x_{3k} \in \mathbb{S}^{p_3-1}$. This is the so-called CANDECOMP/PARAFAC, or CP, decomposition [2], with the CP-rank defined as the minimum number K such that (II.1) holds. Then $\{x_{1k}\}_{k=1}^K, \{x_{2k}\}_{k=1}^K, \{x_{3k}\}_{k=1}^K$ are called the factors along the first, second, and third modes. Note that the factors are normalized to unit vectors to guarantee the uniqueness of the decomposition, and $\eta = \{\eta_1, \ldots, \eta_K\}$ plays a role analogous to the singular values in the matrix singular value decomposition. Several tensor norms also need to be introduced. The tensor Frobenius norm and the tensor spectral norm are defined respectively as

$$ \|\mathcal{X}\|_F = \sqrt{\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{k=1}^{p_3} \mathcal{X}_{ijk}^2}, \qquad \|\mathcal{X}\|_{op} := \sup_{u \in \mathbb{R}^{p_1}, v \in \mathbb{R}^{p_2}, w \in \mathbb{R}^{p_3}} \frac{|\langle \mathcal{X}, u \circ v \circ w\rangle|}{\|u\|_2\|v\|_2\|w\|_2}, \tag{II.2} $$

where $\langle \mathcal{X}, \mathcal{Y}\rangle = \sum_{i,j,k}\mathcal{X}_{ijk}\mathcal{Y}_{ijk}$. Clearly, $\|\mathcal{X}\|_F^2 = \langle \mathcal{X}, \mathcal{X}\rangle$. We also consider the following sparse tensor spectral norm,

$$ \|\mathcal{X}\|_s := \sup_{\substack{\|a\|_2=\|b\|_2=\|c\|_2=1 \\ \max\{\|a\|_0,\|b\|_0,\|c\|_0\}\leq s}} |\langle \mathcal{X}, a \circ b \circ c\rangle|. \tag{II.3} $$

By definition, $\|\mathcal{X}\|_s \leq \|\mathcal{X}\|_{op}$. Suppose $\mathcal{X} = x_1 \circ x_2 \circ x_3$ and $\mathcal{Y} = y_1 \circ y_2 \circ y_3$ are two rank-one tensors. Then it is easy to check that $\|\mathcal{X}\|_F = \|x_1\|_2\|x_2\|_2\|x_3\|_2$ and $\langle \mathcal{X}, \mathcal{Y}\rangle = (x_1^\top y_1)(x_2^\top y_2)(x_3^\top y_3)$.
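The basic tensor operations above translate directly into base-R array manipulations. The short sketch below, with arbitrary illustrative dimensions, computes rank-one tensors, the Frobenius norm and inner product, and the mode product $\mathcal{X} \times_2 u \times_3 v$.

```r
# Base-R sketch of the order-3 tensor operations used in this section (illustrative only).
p1 <- 4; p2 <- 5; p3 <- 6
x1 <- rnorm(p1); x2 <- rnorm(p2); x3 <- rnorm(p3)

rank_one <- function(a, b, c) outer(outer(a, b), c)   # a o b o c, a p1 x p2 x p3 array

X <- rank_one(x1, x2, x3)
Y <- rank_one(rnorm(p1), rnorm(p2), rnorm(p3))

frob  <- sqrt(sum(X^2))                 # ||X||_F
inner <- sum(X * Y)                     # <X, Y>
# sanity check for the rank-one identity ||X||_F = ||x1|| ||x2|| ||x3||
all.equal(frob, sqrt(sum(x1^2)) * sqrt(sum(x2^2)) * sqrt(sum(x3^2)))

# X x_2 u x_3 v = sum_{j,l} u_j v_l X[, j, l]  (a p1-vector)
mode23_prod <- function(X, u, v) apply(X, 1, function(slice) as.numeric(t(u) %*% slice %*% v))
u <- rnorm(p2); v <- rnorm(p3)
mode23_prod(X, u, v)                    # equals (x2' u)(x3' v) * x1 for this rank-one X
```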

III. Symmetric Tensor Estimation Via Cubic Sketchings

In this section, we focus on the estimation of sparse and low-rank symmetric tensors,

$$ y_i = \langle \mathcal{J}^*, \mathcal{X}_i\rangle + \epsilon_i, \qquad \mathcal{X}_i = x_i \circ x_i \circ x_i \in \mathbb{R}^{p \times p \times p}, \quad i = 1, \ldots, n, \tag{III.1} $$

where xi are random vectors with i.i.d. standard normal entries. As previously discussed, the tensor parameter J* often satisfies certain low-dimensional structures in practice, among which the factor-wise sparsity and low-rankness [16] commonly appear. We thus assume J* is CP rank-K for Kp and the corresponding factors are sparse,

$$ \mathcal{J}^* = \sum_{k=1}^K \eta_k^*\, \beta_k^* \circ \beta_k^* \circ \beta_k^*, \quad \text{with } \|\beta_k^*\|_2 = 1,\ \|\beta_k^*\|_0 \leq s,\ k \in [K]. \tag{III.2} $$

The CP low-rankness has been widely assumed in literature for its nice scalability and simple formulation [5, 25, 18]. Different from the matrix factor analysis, we do not assume the tensor factors βk* here are orthogonal. On the other hand, since the low-rank tensor estimation is NP-hard in general [42], we will introduce an incoherence condition in the forthcoming Condition 3 to ensure that the correlation among different factors βk* is not too strong. Such a condition has been used in recent literature on tensor data analysis [43], compressed sensing [44], matrix decomposition [45], and dictionary learning [46].

Based on the observations $\{y_i, \mathcal{X}_i\}_{i=1}^n$, we propose to estimate $\mathcal{J}^*$ by minimizing the empirical squared loss, since its closed-form gradient provides computational convenience,

$$ \hat{\mathcal{J}} = \arg\min_{\mathcal{J}} \mathcal{L}(\mathcal{J}) \quad \text{subject to } \mathcal{J} \text{ is sparse and low-rank}, \tag{III.3} $$

where

$$ \mathcal{L}(\mathcal{J}) = \mathcal{L}(\eta_k, \beta_1, \ldots, \beta_K) = \frac{1}{n}\sum_{i=1}^n \big(y_i - \langle \mathcal{J}, \mathcal{X}_i\rangle\big)^2 = \frac{1}{n}\sum_{i=1}^n \Big(y_i - \sum_{k=1}^K \eta_k (x_i^\top\beta_k)^3\Big)^2. \tag{III.4} $$

Equivalently, (III.3) can be written as,

$$ \min_{\eta_k, \beta_k} \frac{1}{n}\sum_{i=1}^n \Big(y_i - \sum_{k=1}^K \eta_k (x_i^\top\beta_k)^3\Big)^2, \quad \text{s.t. } \|\beta_k\|_2 = 1,\ \|\beta_k\|_0 \leq s, \text{ for } k \in [K]. \tag{III.5} $$

Clearly, (III.5) is a non-convex optimization problem. To solve it, we propose a two-stage method as described in the next two subsections.

A. Initialization

Due to the non-convexity of (III.5), a straightforward implementation of many local search algorithms, such as gradient descent and alternating minimization, may easily get trapped in local optima and result in sub-optimal statistical performance. Inspired by recent advances in spectral methods (e.g., the EM algorithm [47], phase retrieval [48], and tensor SVD [39]), we propose to evaluate an initial estimate $\{\eta_k^{(0)}, \beta_k^{(0)}\}$ via the method of moments and sparse tensor decomposition (a variant of the high-order spectral method) in the following Steps 1 and 2, respectively. The pseudo-code is given in Algorithm 1.

Step 1: Unbiased Empirical Moment Estimator.

Construct the empirical moment-based estimator Ts,

$$ \mathcal{T}_s := \frac{1}{6}\Big[\frac{1}{n}\sum_{i=1}^n y_i\, x_i \circ x_i \circ x_i - \sum_{j=1}^p \big(m_1 \circ e_j \circ e_j + e_j \circ m_1 \circ e_j + e_j \circ e_j \circ m_1\big)\Big], \tag{III.6} $$

where $m_1 := \frac{1}{n}\sum_{i=1}^n y_i x_i$ and $e_j$ is the j-th canonical basis vector.

Based on Lemma 4, $\mathcal{T}_s$ is an unbiased estimator of $\mathcal{J}^*$. The construction of (III.6) is motivated by the high-order Stein's identity ([49]; also see Theorem 7 for a complete statement). Intuitively speaking, based on the third-order score function of a Gaussian random vector x, $S_3(x) = x \circ x \circ x - \sum_{j=1}^p (x \circ e_j \circ e_j + e_j \circ x \circ e_j + e_j \circ e_j \circ x)$, we can construct an unbiased estimator of $\mathcal{J}^*$ by properly choosing a continuously differentiable function in the high-order Stein's identity. See the proof of Lemma 4 for details.
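A minimal base-R sketch of the estimator (III.6) is given below; it reuses the simulated X, y, p, n from the earlier snippet, and those names are purely illustrative.

```r
# Sketch (base R) of the moment-based initializer T_s in (III.6).
m1 <- as.vector(t(X) %*% y) / n                    # m1 = (1/n) sum_i y_i x_i

Ts <- array(0, dim = c(p, p, p))
for (i in 1:n) {
  xi <- X[i, ]
  Ts <- Ts + y[i] * outer(outer(xi, xi), xi)       # (1/n) sum_i y_i x_i o x_i o x_i
}
Ts <- Ts / n
for (j in 1:p) {                                   # subtract the Stein-identity correction
  ej <- rep(0, p); ej[j] <- 1
  Ts <- Ts - outer(outer(m1, ej), ej) - outer(outer(ej, m1), ej) - outer(outer(ej, ej), m1)
}
Ts <- Ts / 6
# E(Ts) = J*, so for large n the entries of Ts - J should be small.
```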

Step 2: Sparse Tensor Decomposition.

Based on the method of moment estimator obtained in Step 1, we further obtain good initialization for the factors {ηk(0),βk(0)} via truncation and alternating rank-1 power iterations [27, 50],

$$ \mathcal{T}_s \approx \sum_{k=1}^K \eta_k^{(0)}\, \beta_k^{(0)} \circ \beta_k^{(0)} \circ \beta_k^{(0)}. $$

Note that the tensor power iteration recovers one rank-1 component at a time. To identify all rank-1 components, we generate a large number of different initialization vectors, implement a clustering step, and take the centroids as the estimates in the initialization stage. This scheme originally appeared in the tensor decomposition literature [43, 50], although our problem setting and proof techniques are very different. This procedure is also very different from the matrix setting: the rank-1 components in the matrix singular value decomposition are mutually orthogonal, but we do not enforce exact orthogonality for $\mathcal{J}^*$ here.

More specifically, we first choose a large integer MK and generate M starting vectors {bm(0)}m=1Mp through sparse SVD as described in Algorithm 3. Then for each bm(0), we apply the following truncated power updates for l = 0,…

$$ \tilde{b}_m^{(l+1)} = \frac{\mathcal{T}_s \times_2 b_m^{(l)} \times_3 b_m^{(l)}}{\big\|\mathcal{T}_s \times_2 b_m^{(l)} \times_3 b_m^{(l)}\big\|_2}, \qquad b_m^{(l+1)} = \frac{T_d(\tilde{b}_m^{(l+1)})}{\big\|T_d(\tilde{b}_m^{(l+1)})\big\|_2}, $$

where ×2, ×3 are tensor multiplication operators defined in Section II and Td(x) is a truncation operator that sets all but the largest d entries in absolute values to zero for any vector x. It is noteworthy that the symmetry of Ts implies

$$ \mathcal{T}_s \times_2 b_m^{(l)} \times_3 b_m^{(l)} = \mathcal{T}_s \times_1 b_m^{(l)} \times_3 b_m^{(l)} = \mathcal{T}_s \times_1 b_m^{(l)} \times_2 b_m^{(l)}. $$

This means the multiplications along different modes coincide. We run the power iterations until convergence and denote the outcome by $b_m$. Finally, we apply K-means to partition $\{b_m\}_{m=1}^M$ into K clusters, let the centroids of the output clusters be $\{\beta_k^{(0)}\}_{k=1}^K$, and calculate $\eta_k^{(0)} = \mathcal{T}_s \times_1 \beta_k^{(0)} \times_2 \beta_k^{(0)} \times_3 \beta_k^{(0)}$ for $k \in [K]$.
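The following base-R sketch illustrates Step 2 in a simplified form. It reuses Ts, p, K from the previous snippets; the number of restarts M, the truncation level d, and the tolerance are illustrative, and random unit starting vectors are used here in place of the sparse-SVD starts of Algorithm 3.

```r
# Simplified sketch of truncated rank-one power iterations plus k-means clustering.
truncate_top_d <- function(x, d) {                  # keep the d largest |entries|
  keep <- order(abs(x), decreasing = TRUE)[seq_len(d)]
  out <- rep(0, length(x)); out[keep] <- x[keep]; out
}
mult23 <- function(Ts, b) apply(Ts, 1, function(S) as.numeric(t(b) %*% S %*% b))  # T_s x_2 b x_3 b

M <- 20; d <- 8
b_final <- matrix(0, p, M)
for (m in 1:M) {
  b <- rnorm(p); b <- b / sqrt(sum(b^2))            # random start (Algorithm 3 uses sparse SVD)
  for (l in 1:50) {
    bt <- mult23(Ts, b); bt <- bt / sqrt(sum(bt^2))
    b_new <- truncate_top_d(bt, d); b_new <- b_new / sqrt(sum(b_new^2))
    if (sqrt(sum((b_new - b)^2)) < 1e-6) { b <- b_new; break }
    b <- b_new
  }
  b_final[, m] <- b
}
# cluster the M outcomes into K groups (sign ambiguity across restarts is ignored here)
cl <- kmeans(t(b_final), centers = K)
B0 <- t(cl$centers)
B0 <- sweep(B0, 2, sqrt(colSums(B0^2)), "/")        # columns beta_k^(0)
eta0 <- sapply(1:K, function(k) sum(Ts * outer(outer(B0[, k], B0[, k]), B0[, k])))
```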

[Algorithm 1: pseudo-code figure]

B. Thresholded Gradient Descent

After obtaining a warm start in the first stage, we propose to apply thresholded gradient descent to iteratively refine the solution to the non-convex optimization problem (III.5). Specifically, denote $X = (x_1, \ldots, x_n) \in \mathbb{R}^{p \times n}$, $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$, $\eta = (\eta_1, \ldots, \eta_K)^\top \in \mathbb{R}^K$, and $B = (\beta_1, \ldots, \beta_K) \in \mathbb{R}^{p \times K}$. Since $\mathcal{L}(B, \eta) = \mathcal{L}(\mathcal{J})$, we let $\nabla_B\mathcal{L}(B, \eta) = (\nabla_{\beta_1}\mathcal{L}(B, \eta), \ldots, \nabla_{\beta_K}\mathcal{L}(B, \eta)) \in \mathbb{R}^{1 \times pK}$ be the gradient function with respect to B. Based on the detailed calculation in Lemma A.1, $\nabla_B\mathcal{L}(B, \eta)$ can be written as

$$ \nabla_B\mathcal{L}(B, \eta) = \frac{6}{n}\Big[\{(B^\top X)\}^{3\,\top}\eta - y\Big]^\top \Big[\big(\mathrm{diag}(\eta)\{(B^\top X)\}^{2}\big) \odot X\Big]^\top, \tag{III.7} $$

where $\{(B^\top X)\}^{3}$ and $\{(B^\top X)\}^{2}$ denote the entry-wise cubic and squared matrices of $B^\top X$. Define $\varphi_h(x)$ as a thresholding function with level h that satisfies the following minimal assumptions:

$$ |\varphi_h(x) - x| \leq h \ \ \forall x \in \mathbb{R}, \qquad \varphi_h(x) = 0 \ \text{ when } |x| \leq h. \tag{III.8} $$

Many widely used thresholding schemes, such as hard thresholding $H_h(x) = x\,\mathbb{I}(|x| > h)$ and soft thresholding $S_h(x) = \mathrm{sign}(x)\max(|x| - h, 0)$, satisfy (III.8). With a slight abuse of notation, we further define the vector thresholding function as $\varphi_h(x) = (\varphi_h(x_1), \ldots, \varphi_h(x_p))$ for $x \in \mathbb{R}^p$.
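In code, both schemes are one-liners; the following base-R definitions are used again in later sketches.

```r
# Hard and soft thresholding; both satisfy the two requirements in (III.8):
# |phi_h(x) - x| <= h and phi_h(x) = 0 whenever |x| <= h.
hard_threshold <- function(x, h) x * (abs(x) > h)
soft_threshold <- function(x, h) sign(x) * pmax(abs(x) - h, 0)

# applied entrywise to vectors, matching the vector version phi_h(x)
hard_threshold(c(-2, -0.5, 0.3, 1.4), h = 1)   #  -2  0  0  1.4
soft_threshold(c(-2, -0.5, 0.3, 1.4), h = 1)   #  -1  0  0  0.4
```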

The initial estimates η(0) and B(0) will be updated by thresholded gradient descent in two steps summarized in Algorithm 2. It is noteworthy that only B is updated in Step 3, while η will be updated in Step 4 after finishing the update of B.

Step 3: Updating B via Thresholded Gradient descent.

We update B(t) via thresholded gradient descent,

$$ \mathrm{vec}(B^{(t+1)}) = \varphi_{\mu h(B^{(t)})/\phi}\Big(\mathrm{vec}(B^{(t)}) - \frac{\mu}{\phi}\nabla_B\mathcal{L}(B^{(t)}, \eta^{(0)})\Big). \tag{III.9} $$

Here,

  • $\mu$ is the step size and $\phi = \sum_{i=1}^n y_i^2/n$ serves as an approximation of $\big(\sum_{k=1}^K \eta_k^*\big)^2$ (see Lemma 15);

  • $h(B) \in \mathbb{R}^{1 \times K}$ is the thresholding level, defined as
    $$ h(B) = \frac{4\sqrt{\log(np)}}{n}\sqrt{2\,\Big[\big\{\{(B^\top X)\}^{3\,\top}\eta^{(0)} - y\big\}^{2}\Big]^\top \Big[\{(B^\top X)\}^{2\,\top}\mathrm{diag}(\eta^{(0)})\Big]^{2}}. $$

Step 4: Updating η via Normalization.

We normalize each column of B(T) and estimate the weight parameter as

$$ \hat{B} = (\hat{\beta}_1, \ldots, \hat{\beta}_K) = \Big(\frac{\beta_1^{(T)}}{\|\beta_1^{(T)}\|_2}, \ldots, \frac{\beta_K^{(T)}}{\|\beta_K^{(T)}\|_2}\Big), \qquad \hat{\eta} = (\hat{\eta}_1, \ldots, \hat{\eta}_K) = \Big(\eta_1^{(0)}\|\beta_1^{(T)}\|_2^3, \ldots, \eta_K^{(0)}\|\beta_K^{(T)}\|_2^3\Big). \tag{III.10} $$

The final estimator for J* is

$$ \hat{\mathcal{J}} = \sum_{k=1}^K \hat{\eta}_k\, \hat{\beta}_k \circ \hat{\beta}_k \circ \hat{\beta}_k. $$

Remark 1 (Stochastic Thresholded Gradient Descent). The evaluation of the gradient (III.7) requires $O(npK^2)$ operations per iteration and can be computationally intensive for large n or p. To economize the computational cost, a stochastic version of the thresholded gradient descent algorithm can easily be carried out by sampling a subset of the summand functions in (III.7) at each iteration. This accelerates the procedure, especially in large-scale settings. See Section P2 for details.
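The sketch below illustrates Steps 3-4 and the mini-batch idea of Remark 1 in base R, writing the gradient factor-by-factor rather than in the compact matrix form (III.7). It reuses B0, eta0, X, y, n and hard_threshold from the earlier snippets; the step size, threshold level, and batch size are arbitrary illustrative choices rather than the data-driven quantities in (III.9).

```r
# One pass of (stochastic) thresholded gradient descent, simplified for illustration.
grad_B <- function(B, eta, X, y) {
  r <- as.vector((X %*% B)^3 %*% eta) - y                  # residuals, length n
  G <- matrix(0, nrow(B), ncol(B))
  for (k in seq_len(ncol(B)))                              # (6/n) sum_i r_i eta_k (x_i' b_k)^2 x_i
    G[, k] <- (6 / length(y)) * crossprod(X, r * eta[k] * (X %*% B[, k])^2)
  G
}

phi <- sum(y^2) / n                                        # scaling used in (III.9)
mu  <- 1e-3; h <- 1e-3                                     # illustrative step size / threshold
B_t <- B0
for (t in 1:50) {
  batch <- sample(n, 200)                                  # stochastic variant: subsample terms
  G <- grad_B(B_t, eta0, X[batch, , drop = FALSE], y[batch])
  B_new <- hard_threshold(B_t - (mu / phi) * G, mu * h / phi)
  if (sqrt(sum((B_new - B_t)^2)) < 1e-6) { B_t <- B_new; break }
  B_t <- B_new
}
# Step 4: normalize columns and rescale the weights as in (III.10)
nrm     <- sqrt(colSums(B_t^2))
B_hat   <- sweep(B_t, 2, nrm, "/")
eta_hat <- eta0 * nrm^3
```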

[Algorithms 2 and 3: pseudo-code figures]

IV. Theoretical Analysis

In this section, we establish the geometric convergence rate in optimization error and minimax optimal rate in statistical error of the proposed symmetric tensor estimator.

A. Assumptions

We first introduce the assumptions for theoretical analysis. Conditions 1–3 are on the true tensor parameter J* and Conditions 4–5 are on the measurement scheme. Specifically, the first condition ensures the model identifiability for CP-decomposition.

Condition 1 (Uniqueness of CP-decomposition). The CP-decomposition in (III.2) is unique in the sense that if there exists another CP-decomposition $\mathcal{J}^* = \sum_{k=1}^{K'} \eta_k^{*\prime}\,\beta_k^{*\prime} \circ \beta_k^{*\prime} \circ \beta_k^{*\prime}$, then $K = K'$ and the two decompositions coincide up to a permutation of $\{1, \ldots, K\}$.

For technical purposes, we introduce the following conditions to regularize the CP-decomposition of J*. Similar assumptions were imposed in recent tensor literature, e.g., [3, 27] and Assumption 1.1 (A4) [51].

Condition 2 (Parameter space). The CP-decomposition J*=k=1Kηk*βk*βk*βk* satisfies

$$ \|\mathcal{J}^*\|_{op} \leq C\,\eta_{\max}^*, \qquad K = O(s), \qquad R = \eta_{\max}^*/\eta_{\min}^* \leq C' \tag{IV.1} $$

for some absolute constants C,C′, where ηmin*=minkηk* and ηmax*=maxkηk*. Recall that s is the sparsity of βk*.

Remark 2. In Condition 2, R plays a similar role as a “condition number.” This assumption means that the tensor is “well-conditioned,” i.e., each rank-1 component is roughly of the same size.

As shown in the seminal work of [42], the estimation of low-rank tensors can be NP-hard in general. Hence, we impose the following incoherence condition.

Condition 3 (Parameter incoherence). The true tensor components are incoherent such that

$$ \Gamma := \max_{1 \leq k_1 \neq k_2 \leq K} |\langle \beta_{k_1}^*, \beta_{k_2}^*\rangle| \leq \min\big\{C'' K^{-3/4} R^{-1},\ s^{-1/2}\big\}, $$

where R is the singular value ratio defined in (IV.1) and C″ is some small constant.

Remark 3. The preceding incoherence condition has been widely used in different scenarios in recent high-dimensional research, such as tensor decomposition [27, 50], compressed sensing [44], matrix decomposition [45], and dictionary learning [46]. It can also be viewed as a relaxation of orthogonality: if $\{\beta_1^*, \ldots, \beta_K^*\}$ are mutually orthogonal, then Γ equals zero. We show through both theory (Lemma 28 in the supplementary materials) and simulation (Section VII) that the low-rank tensor $\mathcal{J}^*$ in (III.2) satisfies the incoherence condition with high probability if the component vectors $\beta_k^*$ are randomly generated, say from a Gaussian distribution.
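A quick base-R check of this remark is easy to run; the dimensions, sparsity, and number of replications below are illustrative.

```r
# Empirical incoherence Gamma = max_{k1 != k2} |<beta_k1, beta_k2>| for random sparse unit factors.
set.seed(2)
incoherence <- function(p, s, K) {
  B <- matrix(0, p, K)
  for (k in 1:K) {
    supp <- sample(p, s)
    B[supp, k] <- rnorm(s)
    B[, k] <- B[, k] / sqrt(sum(B[, k]^2))
  }
  G <- abs(crossprod(B))        # K x K matrix of |<beta_k1, beta_k2>|
  max(G[upper.tri(G)])
}
mean(replicate(200, incoherence(p = 200, s = 60, K = 3)))   # compare with the bound in Condition 3
```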

We also introduce the following conditions on noise distribution.

Condition 4 (Sub-exponential noise). The noise terms $\{\epsilon_i\}_{i=1}^n$ are i.i.d. with mean 0 and variance $\sigma^2$ satisfying $0 < \sigma < C\sum_{k=1}^K \eta_k^*$; $\epsilon_i/\sigma$ is sub-exponentially distributed, i.e., there exists a constant $C_\epsilon > 0$ such that $\|\epsilon_i/\sigma\|_{\psi_1} := \sup_{p \geq 1} p^{-1}(\mathbb{E}|\epsilon_i/\sigma|^p)^{1/p} \leq C_\epsilon$; and $\epsilon_i$ is independent of $\{\mathcal{X}_i\}_{i=1}^n$.

The sample complexity condition is crucial for our algorithm especially in the initialization stage. Ignoring any polylog factors, Condition 5 is even weaker than the sparse matrix estimation case (ns2) in [48].

Condition 5 (Sample complexity).

$$ n \geq C K^2 \big(s\log(ep/s)\big)^{3/2}\log^4 n. $$

B. Main Theoretical Results

Our main Theorem 1 shows that based on a proper initializer, the output of the proposed procedure can achieve optimal estimation error rate after a sufficient number of iterations. Here, we define the contraction parameter

$$ 0 < \kappa = 1 - \tfrac{3}{2}\,\mu K^{-2} R^{-8/3} < 1 $$

and also denote $E_1 = 4K(\eta_{\max}^*)^{2/3}\varepsilon_0^2$ and $E_2 = C_0(\eta_{\min}^*)^{-4/3}/16$ for some $C_0 > 0$.

Theorem 1 (Statistical and Optimization Errors). Suppose Conditions 3-5 hold, $|\mathrm{supp}(\beta_k^{(0)})| \leq s$, and the initial estimators $\{\beta_k^{(0)}, \eta_k^{(0)}\}_{k=1}^K$ satisfy

$$ \max_{1 \leq k \leq K}\big\{\|\beta_k^{(0)} - \beta_k^*\|_2,\ |\eta_k^{(0)} - \eta_k^*|\big\} \leq K^{-1} \tag{IV.2} $$

with probability at least 1O(1/n). Assume the step size μμ0, where μ0 is defined in (A.14). Then, the output of the thresholded gradient descent update in (III.9) satisfies:

  • For any t = 0,1,2,…, the factor-wise estimator satisfies
    $$ \sum_{k=1}^K \big\|(\eta_k^{(0)})^{1/3}\beta_k^{(t+1)} - (\eta_k^*)^{1/3}\beta_k^*\big\|_2^2 \leq E_1\kappa^t + E_2\,\frac{\sigma^2 s\log p}{n} \tag{IV.3} $$
    with probability at least 1O(tKs/n).
  • When the total number of iterations is no smaller than
    $$ T^* = \Big(\log\frac{n}{\sigma^2 s\log p} + \log\frac{E_1}{E_2}\Big)\Big/\log\kappa^{-1}, \tag{IV.4} $$
    there exists a constant $C_1$ (independent of $K, s, p, n, \sigma^2$) such that the final estimator $\hat{\mathcal{J}} = \sum_{k=1}^K \eta_k^{(0)}\,\beta_k^{(T^*)} \circ \beta_k^{(T^*)} \circ \beta_k^{(T^*)}$ satisfies
    $$ \|\hat{\mathcal{J}} - \mathcal{J}^*\|_F^2 \leq \frac{C_1\sigma^2 K\, s\log p}{n} \tag{IV.5} $$
    with probability at least 1O(T*Ks/n).

Remark 4. The error bound (IV.3) decomposes into an optimization error $E_1\kappa^t$, which decays at a geometric rate over the iterations, and a statistical error $E_2\sigma^2 s\log p/n$, which does not decay with the iterations. In the special case $\sigma = 0$, $\hat{\mathcal{J}}$ exactly recovers $\mathcal{J}^*$ with high probability.

The next theorem shows that Steps 1 and 2 of Algorithm 1 provides a good initializer required in Theorem 1.

Theorem 2 (Initialization Error). Recall $\Gamma = \max_{1 \leq k_1 \neq k_2 \leq K}|\langle\beta_{k_1}^*, \beta_{k_2}^*\rangle|$. Suppose the number of initializations satisfies $L \geq K C_3 \gamma^{-4}$, where γ is a constant defined in (A.11). Given that Conditions 1-4 hold, the initial estimator obtained from Steps 1-2 with a truncation level $s \leq d \leq Cs$ satisfies

$$ \max_{1 \leq k \leq K}\big\{\|\beta_k^{(0)} - \beta_k^*\|_2,\ |\eta_k^{(0)} - \eta_k^*|\big\} \leq C_2 K R\,\delta_{n,p,s} + K\Gamma^2 \tag{IV.6} $$

and

$$ |\mathrm{supp}(\beta_k^{(0)})| \leq s $$

with probability at least 1 − 5/n, where

$$ \delta_{n,p,s} = (\log n)^3\Bigg(\sqrt{\frac{s^3\log^3(ep/s)}{n^2}} + \sqrt{\frac{s\log(ep/s)}{n}}\Bigg). \tag{IV.7} $$

Moreover, if the sample complexity condition 5 holds, then the above bound satisfies (IV.2).

Remark 5 (Interpretation of initialization error). The upper bound of (IV.6) consists of two terms that correspond to the approximation error of Ts to J* and the incoherence among βk*’s, respectively. Especially, the former converges to zero as n grows while the latter does not.

The proofs of Theorems 1 and 2 are postponed to Sections C and D in the supplementary materials. The combination of Theorems 1 and 2 immediately yields the following upper bound for the final estimator, which is a main result of this paper.

Theorem 3 (Upper Bound). Suppose Conditions 1-5 hold and $s \leq d \leq Cs$. After $T^*$ iterations, there exists a constant $C_1$ not depending on $K, s, p, n, \sigma^2$ such that the proposed procedure yields

$$ \|\hat{\mathcal{J}} - \mathcal{J}^*\|_F^2 \leq \frac{C_1\sigma^2 K\, s\log p}{n} \tag{IV.8} $$

with probability at least 1O(T*Ks/n), where T* is defined in (IV.4).

The above upper bound turns out to match the minimax lower bound for a large class of sparse and low-rank tensors.

Theorem 4 (Lower Bound). Consider the following class of sparse and low-rank tensors,

$$ \mathcal{F}_{p,K,s} = \Big\{\mathcal{J} = \sum_{k=1}^K \eta_k\,\beta_k \circ \beta_k \circ \beta_k :\ \|\beta_k\|_0 \leq s \text{ for } k \in [K],\ \mathcal{J} \text{ satisfies Conditions 1-3}\Big\}. \tag{IV.9} $$

Suppose that {Xi}i=1n are i.i.d standard normal cubic sketchings with i.i.d. N(0,σ2) noise in (III.1), p ≥ 20s, and s ≥ 4. We have the following lower bound result,

$$ \inf_{\tilde{\mathcal{J}}}\ \sup_{\mathcal{J}^* \in \mathcal{F}_{p,K,s}} \mathbb{E}\|\tilde{\mathcal{J}} - \mathcal{J}^*\|_F^2 \geq \frac{c\sigma^2 K\, s\log(ep/s)}{n}. $$

The proof of Theorem 4 is deferred to Section E in the supplementary materials. Combining Theorems 3 and 4, we immediately obtain the following minimax-optimal rate for sparse and low-rank tensor estimation with cubic sketchings when logp ≍ log(p/s):

$$ \inf_{\tilde{\mathcal{J}}}\ \sup_{\mathcal{J}^* \in \mathcal{F}_{p,K,s}} \mathbb{E}\|\tilde{\mathcal{J}} - \mathcal{J}^*\|_F^2 \asymp \frac{\sigma^2 K\, s\log(ep/s)}{n}. \tag{IV.10} $$

The rate in (IV.10) sheds light upon the effect of dimension p, noise level σ2, sparsity s, sample size n and rank K to the estimation performance.

Remark 6. Recently, Li, Haupt, and Woodruff [29] studied optimal sketching for low-rank tensor regression and gave a near-optimal sketching complexity with a sharp $(1+\varepsilon)$ worst-case error bound. Different from the framework of [29], which focuses on a deterministic setting, we study a probabilistic model with random observation noise, propose a new algorithm, and establish the minimax optimal rate of the estimation error. In addition, [5, 16, 17] considered different types of convex/non-convex algorithms for low-rank tensor regression under statistical assumptions. To the best of our knowledge, we are the first to achieve an optimal estimation error rate with a polynomial-time algorithm for the tensor regression problem.

Remark 7 (Non-sparse low-rank tensor estimation via cubic-sketchings). When the low-rank tensor J* is not necessarily sparse, i.e.,

$$ \mathcal{J}^* \in \mathcal{F}_{p,K} = \Big\{\mathcal{J} : \mathcal{J} = \sum_{k=1}^K \eta_k\,\beta_k \circ \beta_k \circ \beta_k,\ \mathcal{J} \text{ satisfies Conditions 1-3}\Big\}, $$

we can apply the proposed procedure with all the truncation/thresholding steps removed. If nO(p3/2), we can use similar arguments of Theorems 1–3 to show that the estimator J^ satisfies

$$ \|\hat{\mathcal{J}} - \mathcal{J}^*\|_F^2 \lesssim \frac{\sigma^2 K p}{n} \tag{IV.11} $$

for any J*Fp,K with high probability. Furthermore, similar arguments of Theorem 4 imply that the rate in (IV.11) is minimax optimal.

Remark 8 (Comparison with existing matrix results). Our cubic sketching tensor results are far more than extensions of existing matrix ones. For example, [32, 33] studied low-rank matrix recovery via rank-one projections, $y_i = x_i^\top M^* x_i + \epsilon_i$ for an unknown low-rank matrix $M^*$, and proposed convex nuclear norm minimization methods. The theoretical properties of their estimates are analyzed under an $\ell_1/\ell_2$-RIP or a Restricted Uniform Boundedness (RUB) condition. However, the tensor nuclear norm is computationally infeasible, and, following the arguments in [48, 52], one can check that our cubic sketching framework does not satisfy RIP or RUB conditions in general. Thus, these previous results cannot be directly applied.

In addition, the analysis of gradient updates for the tensor case is significantly more complicated than the matrix case. First, it requires high-order concentration inequalities for the tensor case since the cubic-sketching tensor leads to high-order products of sub-Gaussian random variables (see Section IV-C for details). The necessity of high-order expansions in the analysis of gradient updates for the tensor case also significantly increases the hardness of the problem. To ensure the geometric convergence, we need much more subtle analysis comparing to the ones in the matrix case [52].

C. Key Lemmas: High-order Concentration Inequalities

As mentioned earlier, one major challenge for theoretical analysis of cubic sketching is to handle heavy tails of high-order Gaussian moments. One can only handle up-to second moments of sub-Gaussian random variables by directly applying the Hoeffding’s or Bernstein’s concentration inequalities. Therefore, we need to develop the following high-order concentration inequalities as technical tools: Lemma 1 characterizes the tail bounds for the sum of sub-Gaussian products, and Lemma 2 provides the concentration inequalities for Gaussian cubic sketchings. The proofs of Lemmas 1 and 2 are given in Section B.

Lemma 1 (Concentration inequality for sums of sub-Gaussian products). Suppose $\mathcal{X}_i = (x_{1i}, \ldots, x_{mi})^\top \in \mathbb{R}^{m \times p}$, $i \in [n]$, are n i.i.d. random matrices, where $x_{ji}$, the j-th row of $\mathcal{X}_i$, is an isotropic sub-Gaussian vector, i.e., $\mathbb{E}x_{ji} = 0$ and $\mathrm{Cov}(x_{ji}) = I$. Then for any vectors $a = (a_1, \ldots, a_n)^\top \in \mathbb{R}^n$, $\{\beta_j\}_{j=1}^m \subseteq \mathbb{R}^p$, and $0 < \delta < 1$, we have

$$ \Big|\sum_{i=1}^n a_i \prod_{j=1}^m (x_{ji}^\top\beta_j) - \mathbb{E}\Big(\sum_{i=1}^n a_i \prod_{j=1}^m (x_{ji}^\top\beta_j)\Big)\Big| \leq C\prod_{j=1}^m \|\beta_j\|_2\Big(\|a\|_\infty(\log\delta^{-1})^{m/2} + \|a\|_2(\log\delta^{-1})^{1/2}\Big) $$

with probability at least 1 – δ for some constant C.

Note that in Lemma 1, each Xi does not necessarily have independent entries, even though {Xi}i=1n are independent matrices. Building on Lemma 1, Lemma 2 provides a generic spectral-type concentration inequality that can be used to quantify the approximation error of Ts introduced in Step 1 of the proposed procedure.
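Before turning to Lemma 2, a small Monte Carlo sketch (base R, Gaussian rows, illustrative sizes and weights) gives a feel for the quantity bounded in Lemma 1.

```r
# Centered sums sum_i a_i prod_j (x_ji' beta_j) for independent Gaussian rows, order m = 3.
set.seed(3)
n <- 500; p <- 50; m <- 3
a    <- rnorm(n) / sqrt(n)                                    # ||a||_2 is about 1
beta <- replicate(m, { b <- rnorm(p); b / sqrt(sum(b^2)) })   # p x m, unit columns

one_draw <- function() {
  # prod_j (x_ji' beta_j) for each i, with fresh independent Gaussian rows
  prods <- sapply(1:m, function(j) matrix(rnorm(n * p), n, p) %*% beta[, j])
  sum(a * apply(prods, 1, prod))                              # mean zero, so already centered
}
S <- replicate(200, one_draw())
c(mean = mean(S), sd = sd(S))   # fluctuations stay of order ||a||_2, as Lemma 1 predicts
```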

Lemma 2 (Concentration inequality for Gaussian cubic sketchings). Suppose $\{x_{1i}\}_{i=1}^n \overset{iid}{\sim} N(0, I_{p_1})$, $\{x_{2i}\}_{i=1}^n \overset{iid}{\sim} N(0, I_{p_2})$, $\{x_{3i}\}_{i=1}^n \overset{iid}{\sim} N(0, I_{p_3})$, and $\beta_1 \in \mathbb{R}^{p_1}$, $\beta_2 \in \mathbb{R}^{p_2}$, $\beta_3 \in \mathbb{R}^{p_3}$ are fixed vectors.

  • Define $\mathcal{M}_n^{nsy} = \frac{1}{n}\sum_{i=1}^n \langle x_{1i} \circ x_{2i} \circ x_{3i},\ \beta_1 \circ \beta_2 \circ \beta_3\rangle\, x_{1i} \circ x_{2i} \circ x_{3i}$. Then $\mathbb{E}(\mathcal{M}_n^{nsy}) = \beta_1 \circ \beta_2 \circ \beta_3$ and
    $$ \|\mathcal{M}_n^{nsy} - \mathbb{E}(\mathcal{M}_n^{nsy})\|_s \leq C(\log n)^3\Bigg(\sqrt{\frac{s^3\log^3(ep/s)}{n^2}} + \sqrt{\frac{s\log(ep/s)}{n}}\Bigg)\|\beta_1\|_2\|\beta_2\|_2\|\beta_3\|_2, $$
    with probability at least $1 - 10/n^3 - 1/p$.
  • Define $\mathcal{M}^{sym} = \frac{1}{n}\sum_{i=1}^n \langle x_{1i} \circ x_{1i} \circ x_{1i},\ \beta_1 \circ \beta_1 \circ \beta_1\rangle\, x_{1i} \circ x_{1i} \circ x_{1i}$. Then $\mathbb{E}(\mathcal{M}^{sym}) = 6\,\beta_1 \circ \beta_1 \circ \beta_1 + 3\sum_{m=1}^p (\beta_1 \circ e_m \circ e_m + e_m \circ \beta_1 \circ e_m + e_m \circ e_m \circ \beta_1)$ and
    $$ \|\mathcal{M}^{sym} - \mathbb{E}(\mathcal{M}^{sym})\|_s \leq C(\log n)^3\Bigg(\sqrt{\frac{s^3\log^3(ep/s)}{n^2}} + \sqrt{\frac{s\log(ep/s)}{n}}\Bigg)\|\beta_1\|_2^3, $$
    with probability at least $1 - 10/n^3 - 1/p$.

Here, C is an absolute constant and ‖ · ‖s is the sparse tensor spectral norm defined in (II.3).

V. Application To High-Order Interaction Effect Models

In this section, we study the high-order interaction effect model in the cubic sketching framework. Specifically, we consider the following three-way interaction model

$$ y_l = \xi_0 + \sum_{i=1}^p \xi_i z_{li} + \sum_{i,j=1}^p \gamma_{ij} z_{li}z_{lj} + \sum_{i,j,k=1}^p \eta_{ijk} z_{li}z_{lj}z_{lk} + \epsilon_l, \tag{V.1} $$

for l = 1,…,n. Here ξ, γ, and η are coefficients for the main effect, pairwise interaction, and triple-wise interaction, respectively. More importantly, (V.1) can be reformulated into the following tensor form (also see the left panel of Figure 1)

$$ y_l = \langle \mathcal{B}, x_l \circ x_l \circ x_l\rangle + \epsilon_l, \quad l = 1, \ldots, n, \tag{V.2} $$

where $x_l = (1, z_l^\top)^\top \in \mathbb{R}^{p+1}$ and $\mathcal{B} \in \mathbb{R}^{(p+1)\times(p+1)\times(p+1)}$ is a tensor parameter related to the coefficients in the following way:

$$ \begin{cases} \mathcal{B}[0,0,0] = \xi_0, \\ \mathcal{B}[1{:}p, 1{:}p, 1{:}p] = (\eta_{ijk})_{1 \leq i,j,k \leq p}, \\ \mathcal{B}[0, 1{:}p, 1{:}p] = \mathcal{B}[1{:}p, 0, 1{:}p] = \mathcal{B}[1{:}p, 1{:}p, 0] = (\gamma_{ij}/3)_{1 \leq i,j \leq p}, \\ \mathcal{B}[0, 0, 1{:}p] = \mathcal{B}[0, 1{:}p, 0] = \mathcal{B}[1{:}p, 0, 0] = (\xi_i/3)_{1 \leq i \leq p}. \end{cases} \tag{V.3} $$
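The mapping (V.3) can be checked numerically. The base-R sketch below builds such a tensor from illustrative coefficients (the 0-based indices of (V.3) map to index 1 here) and verifies that $\langle\mathcal{B}, x \circ x \circ x\rangle$ reproduces the interaction model.

```r
# Verify the reformulation (V.1) -> (V.2) on simulated coefficients.
set.seed(4)
p <- 5
xi0 <- rnorm(1); xi <- rnorm(p)
gam <- crossprod(matrix(rnorm(p * p), p)) / p          # symmetric pairwise coefficients
eta3 <- array(rnorm(p^3), dim = c(p, p, p))
eta3 <- (eta3 + aperm(eta3, c(1, 3, 2)) + aperm(eta3, c(2, 1, 3)) +
         aperm(eta3, c(2, 3, 1)) + aperm(eta3, c(3, 1, 2)) + aperm(eta3, c(3, 2, 1))) / 6

B <- array(0, dim = c(p + 1, p + 1, p + 1))
B[1, 1, 1] <- xi0
B[2:(p + 1), 2:(p + 1), 2:(p + 1)] <- eta3
B[1, 2:(p + 1), 2:(p + 1)] <- B[2:(p + 1), 1, 2:(p + 1)] <- B[2:(p + 1), 2:(p + 1), 1] <- gam / 3
B[1, 1, 2:(p + 1)] <- B[1, 2:(p + 1), 1] <- B[2:(p + 1), 1, 1] <- xi / 3

z <- rnorm(p); x <- c(1, z)
lhs <- sum(B * outer(outer(x, x), x))                  # <B, x o x o x>
rhs <- xi0 + sum(xi * z) + sum(gam * outer(z, z)) + sum(eta3 * outer(outer(z, z), z))
all.equal(lhs, rhs)                                    # TRUE: the two formulations agree
```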

We provide the following justification for assuming the tensorized coefficient B is low-rank and sparse. First, in modern applications, such as the biomedical research [53], the response is often driven by a small portion of coefficients and a small number of factors, leading to a highly entry-wise sparse and low-rank B. Second, [54] suggested that it is suitable to model entry-wise sparse and low-enough rank tensors as arising from sparse loadings. Therefore, we assume B is CP rank-K with s-sparse factors:

$$ \mathcal{B} = \sum_{k=1}^K \eta_k\,\beta_k \circ \beta_k \circ \beta_k, \qquad \|\beta_k\|_0 \leq s, $$

where K,sp. Then the number of parameters in (V.4), K(p + 1), is significantly smaller than (p + 1)3, the total number of parameters in the original three-way interaction effect model (V.1), which makes the consistent estimation of B possible in the high-dimensional case. In this case, (V.2) can be written as

$$ y_l = \Big\langle\sum_{k=1}^K \eta_k\,\beta_k \circ \beta_k \circ \beta_k,\ x_l \circ x_l \circ x_l\Big\rangle + \epsilon_l, \tag{V.4} $$

where l ∈ [n], ‖βk2 = 1, ‖βk0s,k ∈ [K].

By assuming $z_l \overset{iid}{\sim} N_p(0, I_p)$, the high-order interaction effect model (V.2) reduces to the symmetric tensor estimation model (III.1), except for one slight difference: the first coordinate of $x_l$, i.e., the intercept, is always 1. To accommodate this difference, we only need to adjust the initial unbiased estimate in the above two-step procedure. Let

$$ \mathcal{T}_s = \frac{1}{6n}\sum_{l=1}^n y_l\, x_l \circ x_l \circ x_l - \frac{1}{6}\sum_{j=1}^p \big(a \circ e_j \circ e_j + e_j \circ a \circ e_j + e_j \circ e_j \circ a\big), \tag{V.5} $$

where $a = \frac{1}{n}\sum_{l=1}^n y_l x_l$. Then we construct the empirical moment-based initial tensor $\mathcal{T}_s'$ as follows.

  • For $i, j, k \neq 0$: $\mathcal{T}_s'[i,j,k] = \mathcal{T}_s[i,j,k]$, $\mathcal{T}_s'[i,j,0] = \mathcal{T}_s[i,j,0]$, $\mathcal{T}_s'[0,j,k] = \mathcal{T}_s[0,j,k]$, and $\mathcal{T}_s'[i,0,k] = \mathcal{T}_s[i,0,k]$.

  • For i ≠ 0, Ts[0,0,i]=Ts[0,i,0]=Ts[i,0,0]=13Ts[0,0,i]16(k=1pTs[k,k,i](p+2)ai).

  • Ts[0,0,0]=12p2(k=1pTs[0,k,k](p+2)Ts[0,0,0]).

Lemma 5 shows that $\mathcal{T}_s'$ is an unbiased estimator of $\mathcal{B}$.

The theoretical results in Section IV imply the following upper and lower bounds for the three-way interaction effect estimation.

Corollary 1. Suppose $z_1, \ldots, z_n$ are i.i.d. standard Gaussian random vectors and $\mathcal{B}$ satisfies Conditions 1, 2, and 3. The output $\hat{\mathcal{B}}$ of the proposed Algorithms 1 and 2 based on $\mathcal{T}_s'$ satisfies

$$ \|\hat{\mathcal{B}} - \mathcal{B}\|_F^2 \leq \frac{C\sigma^2 K\, s\log p}{n} \tag{V.6} $$

with high probability. On the other hand, considering the following class of B,

$$ \mathcal{F}_{p+1,K,s} = \Big\{\mathcal{B} = \sum_{k=1}^K \eta_k\,\beta_k \circ \beta_k \circ \beta_k :\ \|\beta_k\|_0 \leq s \text{ for } k \in [K],\ \mathcal{B} \text{ satisfies Conditions 1-3}\Big\}. $$

Then the following lower bound holds,

$$ \inf_{\hat{\mathcal{B}}}\ \sup_{\mathcal{B} \in \mathcal{F}_{p+1,K,s}} \mathbb{E}\|\hat{\mathcal{B}} - \mathcal{B}\|_F^2 \geq \frac{C\sigma^2 K\, s\log p}{n}. $$

VI. Non-Symmetric Tensor Estimation Model

In this section, we extend the previous results to the non-symmetric tensor case. Specifically, we have J*p1×p2×p3 and

$$ y_i = \langle \mathcal{J}^*, \mathcal{X}_i\rangle + \epsilon_i, \qquad \mathcal{X}_i = u_i \circ v_i \circ w_i, \qquad i \in [n], \tag{VI.1} $$

where uip1, vip2, wip3 are random vectors with i.i.d. standard normal entries. Again, we assume J* is sparse and low-rank in a similar sense that

$$ \mathcal{J}^* = \sum_{k=1}^K \eta_k^*\,\beta_{1k}^* \circ \beta_{2k}^* \circ \beta_{3k}^*, \qquad \|\beta_{1k}^*\|_2 = \|\beta_{2k}^*\|_2 = \|\beta_{3k}^*\|_2 = 1, \qquad \max\{\|\beta_{1k}^*\|_0, \|\beta_{2k}^*\|_0, \|\beta_{3k}^*\|_0\} \leq s. \tag{VI.2} $$

Denote

  • B1 = (β11, ⋯, β1K), B2 = (β21, ⋯, β2K), B3 = (β31, ⋯, β3K),

  • $U = (u_1, \ldots, u_n)$, $V = (v_1, \ldots, v_n)$, $W = (w_1, \ldots, w_n)$, $\eta = (\eta_1, \ldots, \eta_K)$, $y = (y_1, \ldots, y_n)$.

Then, the empirical risk function can be written compactly as

$$ \mathcal{L}(B_1, B_2, B_3, \eta) = \frac{1}{n}\Big\|\big[(U^\top B_1) * (V^\top B_2) * (W^\top B_3)\big]\eta - y\Big\|_2^2. \tag{VI.3} $$

Since (VI.3) is non-convex but fortunately tri-convex in terms of B1, B2, and B3, we develop a block-wise thresholded gradient descent algorithm as detailed below. The complete algorithm is deferred to Section O1 in the supplementary materials.
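The compact form (VI.3) is convenient to evaluate. The base-R sketch below simulates a small non-symmetric instance and computes the empirical risk; all dimensions, factors, and data are illustrative.

```r
# Evaluate the non-symmetric empirical risk (VI.3) on simulated data.
set.seed(5)
p1 <- 8; p2 <- 9; p3 <- 10; K <- 2; n <- 400; sigma <- 0.1
B1 <- matrix(rnorm(p1 * K), p1, K); B2 <- matrix(rnorm(p2 * K), p2, K)
B3 <- matrix(rnorm(p3 * K), p3, K); eta <- runif(K, 1, 2)

U <- matrix(rnorm(p1 * n), p1, n)   # columns u_i
V <- matrix(rnorm(p2 * n), p2, n)   # columns v_i
W <- matrix(rnorm(p3 * n), p3, n)   # columns w_i
# y_i = sum_k eta_k (u_i' b_1k)(v_i' b_2k)(w_i' b_3k) + eps_i
y <- as.vector(((t(U) %*% B1) * (t(V) %*% B2) * (t(W) %*% B3)) %*% eta) + rnorm(n, sd = sigma)

risk <- function(B1, B2, B3, eta) {
  fit <- ((t(U) %*% B1) * (t(V) %*% B2) * (t(W) %*% B3)) %*% eta   # Hadamard products, then eta
  sum((fit - y)^2) / n
}
risk(B1, B2, B3, eta)   # near sigma^2 at the true parameters
```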

Step 1: (Method of Tensor Moments)

Construct the empirical moment-based estimator

$$ \mathcal{T} := \frac{1}{n}\sum_{i=1}^n y_i\, u_i \circ v_i \circ w_i \in \mathbb{R}^{p_1 \times p_2 \times p_3} \tag{VI.4} $$

to which sparse tensor decomposition is applied for initialization.

Step 2: (Block-wise Gradient Descent)

Lemma 17 shows that the gradient function for (VI.3) with respect to B1 can be written as

$$ \nabla_{B_1}\mathcal{L}(B_1, B_2, B_3, \eta) = D^\top\big(C_1^\top \odot U\big)^\top \in \mathbb{R}^{1 \times (p_1 K)}, \tag{VI.5} $$

where $D = \big[(U^\top B_1) * (V^\top B_2) * (W^\top B_3)\big]\eta - y \in \mathbb{R}^n$ and $C_1 = \big[(V^\top B_2) * (W^\top B_3)\big]\mathrm{diag}(\eta) \in \mathbb{R}^{n \times K}$. For $t = 1, \ldots, T$, we fix $B_2^{(t)}, B_3^{(t)}$ and update $B_1^{(t+1)}$ via block-wise thresholded gradient descent,

$$ \mathrm{vec}(B_1^{(t+1)}) = \varphi_{\mu h(B_1^{(t)})/\phi}\Big(\mathrm{vec}(B_1^{(t)}) - \frac{\mu}{\phi}\nabla_{B_1}\mathcal{L}(B_1^{(t)}, B_2^{(t)}, B_3^{(t)}, \eta)\Big), $$

where $\phi = \sum_{i=1}^n y_i^2/n$, $\mu$ is the step size, and $h(B_1) = \frac{4\sqrt{\log(np)}}{n}\sqrt{2\,(D^{2})^\top C_1^{2}}$, with the squares taken entry-wise. The updates of $B_2, B_3$ are similar.

The theoretical analysis for the non-symmetric case differs from the symmetric one in two respects. First, the non-symmetric cubic sketching tensor is formed by three independent Gaussian vectors rather than one, which changes many of the high-order moment calculations. Second, the CP-decomposition of the non-symmetric tensor $\mathcal{J}^*$ in (VI.2) leads to a tri-convex optimization problem, so the standard convex analysis for vanilla gradient descent [55] can be applied given a proper initialization.

With the regularity conditions detailed in Section O1, we present the theoretical results for non-symmetric tensor estimation as follows.

Theorem 5 (Upper Bound). Suppose Conditions 6 – 9 hold and n ≳ (slog(p0/s))3/2, where p0 = max{p1,p2,p3}. For any t = 0,1,2,…, the output of Algorithm O1 satisfies

$$ \sum_{k=1}^K\sum_{j=1}^3 \big\|\eta_k^{1/3}\beta_{jk}^{(t+1)} - (\eta_k^*)^{1/3}\beta_{jk}^*\big\|_2^2 \leq O_p\Big(\kappa^t + \frac{\sigma^2 s\log p_0}{n}\Big) $$

for some $0 < \kappa < 1$. When the total number of iterations is no smaller than $\log\big(n\sigma^{-2}(s\log p_0)^{-1}\big)/\log\kappa^{-1}$, the final estimator $\hat{\mathcal{J}}$ satisfies

$$ \|\hat{\mathcal{J}} - \mathcal{J}^*\|_F^2 \leq O_p\Big(\frac{\sigma^2 K\, s\log p_0}{n}\Big). $$

Theorem 6 (Lower Bound). Consider the class of incoherent sparse and low-rank tensors $\mathcal{F} = \{\mathcal{J} : \mathcal{J} = \sum_{k=1}^K \beta_{1k} \circ \beta_{2k} \circ \beta_{3k},\ \|\beta_{ik}\|_0 \leq s \text{ for } i = 1, 2, 3,\ k = 1, \ldots, K\}$. If $\{\mathcal{X}_i\}_{i=1}^n$ are i.i.d. standard normal cubic sketchings, $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, $\min\{p_1, p_2, p_3\} \geq 20s$, and $s \geq 4$, we have

$$ \inf_{\hat{\mathcal{J}}}\ \sup_{\mathcal{J} \in \mathcal{F}} \mathbb{E}\|\hat{\mathcal{J}} - \mathcal{J}\|_F^2 \geq \frac{C\sigma^2 s K\log(ep_0/s)}{n}. \tag{VI.6} $$

Theorems 5 and 6 imply that the proposed algorithm achieves a minimax-optimal rate of estimation error in the class of F as long as log(p0) ≍ log(p0/s).

VII. Numerical Results

In this section, we investigate the effect of noise level, CP-rank, sample size, dimension, and sparsity on the estimation performance by simulation studies. We also investigate the numerical performance of the proposed algorithm when the incoherence assumption required in the theoretical analysis fails to hold.

In each setting, we generate $\mathcal{J}^* = \sum_{k=1}^K \beta_k^* \circ \beta_k^* \circ \beta_k^*$, where $|\mathrm{supp}(\beta_k^*)| = s$, the support of $\beta_k^*$ is uniformly selected from $\{1, \ldots, p\}$, and the nonzero entries of $\beta_k^*$ are drawn independently from the standard normal distribution. Then we calculate $\eta_k^* \leftarrow \|\beta_k^*\|_2^3$ and normalize $\beta_k^* \leftarrow \beta_k^*/\|\beta_k^*\|_2$. The cubic sketchings $\{\mathcal{X}_i\}_{i=1}^n$ are generated as $\mathcal{X}_i = x_i \circ x_i \circ x_i$ with $x_i \overset{iid}{\sim} N(0, I_p)$. The noise satisfies $\{\epsilon_i\}_{i=1}^n \overset{iid}{\sim} N(0, \sigma^2)$ or Laplace with mean 0 and variance $\sigma^2$. Additionally, we adopt the following stopping rules: (1) the initialization iteration (Step 2 in Algorithm 1) is stopped if $\|b_m^{(l+1)} - b_m^{(l)}\|_2 \leq 10^{-6}$; (2) the gradient update iteration (Step 3 in Algorithm 2) is stopped if $\|B^{(T+1)} - B^{(T)}\|_F \leq 10^{-6}$. The numerical results are based on 200 repetitions unless otherwise specified. The code was written in R and implemented on an Intel Xeon E5 processor with 64 GB of RAM.
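For reference, a base-R sketch of this generation recipe and of the error criteria used below is given here; sizes are illustrative and `J_hat` stands for the output of the two-stage procedure (not recomputed in this snippet).

```r
# Simulation recipe: sparse factors, eta_k = ||beta_k||^3, then unit-normalize the factors.
set.seed(6)
p <- 30; K <- 3; s <- round(0.3 * p)
beta <- matrix(0, p, K); eta <- numeric(K)
for (k in 1:K) {
  supp <- sample(p, s)
  beta[supp, k] <- rnorm(s)
  eta[k]  <- sqrt(sum(beta[, k]^2))^3        # eta_k = ||beta_k||_2^3
  beta[, k] <- beta[, k] / sqrt(sum(beta[, k]^2))
}
J_star <- array(0, dim = c(p, p, p))
for (k in 1:K) J_star <- J_star + eta[k] * outer(outer(beta[, k], beta[, k]), beta[, k])

# relative Frobenius error and the "successful recovery" criterion used in the text
rel_err <- function(J_hat, J_star) sqrt(sum((J_hat - J_star)^2)) / sqrt(sum(J_star^2))
success <- function(J_hat, J_star) rel_err(J_hat, J_star) < 1e-4
```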

First, we consider the percentage of successful recovery in the noiseless case. Let K = 3, s/p = 0.3, and p = 30 or 50, so that the total number of unknown parameters in $\mathcal{J}^*$ is $2.7 \times 10^4$ or $1.25 \times 10^5$. The sample size n ranges from 500 to 6000. A recovery is called "successful" if the relative error $\|\hat{\mathcal{J}} - \mathcal{J}^*\|_F/\|\mathcal{J}^*\|_F < 10^{-4}$. We report the average successful recovery rate in Figure 2. We can see from Figure 2 that the empirical relation among successful recovery, dimension, and sample size is consistent with the theoretical results in Section IV.

We then move to the noisy case. Select K = 3, s/p = 0.3, p ∈ {30, 50}, {ϵi}i=1n~iidN(0,σ2). We consider two scenarios: (1) sample size n = 6000, 8000, or 10000, s/p = 0.3, the noise level σ varies from 0 to 200; (2) noise level σ = 200, sample size n varies from 4000 to 10000, p = 30, s/p = 0.1,0.3,0.5. The estimation errors in terms of J^J*F/J*F in these two scenarios are plotted in Figures 3 and 4, respectively. These results show that the proposed procedure achieves a good performance – Algorithms 1 and 2 yield more accurate estimation with smaller variance σ2 and/or large value of sample size n.

Fig. 3. Estimation error under different noise levels. Left panel: p = 30; right panel: p = 50.

Fig. 4. Estimation error under different dimension/sample-size ratios (n/p³). Left panel: initial estimation error; right panel: final estimation error.

Next, we demonstrate that the low-rank tensor parameter J* with randomly generated factors βk* satisfies the incoherence condition 3 with high probability. Set the CP-rank K = 3 and the sparsity level s/p = 0.3 with the dimension p ranging from 10 to 2000. We compute the incoherence parameter Γ defined in Condition 3. The left panel of Figure 5 shows that the incoherence parameter Γ decays in a polynomial rate as s grows, which matches the bound in Condition 3. Recall a theoretical justification on this point is also provided in Lemma 28.

Fig. 5. Left panel: incoherence parameter Γ with varying sparsity; the red line corresponds to the rate in s required in the theoretical analysis. Right panel: average relative estimation error for tensors with varying incoherence.

We further examine the performance of the proposed algorithm when the incoherence condition required in the theoretical analysis fails to hold. Specifically, we set the CP-rank K = 3, p = 30, and the sparsity level s/p = 0.3. We construct a large number of copies of the tensor parameter $\mathcal{J}_j^*$ with i.i.d. standard normal factor vectors $\beta_k^*$. For each $\mathcal{J}_j^*$, we calculate the incoherence $\Gamma_j$ defined in Condition 3, and then pick 40 of the $\mathcal{J}_j^*$ such that

$$ 0.01(j-1) \leq \Gamma_j \leq 0.01 j \quad \text{for } j \in \{1, 2, \ldots, 40\}. $$

In this way, we obtain a set of tensor parameters {Jj*} with incoherence uniformly varying from 0 to 0.4. The right panel of Figure 5 plots the relative error for estimating J* based on observations from cubic sketchings of Jj* based on 1000 repetitions. We can see that the proposed algorithm achieves small relative errors even when the true factors are highly coherent.

Moreover, we consider a setting with Laplace noise. Suppose $\{\epsilon_i\}_{i=1}^n$ are i.i.d. Laplace with mean 0 and variance $\sigma^2$, i.e., with density $f(x) = \frac{1}{\sqrt{2}\sigma}\exp(-\sqrt{2}|x|/\sigma)$. With n = 3000, p = 30, and varying values of σ, the average estimation error and its comparison with the Gaussian noise setting are provided in Figure 6. We note that the estimation errors under Laplace noise are slightly higher than those under Gaussian noise.

Fig. 6. Comparison of estimation errors under Laplace and Gaussian noise.

We also compare the estimation errors of initial and final estimators for different ranks and sample sizes. Set K = 3,p = 30,s/p = 0.3 and consider the noiseless setting. It is clear from Figure 7 that the initialization error decays sufficiently, but does not converge to zero as sample size n grows. This result matches our theoretical findings in Theorem 2: as discussed in Remark 5, the initial stage may yield an inconsistent estimator due to the incoherence among βk’s. We also evaluate and compare the estimation errors for both initial and final estimators. From the right panel of Figure 7, we can see that the final estimator is more stable and accurate compared to the initial one, which illustrates the merit of thresholded gradient descent step of the proposed procedure.

Fig. 7. Log relative estimation error: initial estimation error (left panel) and initialization/final estimation error (right panel).

Finally, we compare the performance of the proposed method with the alternating least squares (ALS)-based tensor regression method [3]. We consider two schemes for the initialization of ALS: (a) $\{\beta_k^{(0)}\}$ are i.i.d. standard Gaussian (cold start), and (b) $\{\beta_k^{(0)}\}$ are generated from the proposed Algorithm 1 (warm start). Setting K = 2, s/p = 0.2, p = 30, and $\{\epsilon_i\}_{i=1}^n \overset{iid}{\sim} N(0, 200^2)$, we apply both the proposed procedure and the ALS-based algorithm and record the average estimation errors and standard deviations for both the initial and final estimators. From the results in Table I, one can see that the proposed algorithm significantly outperforms ALS under both the cold and warm start schemes. The main reason is the one pointed out in Remark 8: the cubic sketching setting possesses distinct aspects compared with the i.i.d. random Gaussian sketching setting, so the method proposed by [3] does not exactly fit here.

VIII. Discussions

This paper focuses on the third order tensor estimation via cubic sketchings. Moreover, all results can be extended to the higher-order case via high-order sketchings. To be specific, suppose

$$ y_i = \langle \mathcal{J}^*, x_i^{\otimes m}\rangle + \epsilon_i, \quad i = 1, \ldots, n, $$

where J*(p)m is an order-m, sparse, and low-rank tensor. In order to estimate J* based on {yi,xi}i=1n, one can first construct the order-m moment-based estimator using a generalized version of Theorem 7 and the fact that the score functions Sm(x)=(1)mmp(x)/p(x) for the density function p(x) satisfy a nice recursive equation:

$$ S_m(x) := -S_{m-1}(x) \circ \nabla\log p(x) - \nabla S_{m-1}(x). $$

Then, one can similarly perform high-order sparse tensor decomposition and thresholded gradient descent to estimate J*. On the theoretical side, we can show if mild conditions hold and nC(logn)m(slogp)m/2, the proposed procedure achieves

$$ \|\tilde{\mathcal{T}} - \mathcal{T}^*\|_F^2 \lesssim \frac{\sigma^2 K\, m s\log(p/s)}{n} $$

with high probability. The minimax optimality can be shown similarly.

Fig. 2. Successful rate of recovery with varying sample size.

TABLE I. Estimation error and standard deviation (in parentheses) of the proposed method and the ALS-based method

Sample size | Ours        | Warm start   | Cold start   | Initial
n = 4000    | 4.02 (0.13) | 32.82 (1.79) | 37.78 (1.23) | 38.03 (1.74)
n = 5000    | 4.02 (0.13) | 32.34 (2.34) | 36.96 (2.10) | 33.71 (1.78)
n = 6000    | 1.77 (0.09) | 22.22 (1.21) | 59.97 (3.40) | 25.57 (1.48)

Acknowledgment

Guang Cheng would like to acknowledge support by NSF DMS-1712907, DMS-1811812, DMS-1821183, and Office of Naval Research (ONR N00014- 18-2759). While completing this work, Guang Cheng was a member of Institute for Advanced Study, Princeton and visiting Fellow of SAMSI for the Deep Learning Program in the Fall of 2019; he would like to thank both Institutes for their hospitality. Anru Zhang would like to acknowledge support by NSF CAREER-1944904, NSF DMS-1811868, and NIH R01 GM131399.

Biography

Botao Hao received B.S. degree from the School of Mathematics, Nankai University, China, in 2014, and Ph.D. from the Department of Statistics, Purdue University, USA, 2019. He is currently a postdoctoral researcher in Department of Electrical Engineering, Princeton University, USA.

Anru Zhang received the Ph.D. degree from University of Pennsylvania, Philadelphia, PA, in 2015, and B.S. degree from Peking University, Beijing, China, in 2010. He is currently an assistant professor in Statistics at the University of Wisconsin-Madison, Madison, WI. His current research interests include high-dimensional statistical inference, tensor data analysis, statistical learning theory, dimension reduction, and convex/non-convex optimization.

Guang Cheng received BA degree in Economics from Tsinghua University, China, in 2002, and PhD degree from University of Wisconsin–Madison in 2006. He then joined Dept of Statistics at Duke University as Visiting Assistant Professor and Postdoc Fellow in SAMSI. He is currently Professor in Statistics at Purdue University, directing Big Data Theory research group, whose main goal is to develop computationally efficient inferential tools for big data with statistical guarantees.

Appendix

This appendix contains five parts: (1) Sections A-B provide detailed proofs for the empirical moment estimator and the concentration results; (2) Sections C-N provide additional proofs for the main theoretical results of this paper; (3) Section O covers the pseudo-code, conditions, and main proofs for non-symmetric tensor estimation; (4) Section P discusses the matrix form of the gradient function and stochastic gradient descent; (5) Section Q provides several technical lemmas and their proofs.

A. Moment Calculation

We first introduce three lemmas showing that the empirical moment-based tensors (III.6), (V.5), and (VI.4) are unbiased estimators of the target low-rank tensor in the corresponding scenarios. Detailed proofs of the three lemmas are postponed to Sections G1, G2, and G3 in the supplementary materials.

Lemma 3 (Unbiasedness of moment estimator under non-symmetric sketchings). For non-symmetric tensor estimation model (VI.1) & (VI.2), define the empirical moment-based tensor T by

$$ \mathcal{T} := \frac{1}{n}\sum_{i=1}^n y_i\, u_i \circ v_i \circ w_i. $$

Then T is an unbiased estimator for J*, i.e.,

$$ \mathbb{E}(\mathcal{T}) = \sum_{k=1}^K \eta_k^*\,\beta_{1k}^* \circ \beta_{2k}^* \circ \beta_{3k}^*. $$

The extension to the symmetric case is non-trivial due to the dependency among three identical sketching vectors. We borrow the idea of high-order Stein’s identity, which was originally proposed in [49]. To fix the idea, we present only third order result for simplicity. The extension to higher-order is straightforward.

Theorem 7 (Third-order Stein's Identity, [49]). Let $x \in \mathbb{R}^p$ be a random vector with joint density function $p(x)$. Define the third-order score function $S_3(x): \mathbb{R}^p \to \mathbb{R}^{p \times p \times p}$ as $S_3(x) = -\nabla^3 p(x)/p(x)$. Then for any continuously differentiable function $G(x): \mathbb{R}^p \to \mathbb{R}$, we have

$$ \mathbb{E}\big[G(x)\,S_3(x)\big] = \mathbb{E}\big[\nabla^3 G(x)\big]. \tag{A.1} $$

In general, the order-m high-order score function is defined as

$$ S_m(x) = (-1)^m\,\frac{\nabla^m p(x)}{p(x)}. $$

Interestingly, the high-order score function has a recursive differential representation

$$ S_m(x) := -S_{m-1}(x) \circ \nabla\log p(x) - \nabla S_{m-1}(x), \tag{A.2} $$

with $S_0(x) = 1$. This recursive form is helpful for constructing an unbiased tensor estimator under symmetric cubic sketchings. Note that the first-order score function $S_1(x) = -\nabla\log p(x)$ is the same as the score function in Lemma 26 (Stein's lemma [56]). The proof of Theorem 7 relies on iteratively applying the recursive representation of the score function (A.2) and the first-order Stein's lemma (Lemma 26). We provide the detailed proof in Section F for the sake of completeness.

In particular, if x follows a standard Gaussian vector, each order score function can be calculated based on (A.2) as follows,

$$ S_1(x) = x, \qquad S_2(x) = xx^\top - I_p, \qquad S_3(x) = x \circ x \circ x - \sum_{j=1}^p \big(x \circ e_j \circ e_j + e_j \circ x \circ e_j + e_j \circ e_j \circ x\big). \tag{A.3} $$

Interestingly, if we let $G(x) = \sum_{k=1}^K \eta_k^*(x^\top\beta_k^*)^3$, then

$$ \frac{1}{6}\nabla^3 G(x) = \sum_{k=1}^K \eta_k^*\,\beta_k^* \circ \beta_k^* \circ \beta_k^*, \tag{A.4} $$

which is exactly J*. Connecting this fact with (A.1), we are able to construct the unbiased estimator in the following lemma through high-order Stein’s identity.
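A small base-R sketch of the Gaussian score function $S_3$ in (A.3), together with a Monte Carlo check of (A.1) for the simple choice $G(x) = \eta(x^\top b)^3$ (whose third derivative tensor is $6\eta\, b \circ b \circ b$), is given below; the dimension, weight, and sample size are illustrative.

```r
# Monte Carlo check of the third-order Stein identity (A.1) for Gaussian x.
set.seed(7)
p <- 6; b <- rnorm(p); b <- b / sqrt(sum(b^2)); eta <- 1.5

S3 <- function(x) {
  out <- outer(outer(x, x), x)
  for (j in 1:p) {
    ej <- rep(0, p); ej[j] <- 1
    out <- out - outer(outer(x, ej), ej) - outer(outer(ej, x), ej) - outer(outer(ej, ej), x)
  }
  out
}

n <- 20000
acc <- array(0, dim = c(p, p, p))
for (i in 1:n) {
  x <- rnorm(p)
  acc <- acc + eta * sum(x * b)^3 * S3(x)       # G(x) * S3(x)
}
acc <- acc / n
target <- 6 * eta * outer(outer(b, b), b)        # E[grad^3 G(x)]
max(abs(acc - target))                           # small for large n (Monte Carlo error)
```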

Lemma 4 (Unbiasedness of moment estimator under symmetric sketchings). Consider the symmetric tensor estimation model (III.1) & (IV.9). Define the empirical first-order moment m1:=1ni=1nyixi. If we further define an empirical third-order-moment-based tensor Ts by

$$ \mathcal{T}_s := \frac{1}{6}\Big[\frac{1}{n}\sum_{i=1}^n y_i\, x_i \circ x_i \circ x_i - \sum_{j=1}^p \big(m_1 \circ e_j \circ e_j + e_j \circ m_1 \circ e_j + e_j \circ e_j \circ m_1\big)\Big], $$

then

$$ \mathbb{E}(\mathcal{T}_s) = \sum_{k=1}^K \eta_k^*\,\beta_k^* \circ \beta_k^* \circ \beta_k^*. $$

Proof. Note that yi = G(xi) + ϵi. Then we have

$$ \mathbb{E}\Big(\frac{1}{n}\sum_{i=1}^n y_i\, S_3(x_i)\Big) = \mathbb{E}\Big(\frac{1}{n}\sum_{i=1}^n \big(G(x_i) + \epsilon_i\big)S_3(x_i)\Big), $$

where S3(x) is defined in (A.3). By using the conclusion in Theorem 7 and the fact (A.4), we obtain

$$ \mathbb{E}(\mathcal{T}_s) = \mathbb{E}\Big(\frac{1}{6n}\sum_{i=1}^n y_i\, S_3(x_i)\Big) = \sum_{k=1}^K \eta_k^*\,\beta_k^* \circ \beta_k^* \circ \beta_k^*, $$

since ϵi is independent of xi. This ends the proof. ■

Although the interaction effect model (V.1) is still based on symmetric sketchings, the moment-based estimator requires a much more careful construction, since the first coordinate of the sketching vector is always the constant 1. We give such an estimator in the following lemma.

Lemma 5 (Unbiasedness of moment estimator in interaction model). For the interaction effect model (V.1), construct the empirical moment-based tensor $\mathcal{T}_s'$ as follows.

  • For $i, j, k \neq 0$: $\mathcal{T}_s'[i,j,k] = \mathcal{T}_s[i,j,k]$. Moreover, $\mathcal{T}_s'[i,j,0] = \mathcal{T}_s[i,j,0]$, $\mathcal{T}_s'[0,j,k] = \mathcal{T}_s[0,j,k]$, and $\mathcal{T}_s'[i,0,k] = \mathcal{T}_s[i,0,k]$.

  • For i ≠ 0, Ts[0,0,i]=Ts[0,i,0]=Ts[i,0,0]=13Ts[0,0,i]16(k=1pTs[k,k,i](p+2)ai).

  • Ts[0,0,0]=12p2(k=1pTs[0,k,k](p+2)Ts[0,0,0])

Then $\mathcal{T}_s'$ is an unbiased estimator of $\mathcal{B}$, i.e.,

$$ \mathbb{E}(\mathcal{T}_s') = \sum_{k=1}^K \eta_k\,\beta_k \circ \beta_k \circ \beta_k. $$

B. Proofs of Lemmas 1 and 2: Concentration Inequalities

We aim to prove Lemmas 1 and 2 in this subsection. These two lemmas provide key concentration inequalities of the theoretical analysis for the main result. Before going into technical details, we introduce a quasi-norm called ψα-norm.

Definition 1 (ψα-norm [34]). The ψα-norm of any random variable X and α > 0 is defined as

$$ \|X\|_{\psi_\alpha} := \inf\big\{C \in (0, \infty) : \mathbb{E}\big[\exp\big((|X|/C)^\alpha\big)\big] \leq 2\big\}. $$

In particular, a random variable with a bounded ψ2-norm or a bounded ψ1-norm is called sub-Gaussian or sub-exponential, respectively. The next lemma provides an upper bound for the p-th moment of a sum of random variables with bounded ψα-norm.

Lemma 6. Suppose X1,…,Xn are n independent random variables satisfying Xiψαb with α > 0, then for all a=(a1,,an)n and p ≥ 2,

$$ \Big(\mathbb{E}\Big|\sum_{i=1}^n a_i X_i - \mathbb{E}\Big(\sum_{i=1}^n a_i X_i\Big)\Big|^p\Big)^{1/p} \leq \begin{cases} C_1(\alpha)\,b\big(\sqrt{p}\,\|a\|_2 + p^{1/\alpha}\|a\|_\infty\big), & \text{if } 0 < \alpha < 1; \\ C_2(\alpha)\,b\big(\sqrt{p}\,\|a\|_2 + p^{1/\alpha}\|a\|_{\alpha^*}\big), & \text{if } \alpha \geq 1, \end{cases} \tag{A.5} $$

where $1/\alpha^* + 1/\alpha = 1$ and $C_1(\alpha), C_2(\alpha)$ are absolute constants depending only on α.

If 0 < α < 1, (A.5) is a combination of Theorem 6.2 in [57] and the fact that the p-th moment of a Weibull variable with parameter α is of order $p^{1/\alpha}$. If α ≥ 1, (A.5) follows from a combination of Corollaries 2.9 and 2.10 in [58]. Continuing with standard symmetrization arguments, we reach the conclusion for general random variables. When α = 1 or 2, (A.5) coincides with the standard moment bounds for sums of sub-Gaussian and sub-exponential random variables in [59]. The detailed proof of Lemma 6 is postponed to Section H.

When 0 < α < 1, by Chebyshev’s inequality, one can obtain the following exponential tail bound for the sum of random variables with bounded ψα-norm. This lemma generalizes the Hoeffding-type concentration inequality for sub-Gaussian random variables (see, e.g. Proposition 5.10 in [59]), and Bernstein-type concentration inequality for sub-exponential random variables (see, e.g. Proposition 5.16 in [59]).

Lemma 7. Suppose 0 < α < 1, X1,…,Xn are independent random variables satisfying Xiψαb. Then there exists absolute constant C(α) only depending on α such that for any a=(a1,,an)n and 0 < δ < 1/e2,

$$ \Big|\sum_{i=1}^n a_i X_i - \mathbb{E}\Big(\sum_{i=1}^n a_i X_i\Big)\Big| \leq C(\alpha)\,b\,\|a\|_2(\log\delta^{-1})^{1/2} + C(\alpha)\,b\,\|a\|_\infty(\log\delta^{-1})^{1/\alpha}, $$

with probability at least 1 − δ.

Proof. For any t > 0, by Markov's inequality,

$$ \mathbb{P}\Big(\Big|\sum_{i=1}^n a_i X_i - \mathbb{E}\Big(\sum_{i=1}^n a_i X_i\Big)\Big| \geq t\Big) = \mathbb{P}\Big(\Big|\sum_{i=1}^n a_i X_i - \mathbb{E}\Big(\sum_{i=1}^n a_i X_i\Big)\Big|^p \geq t^p\Big) \leq \frac{\mathbb{E}\big|\sum_{i=1}^n a_i X_i - \mathbb{E}\big(\sum_{i=1}^n a_i X_i\big)\big|^p}{t^p} \leq \frac{C(\alpha)^p b^p\big(\sqrt{p}\,\|a\|_2 + p^{1/\alpha}\|a\|_\infty\big)^p}{t^p}, $$

where the last inequality follows from Lemma 6. We choose t such that $\exp(-p) = C(\alpha)^p b^p(\sqrt{p}\,\|a\|_2 + p^{1/\alpha}\|a\|_\infty)^p/t^p$. Then for p ≥ 2,

$$ \Big|\sum_{i=1}^n a_i X_i - \mathbb{E}\Big(\sum_{i=1}^n a_i X_i\Big)\Big| \leq e\,C(\alpha)\,b\big(\sqrt{p}\,\|a\|_2 + p^{1/\alpha}\|a\|_\infty\big) $$

holds with probability at least $1 - \exp(-p)$. Letting $\delta = \exp(-p)$, we have that for any $0 < \delta < 1/e^2$,

$$ \Big|\sum_{i=1}^n a_i X_i - \mathbb{E}\Big(\sum_{i=1}^n a_i X_i\Big)\Big| \leq C(\alpha)\,b\Big(\|a\|_2(\log\delta^{-1})^{1/2} + \|a\|_\infty(\log\delta^{-1})^{1/\alpha}\Big) $$

holds with probability at least 1 − δ. This ends the proof. ■

The next lemma provides an upper bound for the product of random variables in ψα-norm.

Lemma 8 (ψα for product of random variables). Suppose X1,…,Xm are m random variables (not necessarily independent) with ψα-norm bounded by XjψαKj. Then the ψα/m-norm of j=1mXj is bounded as

$$ \Big\|\prod_{j=1}^m X_j\Big\|_{\psi_{\alpha/m}} \leq \prod_{j=1}^m K_j. $$

Proof. For any $\{x_j\}_{j=1}^m$ and α > 0, by the inequality of arithmetic and geometric means we have

$$ \Big|\prod_{j=1}^m \frac{x_j}{K_j}\Big|^{\alpha/m} = \Big(\prod_{j=1}^m \Big|\frac{x_j}{K_j}\Big|^\alpha\Big)^{1/m} \leq \frac{1}{m}\sum_{j=1}^m \Big|\frac{x_j}{K_j}\Big|^\alpha. $$

Since the exponential function is monotone increasing, it follows that

$$ \exp\Big(\Big|\prod_{j=1}^m \frac{x_j}{K_j}\Big|^{\alpha/m}\Big) \leq \exp\Big(\frac{1}{m}\sum_{j=1}^m \Big|\frac{x_j}{K_j}\Big|^\alpha\Big) = \Big(\prod_{j=1}^m \exp\Big(\Big|\frac{x_j}{K_j}\Big|^\alpha\Big)\Big)^{1/m} \leq \frac{1}{m}\sum_{j=1}^m \exp\Big(\Big|\frac{x_j}{K_j}\Big|^\alpha\Big). \tag{A.6} $$

From the definition of the ψα-norm, each individual $X_j$, $j = 1, 2, \ldots, m$, satisfies

$$ \mathbb{E}\Big[\exp\Big(\Big(\frac{|X_j|}{K_j}\Big)^\alpha\Big)\Big] \leq 2. \tag{A.7} $$

Putting (A.6) and (A.7) together, we obtain

$$ \mathbb{E}\Big[\exp\Big(\Big|\frac{\prod_{j=1}^m X_j}{\prod_{j=1}^m K_j}\Big|^{\alpha/m}\Big)\Big] = \mathbb{E}\Big[\exp\Big(\Big|\prod_{j=1}^m \frac{X_j}{K_j}\Big|^{\alpha/m}\Big)\Big] \leq \frac{1}{m}\sum_{j=1}^m \mathbb{E}\Big[\exp\Big(\Big(\frac{|X_j|}{K_j}\Big)^\alpha\Big)\Big] \leq 2. $$

Therefore, we conclude that the $\psi_{\alpha/m}$-norm of $\prod_{j=1}^m X_j$ is bounded by $\prod_{j=1}^m K_j$. ■

Proof of Lemma 1. Note that for any j = 1, 2, …, m, the ψ2-norm of $x_{ji}^\top\beta_j$ is bounded by $\|\beta_j\|_2$ [59]. According to Lemma 8, the $\psi_{2/m}$-norm of $\prod_{j=1}^m (x_{ji}^\top\beta_j)$ is bounded by $\prod_{j=1}^m \|\beta_j\|_2$. Directly applying Lemma 7, we reach the conclusion. ■

Proof of Lemma 2. We first focus on the non-symmetric version and the proof follows three steps:

  1. Truncate the first coordinate of x1i, x2i, x3i by a carefully chosen truncation level;

  2. Utilize the high-order concentration inequality in Lemma 20 at order three;

  3. Show that the bias caused by truncation is negligible.

With a slight abuse of notation, we use a, x, y, etc. to denote the first coordinates of the vectors a, x, y, etc. Without loss of generality, we set $p := \max\{p_1, p_2, p_3\}$. By unitary invariance, we may assume $\beta_1 = \beta_2 = \beta_3 = e_1$, where $e_1 = (1, 0, \ldots, 0)^\top$.

Then, it is equivalent to prove

$$ \|\mathcal{M}_n^{nsy} - \mathbb{E}(\mathcal{M}_n^{nsy})\|_s = \Big\|\frac{1}{n}\sum_{i=1}^n \langle x_{1i} \circ x_{2i} \circ x_{3i}, e_1 \circ e_1 \circ e_1\rangle\, x_{1i} \circ x_{2i} \circ x_{3i} - e_1 \circ e_1 \circ e_1\Big\|_s \leq C(\log n)^3\Bigg(\sqrt{\frac{s^3\log^3(p/s)}{n^2}} + \sqrt{\frac{s\log(p/s)}{n}}\Bigg). $$

Suppose x1~N(0,Ip1),x2~N(0,Ip2),x3~N(0,Ip3) and {x1i,x2i,x3i}i=1n are n independent samples of {x1, x2, x3}. And define a bounded event Gn for the first coordinate and its corresponding population version,

$$ G_n = \Big\{\max_i\{|x_{1i}|, |x_{2i}|, |x_{3i}|\} \leq M\Big\}, \qquad G = \Big\{\max\{|x_1|, |x_2|, |x_3|\} \leq M\Big\}, $$

where M is a large value to be specified later. The quantity $\|\mathcal{M}_n^{nsy} - \mathbb{E}(\mathcal{M}_n^{nsy})\|_s$ can be upper bounded by $M_1 + M_2$, where

$$ M_1 = \Big\|\frac{1}{n}\sum_{i=1}^n \langle x_{1i} \circ x_{2i} \circ x_{3i}, e_1 \circ e_1 \circ e_1\rangle\, x_{1i} \circ x_{2i} \circ x_{3i} - \mathbb{E}\big(\langle x_1 \circ x_2 \circ x_3, e_1 \circ e_1 \circ e_1\rangle\, x_1 \circ x_2 \circ x_3 \mid G\big)\Big\|_s $$

and

$$ M_2 = \big\|\mathbb{E}\big(\langle x_1 \circ x_2 \circ x_3, e_1 \circ e_1 \circ e_1\rangle\, x_1 \circ x_2 \circ x_3 \mid G\big) - e_1 \circ e_1 \circ e_1\big\|_s. $$

We will show that $M_2$ is negligible relative to the convergence rate of $M_1$.

Bounding M1. For simplicity, we define x1=x1|G, x2=x2|G,x3=x3|G, and {x1i,x2i,x3i}i=1n are n independent samples of {x1,x2,x3}. According to the law of total probability, we have

(M1t)(Gnc)+(M11t),

where

M11=1ni=1nx1ix1ix3ix2ixi1x3iE(x1x1x2x2x3x3)s.

According to Lemma 22, the entry of x1ix1i,x2ix2i,x3ix3i are sub-Gaussian random variable with ψ2-norm M2. Applying Lemma 20, we obtain

(M11C1M6δn,s)1p,

where δn,s = ((slog(p/s))3/n2)1/2 + (slog(p/s)/n)1/2.

On the other hand,

$$ \mathbb{P}(G_n^c) \leq 3\sum_{i=1}^n \mathbb{P}(|x_{1i}| \geq M) \leq 3n\,e^{1 - C_2 M^2}. $$

Putting the above bounds together, we obtain

$$ \mathbb{P}\big(M_1 \geq C_1 M^6\delta_{n,s}\big) \leq \frac{1}{p} + 3n\,e^{1 - C_2 M^2}. $$

By setting $M = 2\sqrt{\log n/C_2}$, the bound on $M_1$ reduces to

$$ \mathbb{P}\Big(M_1 \geq 64\,C_1 C_2^{-3}\,\delta_{n,s}(\log n)^3\Big) \leq \frac{1}{p} + \frac{3e}{n^3}. \tag{A.8} $$

Bounding M2. From the definitions of M2 and sparse spectral norm,

M2=E(x1x2x3x1x2x3G)e1e1e1s=supD(a,b)|E(x1x2x3(x1a)(x2b)(x3c)G)a1b1c1|.

where

D={a2=b2=c2=1,max{a0,b0,c0}s}.

Since x1j is independent of x1k for any jk, E(x1(x1ϱa)G)=E(x12a1G). Similar results hold for x2,x3. Then we have

M2=supD|a1b1c1||E(x12x22x32G)1|E(x12x22x32G)1|=|E(x12||x1M)E(x22||x2M)E(x32||x3M)1.

By the basic property of Gaussian random variable, we can show

1E(xi2||xiM)12MeM2/2,  i=1,2,3.

Plugging them into M2, we have

M2|(12MeM2/2)31||12M2eM26MeM2/28M3e3M2/2||26M3eM2/2|,

where the last inequality holds for a large M > 0. By the choice of M=2logn/C2, we have M2208/C23/2(logn)32/n2 for some constant C2. When n is large, this rate is negligible comparing with (A.8)

Bounding M: We now put the upper bounds on M_1 and M_2 together. After some adjustments of the absolute constants, we obtain

M1+M2C(logn)3(s3log3(p/s)n2+slog(p/s)n),

with probability at least 1 − 10/n³ − 1/p. This concludes the proof of the non-symmetric part. The proof of the symmetric part is similar and thus omitted here. ■

C. Proof of Theorem 2: Initialization Effect

Theorem 2 gives an approximation error upper bound for the sparse-tensor-decomposition-based initial estimator. In Step I of Section III-A, the original problem can be reformatted to a version of tensor denoising:

Ts=J*+E,  where  E=TsE(Ts). (A.9)

The key difference between our model (A.9) and the recent works [50, 27] is that E arises from empirical moment approximation rather than from the random observation noise considered in [50] and [27]. The next lemma gives an upper bound for this approximation error; the proof of Lemma 9 is deferred to Section I.

Lemma 9 (Approximation error of Ts). Recall that E=TsE(Ts), where Ts is defined in (III.6). Suppose Condition 4 is satisfied and sdCs. Then

Es+d2C1k=1Kηk*(s3log3(p/s)n2+slog(p/s)n)(logn)4, (A.10)

with probability at least 1 − 5/n for some uniform constant C1.

Next we denote the following quantity for simplicity,

γ=C2min{R16Ks,R1452s(1+Ks)2}, (A.11)

where R is the singular value ratio, K is the CP-rank, s is the sparsity parameter, Γ is the incoherence parameter and C2 is uniform constant.

The next lemma provides theoretical guarantees for the sparse tensor decomposition method.

Lemma 10. Suppose that the symmetric tensor denoising model (A.9) satisfies Conditions 1, 2 and 3 (i.e., the identifiability, parameter space and incoherence). Assume the number of initializations LKC3γ4 and the number of iterations NC4log(γ/(1ηmin*Es+d+KΓ2)) for constants C3,C4, the truncation parameter sdCs. Then the sparse-tensor-decomposition-based initialization satisfies

max{βk(0)βk*2,|ηk(0)ηk*|}C4ηmin*Es+d+KΓ2, (A.12)

for any k ∈ [K]

The proof of Lemma 10 essentially follows Theorem 3.9 in [27]; we thus omit the details here. The upper bound in (A.12) contains two terms, C4/ηmin* ‖E‖_{s+d} and KΓ², which are due to the empirical moment approximation and the incoherence among the different βk*, respectively.

Although sparse tensor decomposition is not statistically rate-optimal, it offers a reasonable initial estimate given enough samples. Equipped with (A.10) and Condition 2, the right side of (A.12) reduces to

C4ηmin*Es+d+KΓ22C1C4KR(s3log3(p/s)n2+slog(p/s)n)(logn)4+KΓ2,

with probability at least 1−5/n. Denote C0 = 4·2160·C1C4. Using Conditions 3 and 5, we reach the conclusion that

max{βk(0)βk*2,|ηk(0)ηk*|}K1R2/2160,

with probability at least 1 − 5/n.

D. Proof of Theorem 1: Gradient Update

We first introduce the following lemma to illustrate the improvement obtained from one thresholded gradient step under suitable conditions. The error bound consists of two parts: the optimization error, which describes the effect of one gradient step, and the statistical error, which reflects the effect of the random noise. The proof of Lemma 11 is given in Section J. For notational simplicity, we drop the superscript of {βk(t), ηk} in the following proof.

Lemma 11. Let t ≥ 0 be an integer. Suppose Conditions 1–5 hold and {βk(t),ηk} satisfies the following upper bound

k=1Kηk3βk(t)ηk*3βk*224K ηmax*23ε02,maxk[K]|ηkηk*|ε0, (A.13)

with probability at least 1O(K/n), where ε0=K1R43/2160. As long as the step size μ satisfies

0<μμ0=32R20/33K[220+270K]2, (A.14)

then {βk(t+1)} can be upper bounded as

k=1Kηk3βk(t+1)ηk*3βk*22(132μK2R83)k=1nηk3βk(t)ηk*3βk*22optimization error+2C0μ2K2R83ηmin*43σ2slogpnstatistical error ,

with probability at least 1O(Ks/n).

In order to apply Lemma 11, we prove that the required condition (A.13) holds at every iteration step t by induction. When t = 0, by (IV.2) and Condition 2,

βk(0)βk*2ε0,  |ηkηk*|ε0, for k[K],

holds with probability at least 1O(1/n). Since the initial estimator output by first stage is normalized, i.e., βk(0)2=βk*2=1, by triangle inequality we have

ηk3βk(0)ηk*3βk*2ηk3βk(0)ηk*3βk(0)+ηk*3βk(0)ηk*3βk*2|ηk3ηk*3|+ηk*3βk(0)βk*2.

Note that

|ηk3ηk*3|ε0(ηk3)2+ηkηk*3+(ηk*3)2ε0ηk*3.

This implies

ηk3βk(0)ηk*3βk*22ηk*3ε0,

with probability at least 1O(1/n). Taking the summation over k ∈ [K], we have

k=1Kηk3βk(0)ηk*3βk*22k=1K4ηk*23ε024Kηmax*23ε02,

with probability at least 1O(K/n), which means (A.13) holds for t = 0.

Suppose (A.13) holds at the iteration step t − 1, which implies

k=1Kηk3βk(t)ηk*3βk*22(132μK2R83)k=1Kηk3βk(t1)ηk*3βk*22+μ2C0K2R83ηmin*43σ2slogpn4Kηmax*23ε02μ(128KR83ηmax*23ε022C0K2R83ηmin*43σ2slogpn).

Since Condition 5 automatically implies

nslogpC0σ2R23ηmin*23K64ε02,

for a sufficiently large C0, we can obtain

k=1Kηk3βk(t)ηk*3βk*224Kηmax*23ε02.

By induction, (A.13) holds at each iteration step.

Now we are able to use Lemma 11 recursively to complete the proof. Repeatedly using Lemma 11, we have for t = 1, 2, …,

k=1Kηk3βk(t+1)ηk*3βk*22(132μK2R83)tk=1Kηk3βk(0)ηk*3βk*22+C0ηmin*43σ2slogp16,

with probability at least 1O(tKs/n). This concludes the first part of Theorem 1.
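To see the interplay between the two error terms concretely, here is a toy numerical illustration (all constants are hypothetical placeholders, not the paper's tuned quantities) of the recursion just applied: a geometric contraction plus a fixed statistical floor.

```python
# Toy illustration of the recursion behind Theorem 1: each thresholded-gradient
# step contracts the squared error geometrically and adds a statistical term,
# so the iterates settle at the noise floor.
contraction = 0.9     # plays the role of 1 - 32*mu*K^{-2}*R^{-8/3} (hypothetical)
stat_floor = 1e-3     # plays the role of the sigma^2 * s * log(p) / n term (hypothetical)
err = 1.0             # initial squared estimation error from Stage I
for t in range(200):
    err = contraction * err + stat_floor
print(err, stat_floor / (1 - contraction))   # err converges to stat_floor / (1 - contraction)
```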

When the total number of iterations is no smaller than

T*=log(C3ηmin*4/3σ2slogp)log(64ηmax*2/3Kε0n)log(132μK2R8/3),

the statistical error will dominate the whole error bound in the sense that

k=1Kηk3βk(T*)ηk*3βk*22C3ηmin*438σ2slogpn, (A.15)

with probability at least 1O(T*Ks/n).

The next lemma shows that the Frobenius norm distance between two tensors can be bounded by the distances between each factors in their CP decomposition. The proof of this lemma is provided in Section K.

Lemma 12. Suppose J and J* have CP-decomposition J=k=1Kηkβkβkβk and J*=k=1Kηk*βk*βk*βk*. If |ηkηk*|c, then

JJ*F29(1+c)(k=1Kηk3βkηk*3βk*22)(k=1K(ηk*3)4)

Denote J^=k=1Kηkβk(T*)βk(T*)βk(T*). Combining (A.15) and Lemma 12, we have

J^J*F29(1+ε0)C3ηmin*438σ2slogpnKηmax*43,=9C3R4σ2Kslogpn,

with probability at least 1O(TKs/n). By setting C1 = 9C2/4, we complete the proof of Theorem 1.

E. Proofs of Theorems 4 and 6: Minimax Lower Bounds

We first consider the proof for Theorem 6 on non-symmetric tensor estimation. Without loss of generality we assume p = max{p1,p2,p3}. We uniformly randomly generate {Ω(k,m)}m=1,,Mk=1,,K as MK subsets of {1,…,p} with cardinality of s. Here M > 0 is a large integer to be specified later. Then we construct {β(k,m)}m=1,,Mk=1,,Kp as

β_j^{(k,m)} = √λ, if j ∈ Ω^{(k,m)};  β_j^{(k,m)} = 0, if j ∉ Ω^{(k,m)}.

Here λ > 0 will be specified later. Clearly, ‖β^{(k,m_1)} − β^{(k,m_2)}‖_2² ≤ 2sλ for any 1 ≤ k ≤ K and 1 ≤ m_1, m_2 ≤ M. Moreover, the overlap |Ω^{(k,m_1)} ∩ Ω^{(k,m_2)}| follows a hyper-geometric distribution: ℙ(|Ω^{(k,m_1)} ∩ Ω^{(k,m_2)}| = t) = (s choose t)(p−s choose s−t)/(p choose s).
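To make the packing construction concrete, here is a small numpy sketch (hypothetical sizes): random s-subsets Ω^{(k,m)}, sparse vectors with entries √λ on their supports, and the hyper-geometric overlap that drives the distance identity used below (cf. (A.16)).

```python
import numpy as np

# Packing-set sketch for the lower bound (hypothetical sizes).
rng = np.random.default_rng(1)
p, s, K, M, lam = 200, 10, 3, 50, 0.5

omegas = [[rng.choice(p, size=s, replace=False) for _ in range(M)] for _ in range(K)]
betas = np.zeros((K, M, p))
for k in range(K):
    for m in range(M):
        betas[k, m, omegas[k][m]] = np.sqrt(lam)

# Overlap w(k, m1, m2) = |Omega^{(k,m1)} ∩ Omega^{(k,m2)}| and the squared distance
# ||beta^{(k,m1)} - beta^{(k,m2)}||_2^2 = 2*lam*(s - w), as in the display following (A.16).
w = len(np.intersect1d(omegas[0][0], omegas[0][1]))
d2 = np.sum((betas[0, 0] - betas[0, 1]) ** 2)
print(w, d2, 2 * lam * (s - w))   # the last two numbers agree
```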

Let

w(k,m1,m2)=|Ω(k,m1)Ω(k,m2)|, (A.16)

then for any s/2 ≤ ts,

(w(k,m1,m2)=t)=s(st+1)t!(ps)(p2s+t+1)(st)!p(ps+1)s!(st)(sps+1)t2s(sps+1)t(4sps+1)t.

Thus, if η > 0, the moment generating function of w(k,m1,m2)s2 satisfies

Eexp(η(w(k,m1,m2)s2))exp(0)(w(k,m1,m2)s2)+t=s/2+1sexp(η(ts2))(w(k,m1,m2)=t)1+t=s/2+1s(4s/(ps+1))texp(η(ts/2))=1+(4sps+1)s/2+1t=0ss/21(4sps+1)texp(η(t+s/2+1s/2))(*)1+(4sps+1)t=0s/2ss/21(4seηps+1)t=1+(4sps+1)s/21(4seη/(ps+1))ss/214seη/(ps+1)<1+(4s/(ps+1))s/2114s/(ps+1)eη.

Here, (*) is due to η > 0 and ⌊s/2⌋ + 1 ≥ s/2. By setting η = log((ps + 1)/(8s)), we have

(k=1Kw(k,m1,m2)3sK4)=(k=1Kw(k,m1,m2)sK2sK4)Eexp(η(k=1Kw(k,m1,m2)sK2))exp(ηsK4)=k=1KEexp(η(w(k,m1,m2)s2))exp(ηsK4)(1+(4s/(ps+1))s/22)K*exp(sK4log(ps+18s)). (A.17)

Since p ≥ 20s and s ≥ 4, we have

(1+2(4s/(ps+1))s/2)Kexp(Klog(1+2(4p/s1)s/2))exp(Klog(1+2(419)2))exp(K0.085)exp(sKlog(p/s)0.0144),exp(sK4log(ps+18s))=exp(sKlog(p/s)4+sK4log(8p/(ps+1)))exp(sKlog(p/s)4+sK4log(819/20))exp(sKlog(p/s)0.08).

Combining the two inequalities above, we have

(1+(4s/(ps+1))s/22)Kexp(sK4log(ps+18s))exp(c0sKlog(p/s))

for c0 = 1/20.

Next we choose M = ⌊exp(c0/2 · sK log(p/s))⌋. Note that

β(k,m1)β(k,m2)22=λ(|Ω(k,m1)\Ω(k,m2)|+|Ω(k,m2)\Ω(k,m1)|)=λ(|Ω(k,m1)|+|Ω(k,m2)|2|Ω(k,m1)Ω(k,m2)|)=2λ(s|Ω(k,m1)Ω(k,m2)|)=(A.16)2λ(sw(k,m1,m2)),

then we further have

(k=1Kβ(k,m1)β(k,m2)22sKλ2,1m1<m2M)=(k=1K2λ(sw(k,m1,m2))sKλ2,1m1<m2M)=(k=1Kw(k,m1,m2)3K4,1m1<m2M)(A.17)1M(M1)2exp(c0sKlog(p/s))>1M2exp(c0sKlog(p/s))0,

which means that, with positive probability, {β^{(k,m)}}_{k=1,…,K; m=1,…,M} satisfy

sKλ2min1m1<m2Mk=1Kβ(k,m1)β(k,m2)22max1m1<m2Mk=1Kβ(k,m1)β(k,m2)222sKλ. (A.18)

For the rest of the proof, we fix {β(k,m)}k=1,,Km=1,,M to be the set of vectors satisfying (A.18).

Next, recall the canonical basis ek=(0,,1kth,0,,0)p. Define

J(m)=k=1Kβ(k,m)ekek,  1mM.

For each tensor J(m) and n i.i.d. Gaussian sketches ui,vi,wip, we denote the response

y^{(m)} = {y_i^{(m)}}_{i=1}^n,  y_i^{(m)} = ⟨u_i ⊗ v_i ⊗ w_i, J^{(m)}⟩ + ϵ_i,

where ϵi~iidN(0,σ2). Clearly, (y^{(m)}, u, v, w) follows a joint distribution, which varies with the value of m.

In this step, we analyze the Kullback-Leibler divergence between different distribution pairs:

DKL((y(m1),u,v,w),(y(m2),u,v,w)):=E(y(m1),u,v,w)log(p(y(m1),u,v,w)p(y(m2),u,v,w)).

Note that conditioning on fixed values of u, v, w,

yi(m)~N(k=1K(β(k,m)ui)(e(k)vi)(e(k)wi),σ2).

By the KL-divergence formula for Gaussian distribution,

E(y(m1),u,v,w)(p(y(m1),u,v,w)p(y(m2),u,v,w)u,v,w)=12i=1n(k=1K((β(k,m1)β(k,m2))ui)*(e(k)vi)(e(k)wi))2σ2.

Therefore, for any m1 ≠ m2,

DKL((y(m1),u,v,w),(y(m2),u,v,w))=Eu,v,w12i=1n(k=1K(β(k,m1)β(k,m2))ui)(e(k)vi)(e(k)wi))2σ2=σ22i=1nk=1KEu((β(k,m1)β(k,m2))ui)2Ev(e(k)vi)2Ew(e(k)wi)2=nσ22k=1Kβ(k,m1)β(k,m2)22σ2nK sλ.

Meanwhile, for any 1 ≤ m1 < m2M,

J(m1)J(m2)F=k=1K(β(k,m1)β(k,m2))e(k)e(k)F=k=1Kβ(k,m1)β(k,m2)22sKλ2.

By generalized Fano’s Lemma (see, e.g., [60]),

infJ^supJFEJ^JFsKλ2(1σ2nKsλ+log2logM).

Finally we set λ=cσ2nlog(p/s) for some small constant c > 0, then

infJ^supJFET^JF2(infJ^supJFEJ^JF)2cσ2sKlog(p/s)n.

This finishes the proof of Theorem 6.

For the proof for Theorem 4, without loss of generality we assume K is a multiple of 3. We first partition {1,…,p} into two subintervals: I1 = {1,…,pK/3},I2 = {pK/3+1,…,p}, randomly generate {Ω(k,m)}m=1,,Mk=1,,K/3p as (MK/3) subsets of {1, …, pK/3} and construct {β(k,m)}m=1,,Mk=1,,KpK/3 as

β_j^{(k,m)} = √λ, if j ∈ Ω^{(k,m)};  β_j^{(k,m)} = 0, if j ∉ Ω^{(k,m)}.

With M = exp(c·sK log(p/s)) and similar techniques as in the previous proof, one can show that with positive probability

sKλ6min1m1<m2Mk=1K/3β(k,m1)β(k,m2)22max1m1<m2Mk=1K/3β(k,m1)β(k,m2)222sK3λ.

We then construct the following candidate symmetric tensors by blockwise design,

T(m)={T[I1,I2,I2](m)=k=1K/3β(k,m)e(k)e(k),T[I2,I1,I2](m)=k=1K/3e(k)β(k,m)e(k),T[I2,I2,I1](m)=k=1K/3e(k)e(k)β(k,m),T[I1,I1,I1](m),T[I1,I1,I2](m),T[I1,I2,I1](m),T[I2,I1,I1](m),T[I2,I2,I2](m) are all zeros. 

Then we can see for any up,

T(m),uuu=3k=1K/3(β(k,m)uI1)(e(k)uI2)2.

The rest of the proof essentially follows from the proof of Theorem 6. ■

F. Proof of Theorem 7: High-order Stein’s Lemma

The proof of this theorem follows from the one of Theorem 6 in [49]. For the sake of completeness, we restate the detail here. Applying the recursion representation of score function (A.2), we have

E[G(x)S3(x)]=E[G(x)(S2(x)xlogp(x)xS2(x))]=E[G(x)S2(x)xlogp(x)]E[G(x)xS2(x))].

Then, we apply the first-order Stein’s lemma (see Lemma 26) on function G(x)S2(x) and obtain

E[G(x)S3(x)]=E[x(G(x)S2(x))]E[G(x)xS2(x))]=E[xG(x)S2(x)+xS2(x)G(x)]E[G(x)xS2(x))]=E[xG(x)S2(x)].

Repeating the above argument two more times, we reach the conclusion. ■
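As an illustration of the first-order Stein identity (Lemma 26) that drives the recursion above, here is a quick Monte Carlo check for a standard Gaussian, whose score function is ∇_x log p(x) = −x; the scalar test function G below is an arbitrary choice for illustration, not one used in the paper.

```python
import numpy as np

# Monte Carlo sanity check of the first-order Stein identity for a standard Gaussian:
# the score is grad log p(x) = -x, so E[x G(x)] = E[grad G(x)].
rng = np.random.default_rng(0)
d, n = 5, 200_000
b = rng.standard_normal(d); b /= np.linalg.norm(b)   # hypothetical direction
x = rng.standard_normal((n, d))

G = (x @ b) ** 3                                # G(x_i) = (b'x_i)^3
lhs = (x * G[:, None]).mean(axis=0)             # E[x G(x)]
rhs = (3 * (x @ b) ** 2)[:, None] * b           # grad G(x_i) = 3 (b'x_i)^2 b
print(np.abs(lhs - rhs.mean(axis=0)).max())     # small Monte Carlo error
```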

G. Proofs of Lemmas 3, 4, and 5: Moment Calculation

In this subsection, we present the detailed proofs of the moment calculations, including the non-symmetric case, the symmetric case, and the interaction model.

1). Proof of Lemma 3:

By the definition of {yi} in (VI.1) & (VI.2), we have

E(1ni=1nyiuiviwi)=E(1ni=1nϵiuiviwi)+E(1ni=1nlk1Kηk*(β1k*ui)(β2k*vi)(β3k*wi)uiviwi). (A.19)

First, we observe E(ϵiuiviwi)=0 due to the independence between ϵi and {ui,vi,wi}. Then, we consider a single component from a single observation

M=E((β1k*ui)(β2k*vi)(β3k*wi)uiviwi),i[n],k[K].

For notation simplicity, we drop the subscript i for i-th observation and k for k-th component such that

M=E((β1*u)(β2*v)(β3*w)uvw)p1×p2×p3. (A.20)

Each entry of M can be calculated as follows

M_{ijk} = E( (β_1^{*⊤}u)(β_2^{*⊤}v)(β_3^{*⊤}w) · u_i v_j w_k ) = E( (β^*_{1i}u_i + ∑_{m≠i} β^*_{1m}u_m) u_i ) × E( (β^*_{2j}v_j + ∑_{m≠j} β^*_{2m}v_m) v_j ) × E( (β^*_{3k}w_k + ∑_{m≠k} β^*_{3m}w_m) w_k ) = β^*_{1i} β^*_{2j} β^*_{3k},

which implies M = β1β2β3. Combining with n observations and K components, we can obtain

E(T)=1ni=1nk=1Kηk*β1kβ2kβ3k.

This finishes the proof. ■
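As a quick numerical illustration of Lemma 3, the following Monte Carlo sketch (hypothetical dimensions, unit-norm components) verifies the single-component identity E[(β_1^⊤u)(β_2^⊤v)(β_3^⊤w) · u ⊗ v ⊗ w] = β_1 ⊗ β_2 ⊗ β_3.

```python
import numpy as np

# Monte Carlo check of the rank-one non-symmetric moment identity.
rng = np.random.default_rng(0)
p1, p2, p3, n = 4, 5, 6, 500_000
b1, b2, b3 = rng.standard_normal(p1), rng.standard_normal(p2), rng.standard_normal(p3)
b1, b2, b3 = b1/np.linalg.norm(b1), b2/np.linalg.norm(b2), b3/np.linalg.norm(b3)
u, v, w = rng.standard_normal((n, p1)), rng.standard_normal((n, p2)), rng.standard_normal((n, p3))

y = (u @ b1) * (v @ b2) * (w @ b3)                    # noiseless cubic sketches
M_hat = np.einsum('i,ij,ik,il->jkl', y, u, v, w) / n   # empirical third moment
target = np.einsum('j,k,l->jkl', b1, b2, b3)
print(np.abs(M_hat - target).max())                    # small Monte Carlo error
```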

2). Proof of Lemma 4:

In this subsection, we provide an alternative and more direct proof for Lemma 4. We consider a similar single component of (A.20) but with a symmetric structure, namely, Ms=E((β*x)3xxx). Based on the symmetry of both underlying tensor and sketchings, we will verify the following three cases:

  • When i = j = k, then
    Msiii=E(βi*xi+miβm*xm)3xi3=E(βi*3xi3+3βi*2xi2(miβm*xm)+3βi*xi(miβm*xm)2+(miβm*xm)3)xi3=15βi*3+9βi*miβm*2=9βi*+6βi*3.
    The last equation is due to ‖β*‖2 = 1.
  • When ijk, then
    Msijk=E(βi*xi+βj*xj+βk*xk)3xixjxk=6βi*βj*βk*.
  • When i = jk, then
    Msiik=E(βi*xi+βk*xk+mi,kβm*xm)3xi2xk=9βi*2βk*+3βk*3+3βk*(mi,kβm*2)=9βi*2βk*+3βk*(miβm*2)=3βk*+6βi*2βk*.

Therefore, it is sufficient to calculate Ms by

Ms=3k=1Kηk*(m=1pβk*emem+emβk*em+ememβk*)+6k=1Kηk*βk*°βk*°βk*.

The first term is a bias term due to correlations among the symmetric sketchings. Denote M_1 = (1/n)∑_{i=1}^n y_i x_i and note that E((1/n)∑_{i=1}^n y_i x_i) = 3∑_{k=1}^K η_k* β_k*. Therefore, the empirical first-order moment M_1 can be used to remove the bias term as follows:

E(Msm=1p(M1emem+emM1em+ememM1))=6k=1Kηk*βk*βk*βk*.

This finishes our proof. ■
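As a numerical illustration of the formula for M^s derived above, the following Monte Carlo sketch (hypothetical size p, unit-norm β) compares the empirical symmetric third moment with the bias-plus-signal decomposition; subtracting the bias (which the paper estimates through the empirical first moment) leaves the 6·β⊗β⊗β signal used for initialization.

```python
import numpy as np

# Monte Carlo check of E[(b'x)^3 x⊗x⊗x]
#   = 3 * sum_m (b⊗e_m⊗e_m + e_m⊗b⊗e_m + e_m⊗e_m⊗b) + 6 b⊗b⊗b   (||b||_2 = 1).
rng = np.random.default_rng(0)
p, n = 4, 1_000_000
b = rng.standard_normal(p); b /= np.linalg.norm(b)
x = rng.standard_normal((n, p))

y = (x @ b) ** 3
M_hat = np.einsum('i,ij,ik,il->jkl', y, x, x, x) / n

I = np.eye(p)
bias = 3 * (np.einsum('j,kl->jkl', b, I) + np.einsum('k,jl->jkl', b, I)
            + np.einsum('l,jk->jkl', b, I))
target = bias + 6 * np.einsum('j,k,l->jkl', b, b, b)
print(np.abs(M_hat - target).max())   # shrinks like 1/sqrt(n)
```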

3). Proof of Lemma 5:

As before, consider a single component first. For notation simplicity, we drop the subscript l for l-th observation and k for k-th component. Since each component is normalized, the entry-wise expectation of (βx)3 xxx can be calculated as

[E(βx)3xxx]0,0,0=3β02β03[E(βx)3xxx]0,0,i=3βi[E(βx)3xxx]0,i,i=6β0βi2+3β0[E(βx)3xxx]0,i,j=6β0βiβj[E(βx)3xxx]i,i,i=6βi3+9βi[E(βx)3xxx]i,i,j=6βi2βj+3βj[E(βx)3xxx]i,j,k=6βiβjβk.

Due to the symmetric structure and the non-randomness of the first coordinate, a bias appears in each entry. For i,j,k ≠ 0, we can use ∑_{m=1}^p (a⊗em⊗em + em⊗a⊗em + em⊗em⊗a) to remove the bias, as in the proof of Lemma 4. For subscripts involving 0, the following two calculations remove the bias:

E(13Ts16(k=1pTs,[k,k,i](p+1)ai))=β02βi.E(12p2(k=1pTs[0,k,k](p+2)Ts[0,0,0]))=β03.

This ends the proof. ■

H. Proof of Lemma 6

Recall that ‖X‖_{ψα} is defined in Definition 1. Without loss of generality, we assume ‖X_i‖_{ψα} = 1 and E X_i = 0 for each i throughout this proof. Let β = (log 2)^{1/α} and Z_i = (|X_i| − β)_+, where (x)_+ = x if x ≥ 0 and (x)_+ = 0 otherwise. For notational simplicity, we write ‖X‖_p = (E|X|^p)^{1/p} for a random variable X. The next step is to estimate the moments of linear combinations of the variables {X_i}_{i=1}^n.

According to the symmetrization inequality (e.g., Proposition 6.3 of [61]), we have

i=1naiXip2i=1naiεiXip=2i=1naiεi|Xi|p, (A.21)

where {εi}i=1n are independent Rademacher random variables and we use the fact that εiXi and εi|Xi| are identically distributed. Moreover, if |Xi| ≥ β, the definition of Zi implies that |Xi| = Zi + β, and if |Xi| < β, we have Zi = 0. Thus |Xi| ≤ Zi + β always holds, which leads to

2i=1naiεi|Xi|p        2i=1naiεi(β+Zi)p. (A.22)

By triangle inequality,

2i=1naiεi(β+Zi)p        2i=1naiεiZip+2i=1naiεiβp. (A.23)

Next, we will bound the second term of the RHS of (A.23). In particular, we will utilize Khinchin-Kahane inequality, whose formal statement is included in Lemma 27 for the sake of completeness. From Lemma 27 we have

i=1naiεiβp        (p121)1/2i=1naiεiβ2        βpi=1naiεi2. (A.24)

Since {εi}i=1n are independent Rademacher random variables, a direct calculation gives

(E(i=1nεiai)2)1/2 (A.25)
=(E(i=1nεi2ai2+21i<jnεiεjaiaj))1/2=(i=1nai2Eεi2+21i<jnaiajEεiEεj)1/2=(i=1nai2)1/2=a2. (A.26)

Combining inequalities (A.22)–(A.25),

2i=1naiεi|Xi|p2i=1naiεiZip+2βpa2. (A.27)

Let {Y_i}_{i=1}^n be independent symmetric random variables satisfying ℙ(|Y_i| ≥ t) = exp(−t^α) for all t ≥ 0. Then we have

(Zit)(|Xi|t+β)=(exp(|Xi|α)exp((t+β)α))E(exp(|Xi|)α)exp((t+β)α)2exp((t+β)α)2exp(tαβα)=(|Yi|t),

which implies

i=1naiεiZip    i=1naiεiYip  =  i=1naiYip, (A.28)

since εiYi and Yi have the same distribution due to symmetry. Combining (A.27) and (A.28) together, we reach

i=1naiXip2βpa2+2i=1naiYip. (A.29)

For 0 < α < 1, it follows Lemma 25 that

i=1naiYipC1(α)(pa2+p1/αa), (A.30)

where C1(α) is some absolute constant only depending on α.

For α ≥ 1, we combine Lemma 24 with integration by parts to pass from a tail bound to a moment bound. Recall that for every non-negative random variable X, integration by parts yields the identity

EX=0(Xt)dt.

Applying this to X = |∑_{i=1}^n a_i Y_i|^p and making the change of variable t ↦ t^p, we have

E|i=1naiYi|p=0(|i=1naiYi|t)ptp1dt02exp (cmin(t2a22,tαaα*α))ptp1dt, (A.31)

where the inequality follows from Lemma 24 for all p ≥ 2, with α* the conjugate exponent satisfying 1/α + 1/α* = 1. In the following, we bound the integral in three steps:

  1. If t2a22tαaα*α, (A.31) reduces to
    E|i=1naiYi|p2p0exp(ct2a22))tp1dt.
    Letting t=ct2/a22, we have
    2p0exp(ct2a22))tp1dt=pa2pcp/20ettp/21dt=pa2pcp/2Γ(p2)pa2pcp/2(p2)p/2,
    where the second equation is from the density of Gamma random variable. Thus,
    (E|i=1naiYi|p)1pp1/p(2c)1/2pa22cpa2. (A.32)
  2. If t2a22>tαaα*α, (A.31) reduces to
    E|i=1naiYi|p2p0exp(ctαaα*α))tp1dt.
    Letting t=ctα/aα*α, we have
    2p0exp(ctαaα*α))tp1dt=2paα*pαcp/α0ettp/α1dt=2αpaα*pcp/αΓ(pα)2paα*pαp/α(pα)p/α.
    Thus,
    (E|i=1naiYi|p)1p2p1/p(cα)1/αp1/αaα*4(cα)1/αp1/αaα*. (A.33)
  3. Overall, we have the following by combining (A.32) and (A.33),
    (E|i=1naiYi|p)1pmax(2c,4(cα)1/α)(pa2+p1/αaα*).
    After denoting C2(α)=max(2c,4(cα)1/α), we reach
    i=1naiYipC2(α)(pa2+p1/αaα*). (A.34)

Since 0 < β < 1, the conclusion can be reached by combining (A.29),(A.30) and (A.34). ■

I. Proof of Lemma 9

Firstly, let us consider the non-symmetric perturbation error analysis. According to Lemma 3, the exact form of E=TE(T) is given by

E=1ni=1nyiuiviwik=1Kηk*β1k*β2k*β3k*.

We decompose it by a concentration term (E1) and a noise term (E2) as follows,

E=E1+E2, (A.35)

where

E1=1ni=1nuiviwi,k=1Kηk*β1k*β2k*β3k*uiviwik=1Kηk*β1k*β2k*β3k*.
E2=1ni=1nϵiuiviwi.

Bounding E1: For the k-th component of E1, we denote

E1k=1ni=1nuiviwi,β1k*β2k*β3k*uiviwiβ1k*β2k*β3k*.

Define

δn,p,s=(log n)3(s3 log3(p/s)n2+s log(p/s)n).

By using Lemma 2 and sdCs, it suffices to have for some absolute constant C11,

E1ks+dC11δn,p,s,

with probability at least 1 − 10/n3, where ‖ · ‖s+d is the sparse tensor spectral norm defined in (II.3). Equipped with the triangle inequality, the sparse tensor spectral norm for E1 can be bounded by

E1s+dC11δn,p,sk=1Kηk*, (A.36)

with probability at least 1 − 10K/n3.

Bounding E2: Note that the random noise {ϵi}i=1n is independent of sketching vector {ui,vi,wi}. For fixed {ϵi}i=1n, applying Lemma 20, we have for some absolute constant C12

1ni=1nϵiuiviwis+dC12ϵC11δn,p,s,

with probability at least 1−1/p. According to Lemma 23, we have

ℙ( ‖E_2‖_{s+d} ≥ C_{12} σ log n · δ_{n,p,s} ) ≤ 1/p + 3/n ≤ 4/n. (A.37)

Bounding E: Putting (A.36) and (A.37) together, we obtain

Es+d(C11k=1Kηk*+C12σ log n)δn,p,s,

with probability at least 1 − 5/n. Under Condition 9, we have

Es+d2C1k=1Kηk*δn,p,s log n,

with probability at least 1 − 5/n.

The perturbation error analysis for the symmetric tensor estimation model and the interaction effect model is similar since the empirical first-order moment converges much faster than the empirical third-order moment. So we omit the detailed proof here. ■

J. Proof of Lemma 11

Lemma 11 quantifies the effect of one thresholded gradient update. The proof consists of two parts.

First, we evaluate an oracle estimator {β˜k(t+1)}k=1K with known support information, which is defined as

β˜k(t+1)=φμϕh(βk(t))(βk(t)μϕkL(βk(t))F(t)). (A.38)

Here,

  • h(βk(t)) is the k-th component of h(B(t)) defined in (III-B)

  • BL(B)=(1L(β1),,KL(βK)).

  • F(t)=k=1KFk(t), where Fk(t)=supp(βk*)supp(βk(t)).

  • For a vector x ∈ ℝp and a subset A ⊂ {1,…,p}, we denote by xA ∈ ℝp the vector that keeps the coordinates of x with indices in A and sets all other coordinates to zero.

We will show that β˜k(t+1) enjoys a geometric contraction of the optimization error together with an optimal statistical error rate; see Lemma 13 for details.
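For concreteness, here is a minimal numpy sketch of the oracle update (A.38), assuming a hard-thresholding rule for the operator φ (the exact form of φ is given in (III.8)); the function names and arguments are illustrative placeholders.

```python
import numpy as np

def hard_threshold(v, tau):
    """phi_tau(v): zero out entries with magnitude at most tau (assumed form of phi)."""
    return np.where(np.abs(v) > tau, v, 0.0)

def oracle_update(beta_k, grad_F, h_k, mu, phi):
    """One oracle step: beta_k^{(t+1)} = phi_{(mu/phi)*h}(beta_k - (mu/phi)*grad restricted to F).

    grad_F stands for the gradient already restricted to the support set F.
    """
    step = beta_k - (mu / phi) * grad_F
    return hard_threshold(step, (mu / phi) * h_k)
```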

Second, we prove that β˜k(t+1) and βk(t+1) are equal with high probability; see Lemma 14 for details. For simplicity, we drop the superscript of βk(t) and F(t) in the following proof, and denote β˜k(t+1), βk(t+1) and F(t+1) by β˜k+, βk+ and F+, respectively.

Lemma 13. Suppose Conditions 1–5 hold. Assume (A.13) is satisfied and |F| ≲ Ks. As long as the step size μ ≤ 32R−20/3/(3K[220 + 270K]2), we obtain the upper bound for {β˜k+},

k=1Kηk3β˜k+ηk*3βk*22(132μR83K2)k=1Kηk3βkηk*3βk*22+2C3μ2R83ηmin*43σ2K2s log pn, (A.39)

with probability at least 1 − (21K2 + 11K + 4Ks)/n.

The proof of Lemma 13 is postponed to Section L. The next lemma guarantees that, with high probability, {βk+}k=1K coincides with the oracle update {β˜k+}k=1K.

Lemma 14. Recall that the truncation level h(βk) is defined as

h(βk)=4 log npn*i=1n(k=1Kηk(xiβk)3yi)2(ηk(xiβk)2)2. (A.40)

If |F| ≲ Ks, we have βk+=β˜k+ for any k ∈ [K] with probability at least 1 − (n2p)−1 and F+F.

The proof of Lemma 14 is postponed to Section M. By Lemma 14 and induction, we have

F(t+1)F(1)F(0)=k=1Ksupp(βk*)supp(βk(0)).

This implies that |F(t)| ≲ Ks for every t. Combining Lemmas 13 and 14, we obtain, with probability at least 1 − (21K2 + 11K + 4Ks)/n,

k=1Kηk3βk+ηk*3βk*22(132μK2R83)k=1Kηk3βkηk*3βk*22+2C3μ2R83ηmin*43σ2K2s log pn, (A.41)

This ends the proof. ■

K. Proof of Lemma 12

Based on the CP low-rank structure of true tensor parameter J*, we can explicitly write down the distance between J and J* under tensor Frobenius norm as follows

JJ*F2=i1,i2,i3(k=1nηkβki1βki2βki3k=1nηk*βki1*βki2*βki3*)2.

For notation simplicity, denote β¯k=ηk3βk,β¯k*=ηk*3βk*. Then

JJ*F2=i1,i2,i3(k=1Kβ¯ki1β¯ki2β¯ki3k=1Kβ¯ki1*β¯ki2*β¯ki3*)2=i1,i2,i3(k=1K(β¯ki1β¯ki1*)β¯ki2*β¯ki3*+k=1Kβ¯ki1(β¯ki2β¯ki2*)β¯ki3*+k=1Kβ¯ki1β¯ki2(β¯ki3β¯ki3*)+k=1Kβ¯ki1β¯ki2(β¯ki3β¯ki3*))2=RHS.

Since (a + b + c)2 ≤ 3(a2 + b2 + c2), we have

RHS3i1,i2,i3[(k=1K(β¯ki1β¯ki1*)β¯ki2*β¯ki3*)2+(k=1Kβ¯ki1(β¯ki2β¯ki2*)β¯ki3*)2+(k=1Kβ¯ki1β¯ki2(β¯ki3β¯ki3*))2].

Equipped with Cauchy-Schwarz inequality, RHS can be further bounded by

RHS3i1,i2,i3[k=1K(β¯ki1β¯ki1*)2k=1Kβ¯ki2*2β¯ki3*2+k=1K(β¯ki2β¯ki2*)2k=1Kβ¯ki12β¯ki3*2+k=1K(β¯ki3β¯ki3*)2k=1Kβ¯ki22β¯ki12]

At the same time, using ηk ≤ (1 + c)ηk* for k ∈ [K],

JJ*F23[i1=1pk=1K(β¯ki1β¯ki1*)2(i2=1pi3=1pk=1Kβ¯ki2*2β¯ki3*2)+i2=1pk=1K(β¯ki2β¯ki2*)2(i1=1pi3=1pk=1Kβ¯ki12β¯ki3*2)+i3=1pk=1K(β¯ki3β¯ki3*)2(i2=1pi1=1pk=1Kβ¯ki22β¯ki12)]=3(k=1Kβ¯kβ¯k*22)(k=1K(ηk*3)4+k=1K(ηk*3)2(ηk3)2+k=1K(ηk3)4)9(1+c)(k=1Kβ¯kβ¯k*22)(k=1K(ηk*3)4).

For the non-symmetric tensor estimation model, we have

JJ*F2=i1,i2,i3(k=1Kηkβ1ki1β2ki2β3ki3k=1Kηk*β1ki1*β2ki2*β3ki3*)2.

Following the same strategy above, we obtain

JJ*F23(1+c)(k=1Kβ¯1kβ¯1k*22+k=1Kβ¯2kβ¯2k*22+k=1Kβ¯3kβ¯3k*22)(k=1K(ηk*3)4).

This ends the proof. ■
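As a quick numerical check of Lemma 12, the following sketch (hypothetical sizes) builds two randomly perturbed symmetric CP tensors with unit-norm factors and compares the two sides of the bound; the value of c is taken as the largest weight perturbation, so the lemma's condition holds by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 6, 3
B_star = rng.standard_normal((K, p)); B_star /= np.linalg.norm(B_star, axis=1, keepdims=True)
eta_star = rng.uniform(1.0, 2.0, K)
eta = eta_star * (1 + 0.1 * rng.uniform(-1, 1, K))
B = B_star + 0.05 * rng.standard_normal((K, p)); B /= np.linalg.norm(B, axis=1, keepdims=True)

def cp_sym(etas, Bs):                      # sum_k eta_k * beta_k ⊗ beta_k ⊗ beta_k
    return np.einsum('k,kj,kl,km->jlm', etas, Bs, Bs, Bs)

lhs = np.sum((cp_sym(eta, B) - cp_sym(eta_star, B_star)) ** 2)      # ||J - J*||_F^2
c = np.max(np.abs(eta - eta_star))
diff = np.sum((eta[:, None] ** (1/3) * B - eta_star[:, None] ** (1/3) * B_star) ** 2)
rhs = 9 * (1 + c) * diff * np.sum(eta_star ** (4/3))
print(lhs <= rhs, lhs, rhs)                # the Frobenius gap is controlled by the factor gap
```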

L. Proof of Lemma 13

First, we state a lemma to illustrate the effect of the weight ϕ. The proof of Lemma 15 is deferred to Section N.

Lemma 15. Consider {yi}i=1n come from either non-symmetric tensor estimation model (VI.1) or symmetric tensor estimation model (III.1). Suppose Conditions 3–5 hold. Then ϕ=1ni=1nyi2 is upper and lower bounded by

(166Γ39Γ)(k=1Kηk*)21ni=1nyi2(16+6Γ3+9Γ)(k=1Kηk*)2,

with probability at least 1 − (K2 + K + 3)/n, where Γ is the incoherence parameter defined in Definition 3.

According to Lemma 15, (1/n)∑_{i=1}^n y_i² approximates (∑_{k=1}^K η_k*)² up to constants with high probability. Moreover, we know from (A.13) that max_k |η_k − η_k*| ≤ ε_0 for some small ε_0. Based on these two facts, we replace η_k by η_k* and ϕ by (∑_{k=1}^K η_k*)² in what follows; this only changes the final results by constant factors. A similar simplification was used in the matrix recovery scenario [62]. Therefore, we define the weighted estimator and the weighted true parameter as β̄_k = (η_k*)^{1/3} β_k and β̄_k* = (η_k*)^{1/3} β_k*, so that η_k* β_k⊗β_k⊗β_k = β̄_k⊗β̄_k⊗β̄_k. Recall that L is the loss function defined in (III.4). Correspondingly, with a slight abuse of notation, define the gradient function ∇_kL(β̄_k) restricted to F as

kL(β¯k)F=6ηk*3ni=1n(k=1K(xiFβ¯k)3yi)(xiFβ¯k)2xiF,

and its noiseless version as

kL(β¯k)F=6ηk*3ni=1n(k=1K(xiFβ¯k)3k=1K(xiFβ¯k*)3)(xiFβ¯k)2xiF. (A.42)

According to the definition of thresholding function (III.8), β˜k+ can be written as

β˜k+=βkμϕkL(β¯k)F+μϕh(β¯k)γk,

where γkp satisfies supp(γk)F,γk1 and h(β¯k) is defined as

h(β¯k)=4 log(np)n*i=1n(k=1K(xiFβ¯k)3yi)2ηk*23(xiFβ¯k)2. (A.43)

Moreover, we denote z_k = β̄_k − β̄_k*. With a slight abuse of notation, we also drop the subscript F in this section for notational simplicity.

We expand and decompose the sum of square error by three parts as follows:

k=1Kηk*3β˜k+ηk*3βk*22=k=1Kzkμηk*3ϕkL(β¯k)+μηk*3ϕh(β¯k)γk22=k=1Kzkμηk*3ϕkL(β¯k)22A: gradient update effect +k=1Kμηk*3ϕh(β¯k)γk22B: threshoding effect +k=1Kzkμηk*3ϕkL(β¯k),μηk3ϕh(β¯k)γk.C:  cross term  (A.44)

In the following proof, we will bound three parts sequentially.

1). Bounding gradient update effect:

In order to separate the optimization error and statistical error, we use the noiseless gradient kL˜(β¯k) as a bridge such that A can be decomposed as

A=k=1Kzk222μk=1Kηk*3ϕkL(β¯k),zk+μ2k=1Kηk*3ϕkL(β¯k)22k=1Kzk222μk=1Kηk*3ϕkL˜(β¯k),zkA1+2μ2k=1Kηk*3ϕkL˜(β¯k)22A2+2μ2k=1Kηk*3ϕ(kL˜(β¯k)kL(β¯k))22A3+2μk=1Kzk,ηk*3ϕ(kL˜(β¯k)kL(β¯k)),A4 (A.45)

where A1 and A2 quantify the optimization error, A3 quantifies the statistical error, and A4 is a cross term that is negligible compared with the statistical error. The lower bound for A1 and the upper bound for A2 together parallel the verification of regularity conditions in the matrix recovery case [52].

Step One: Lower bound for A1.

Plugging in ϕ=(k=1Kηk*)2, we have

K2R23ηmax*43(ηk*3)2ϕ=(ηk*3)2(k=1Kηk*)2K2R23ηmax*43. (A.46)

According to the definition of noiseless gradient kL˜(β˜k) and zk, A1 can be expanded and decomposed sequentially by nine terms,

A1K2R23ηmax*43[6ni=1n(k=1K3(xizk)(xiβ¯k)2k=1K(xizk)(xiβ¯k*)2)A11+6ni=1n(k=1K3(xizk)(xiβ¯k)2k=1K2(xizk)2(xiβ¯k*))A12+6ni=1n(k=1K3(xizk)(xiβ¯k)2k=1K(xizk)3)A13+6ni=1n(k=1K3(xizk)2(xiβ¯k)k=1K(xizk)(xiβ¯k*)2)A14+6ni=1n(k=1K3(xizk)2(xiβ¯k)k=1K2(xizk)2(xiβ¯k*))A15 (A.47)
+6ni=1n(k=1K3(xizk)2(xiβ¯k)k=1K(xizk)2(xiβ¯k*))A16+6ni=1n(k=1K3(xizk)3k=1K(xizk)(xiβ¯k*)2)A17+6ni=1n(k=1K3(xizk)3k=1K2(xizk)2(xiβ¯k*))A18+6ni=1n(k=1K3(xizk)3k=1K(xizk)3)]A19, (A.48)

where A11 is the main term according to the order of β¯k*, while A12 to A19 are remainder terms. The proof of lower bound for A11 to A19 follows two steps:

  1. Calculate and lower bound the expectation of each term through Lemma A.2: high-order Gaussian moment;

  2. Argue that the empirical version is concentrated around their expectation with high probability through Lemma 1: high-order concentration inequality.

Bounding A11. Note that A11 involves products of dependent Gaussian vectors, which complicates both the calculation of expectations and the application of concentration inequalities. According to the high-order Gaussian moment results in Lemma A.2, the expectation of A11 can be calculated explicitly as

E(A11)=36k=1Kk=1K(β¯k*β¯k*)2(zkzk)I1+72k=1Kk=1K(β¯k*β¯k*)(zkβ¯k*)(zkβ¯k*)I2+108k=1Kk=1K(zkβ¯k)(zkβ¯k*)(zkβ¯k*)I3+54k=1Kk=1K(β¯k*β¯k*)(zkβ¯k*)(zkzk)I4. (A.49)

Note that I1 to I4 each involve a summation of K² terms. To use the incoherence Condition 3, we isolate the K terms with k = k′. Then I1 to I4 can be lower bounded as

I136ηmin*4/3[k=1Kzk22Γ2(k=1Kzk2)2]I272ηmin*4/3[k=1K(zkβ¯k*)2Γ(k=1Kzk2)2]I3108ηmin*4/3[k=1K(zkβ¯k*)2Γ(k=1Kzk2)2]I454ηmin*4/3k=1Kzk220,

where Γ is the incoherence parameter. Putting the above four bounds together, they jointly provide

E(A11)36ηmin*4/3k=1Kzk22(36ηmin*4/3Γ2+180ηmin*4/3Γ)(k=1Kzk2)2. (A.50)

On the other hand, repeatedly using Lemma 1, we obtain that with probability at least 1 − 1/n,

|1ni=1n((xizk)(xiβ¯k*)2(xizk)(xiβ¯k*)2E(xizk)(xiβ¯k*)2(xizk)(xiβ¯k*)2)|C(log n)3n(ηmax*3)4zk2zk2.

Taking the summation over k,k′ ∈ [K], it could further imply that for some absolute constant C,

|A11E(A11)|18C(log n)3n(ηmax*3)4(k=1Kzk2)2, (A.51)

with probability at least 1 − K²/n. Combining (A.50) and (A.51), we obtain, with probability at least 1 − K²/n,

K2R23ηmax*43A11[36K2R83K32(216R83Γ+18C(log n)3n)]k=1Kzk22, (A.52)

where R = ηmax*/ηmin*. Here we use the facts that Γ ≤ 1 and (∑_{k=1}^K ‖z_k‖_2)² ≤ K ∑_{k=1}^K ‖z_k‖_2².

Bounding A12 to A19: For remainder terms, we follow the same proof strategy. According to Lemma A.2, the expectation of A12 can be calculated as

E(A12)=36k=1Kk=1K(zkβ¯k*)2(zkβ¯k*)I1+72k=1Kk=1K(zkβ¯k*)(β¯k*β¯k*)(zkzk)I2+108k=1Kk=1K(zkβ¯k)(zkβ¯k*)(zkβ¯k*)I3+54k=1Kk=1K(β¯k*β¯k*)(zkβ¯k*)(zkzk)I4.

Let us analyze I1 first. Under (A.13), ‖z_k‖_2 ≤ ε_0 (η_k*)^{1/3}, and hence

k=1Kk=1K(zkβ¯k)2(zkβ¯k*)k=1Kk=1Kzk22β¯k*22zk2β¯k*2ηmax*43ε0(k=1Kzk2)2.

This immediately implies a lower bound for E(A12) after we bound similarly for I2,I3 and I4,

E(A12)270ηmax*43ε0(k=1Kzk2)2. (A.53)

By Lemma 1, we obtain for some absolute constant C,

K2R23ηmax*34A12K2R23ηmax*34[E(A12)18Cηmax*34ε0(k=1Kzk2)2(log n)3n]K1R23ε0(270+18C(log n)3n)(k=1Kzk22), (A.54)

with probability at least 1 − K²/n. The detailed derivation is the same as for (A.52), so we omit it here.

Similarly, the lower bounds of A13 to A19 can be derived as follows

K12ηmax*34A14K12ε0(270+18C(log n)3n)(k=1Kzk22)K12ηmax*34A13,A15,A17K12ε02(270+18C(log n)3n)(k=1Kzk22)K12ηmax*34A16,A18K12ε03(270+18C(log n)3n)(k=1Kzk22)K12ηmax*34A19K12ε04(270+18C(log n)3n)(k=1Kzk22). (A.55)

Putting (A.52), (A.54) and (A.55) together, we have with probability at least 1 − 9K2/n,

A1[36K2R83K32(2160R33Γ+18C(log n)3n)8ε0K1R23(270+18C(log n)3n)](k=1Kzk22).

For the above bound,

  • When the sample size satisfies
    n(18CK1/2R8/3(log n)3)2,
    we have
    max{18K32C(log n)3n,8ε0K1R2318C(log n)3n}K2R83.
  • When ε0K−1R−2/2160, we have
    8ε0K1R23270K2R83.
  • When the incoherence parameter satisfies Γ ≤ K−1/2/216, we have
    K322160R83ΓK2R83.

Note that the above conditions are fulfilled under Conditions 3, 5 and (A.13). Thus we can simplify the bound on A1 to

A132K2R83(k=1Kzk22), (A.56)

with probability at least 1 − 9K2/n.

Step Two: Upper bound for A2.

We observe the fact that

A2=k=1K1ϕηk*3kL˜(β¯k)22=supwSKs1|k=1Rηk*3ϕkL˜(β¯k),w|2, (A.57)

where S is a unit sphere. It is equivalent to show for any wSKs1,A2=|k=1Kηk*3ϕkL˜(β¯k),w| is upper bounded. According to the definition of noiseless gradient (A.42), A2 is explicitly written as

A2=6ni=1n(k=1K(xiβ¯k)3k=1K(xiβ¯k*)3)*(k=1K(ηk*3)2ϕ(xiβ¯k)2(xiw)).

Following (A.46) and (A.48), a similar decomposition can be made for A2 as follows, where the only difference is that one xi⊤zk is replaced by xi⊤w.

A2K2R23ηmin*43[6ni=1n(k=1K3(xizk)(xiβ¯k)2k=1K(xiw)(xiβ¯k*)2)A21+6ni=1n(k=1K3(xizk)(xiβ¯k)2k=1K2(xizk)(xiw)(xiβ¯k*))A22+6ni=1n(k=1K3(xizk)(xiβ¯k)2k=1K(xizk)2(xiw))A23+6ni=1n(k=1K3(xizk)2(xiβ¯k)k=1K(xiw)(xiβ¯k*)2)A24+6ni=1n(k=1K3(xizk)2(xiβ¯k)k=1K2(xizk)(xiw)(xiβ¯k*))A25+6ni=1n(k1K3(xizk)2(xiβ¯k)k=1K(xizk)(xiw)(xiβ¯k*))A26+6ni=1n(k=1K3(xizk)3k=1K(xiw)(xiβ¯k*)2)A27+6ni=1n(k=1K3(xizk)3k=1K2(xizk)(xiw)(xiβ¯k*))A28+6ni=1n(k=1K3(xizk)3k=1K(xizk)2(xiw))].A29

Let’s bound A21 first. By using the same technique when calculating E(A11) in (A.49), we derive an upper bound for E(A21).

E(A21)36ηmax*43(k=1Kzk2+(K1)k=1KΓzk2)+180ηmax*43(k=1Kzk2+(K1)k=1KΓzk2)+54ηmax*43(Kk=1Kzk2).

Equipped with Lemma 2 and the definition of tensor spectral norm (II.3), it suffices to bound A21 by

R23ηmin*34K12A21K2R2[216+54K+216KΓ+18CKδn,p,s](k=1Kzk2)

with probability at least 1−10K2/n3, where δn,p,s is defined in (IV.7).

The upper bounds for A22 to A29 follow similar forms. Combining them together, we can derive an upper bound for A2 as follows

A2K2R2[216+270K+18CKδn,p,s](k=1Kzk2)K2R2[220+270K](k=1Kzk2),

with probability at least 1 − 90K2/n3, where the second inequality utilizes Condition 5. Therefore, the upper bound of A2 is given as follows

A2K1R4[220+270K]2(k=1Kzk22), (A.58)

with probability at least 1 − 90K2/n3.

Step Three: Upper bound for A3.

By the definition of noisy gradient and noiseless gradient, A3 is explicitly written as

A3=k=1K(ηk*3)2ϕ6ni=1nϵi(xiβ¯k)2xi22K4R43ηmin*83k=1K(Ksmaxj6ni=1nϵi(xiβ¯k)2xij)2,

where the second inequality comes from (A.46). For fixed {ϵi}i=1n, applying Lemma 1, we have

|i=1nϵi(xiβ¯k)2xijE(i=1nϵi(xiβ¯k)2xij)|C(log n)32ϵ2β¯k22,

with probability at least 1 − 1/n. Together with Lemma 23, we obtain for any j ∈ [Ks],

|6ni=1nϵi(xiβ¯k)2xij|6CC0σβ¯k22(log n)3/2n,

with probability at least 1 − 4/n, where σ is the noise level. According to (A.13),

β¯kβ¯k*22k=1Kβ¯kβ¯k*22Kηmax*23ε02,

which further implies β¯k22(1+K12ε0)2ηmax*23. Equipped with union bound over j ∈ [Ks],

maxj[Ks]|6ni=1nϵi(xiβ¯k)2xij|6CC0σ(1+K12ε0)2(ηmax*3)2(log n)3/2n,6CC0σ(1+K12ε0)2(ηmax*3)2(log n)3/2n,

with probability at least 1 − 4Ks/n. Letting C=6C0(Ce)2/3(1+K12ε0)2,

A3Cηmin*43R83σ2K2s(log n)3n, (A.59)

with probability at least 1 − 4Ks/n.

Step Four: Upper bound for A4.

This cross term can be written as

A4=2k=1Kμϕ(ηk*3)2(1ni=1nϵi(xiβ¯k)2(xizk)).

To bound this term, we take the same step in Step Three which fixes the noise term {ϵi}i=1n first. Similarly, we obtain with probability at least 1 − 4K/n,

A42Cσ(log n)32nK1R43ηmin*23. (A.60)

This term is negligible in terms of the order when comparing with (A.59).

Summary. Putting the bounds (A.56), (A.58), (A.59) and (A.60) together, we achieve an upper bound for gradient update effect as follows,

A(164μK2R83+2μ2K1R4[220+270K]2)k=1Kzk22+4μCK2ηmin*43R83σ2s(log n)3n, (A.61)

with probability at least 1 − (18K2 + 4K + 4Ks)/n. ■

2). Bounding thresholding effect:

The thresholding effect term in (A.44) can also be decomposed into optimization error and statistical error. Recall that B can be explicitly written as

B=k=1Kμηk*23ϕ4log(np)n*i=1n(k=1K(xiβ¯k)3yi)2(xiβ¯k)4γk22,

where supp(γk) ⊂ Fk and ‖γk ≤ 1. By using (a + b)2 ≤ 2(a2 + b2), we have

Bμ264K s log pn(B1+B2),

where

B1=1ni=1n(k=1K(xiβ¯k)3k=1K(xiβ¯k*)3)(k=1Kηk*43ϕ2(xiβ¯k)4)B2=1ni=1nϵi2k=1Kηk*43ϕ2(xiβ¯k)4.

Bounding B1. This optimization error term shares a similar structure with (A.57) but with higher order, so we follow the same idea as in bounding (A.57). Following (A.46) and some basic expansions and inequalities,

B1K2R43ηmin*831n(k=1K(xiβ¯k)3k=1K(xiβ¯k*)3)(k=1K(xiβ¯k)4)K2R43ηmin*83[1ni=1n(k=1K3K(xizk)6+9K(xizk)4(xiβ¯k*)2+9K(xizk)2(xiβ¯k*)4)k=1K(xiβ¯k)4].

The main term is (xizk)2(xiβ¯k*)4 according to the order of β¯k*. We bound the main term first. Note that there exists some positive large constant C such that

E(1ni=1n(xizk)2(xiβ¯k*)4(xiβ¯k)4)Czk22β¯k*24β¯k24.

Together with Lemma 1 and (A.13), we have

k=1Kk=1K(1ni=1n(xizk)2(xiβ¯k*)4(xiβ¯k)4)C(1+(log n)5n)K2ηmax*83(1+K12ε0)4k=1Kzk22.

with probability at least 1 − 3K2/n. Overall, the upper bound of B1 takes the form

B1K2R43ηmin*83[18C(1+(log n)5n)*K2ηmin*83(1+K12ε0)4k=1Kzk22]R418C(1+(log n)5n)(1+K12ε0)4k=1Kzk22, (A.62)

with probability at least 1 − 3K2/n.

Bounding B2. We rewrite B2 by

B2=k=1Kηk*43ϕ2(1ni=1nϵi2(xiβ¯k)4).

For fixed {ϵi}i=1n, accordingly to Lemma 1, we have

|i=1nϵi2(xiβ¯k)4E(i=1nϵi2(xiβ¯k)4)|        C(log n)2ϵ22β¯k24.

Note that E((xiβ¯k)4)=3β¯k24. It will reduce to

1ni=1nϵi2(xiβ¯k)4(3ni=1nϵi2+C(log n)2nϵ22)β¯k24.

From Lemma 23, with probability at least 1 − 3/n,

|1ni=1nϵi2|C0σ2,1nϵ22C0σ2n.

Combining the above two inequalities, we obtain

|1ni=1nϵi2(xiβ¯k)4|6C0σ2β¯k24, (A.63)

with high probability. Plugging in the definition of ϕ and (A.13), B2 is upper bounded by

B26C0σ2(1+K12ε0)4ηmin*43R83K3, (A.64)

with probability at least 1 − 7K/n.

Summary. Putting the bounds (A.62) and (A.64) together, we have similar upper bound for thresholded effect,

BC2μ2R4k=1Kzk22+C3μ2ηmin*43R83K2σ2 s log pn, (A.65)

with probability at least 1 − (3K2 + 7K)/n. ■

3). Ensemble:

From the definition of γk, it is not hard to see that the cross term C equals zero. Combining the upper bounds on the gradient update effect (A.61) and the thresholding effect (A.65), we obtain

k=1Kηk3β˜k+ηk*3βk*22(164μK2R83+3μ2K1R4[220+270K]2)(k=1Kzk22)++2C3μ2R83ηmin*43σ2K2 s log pn.

As long as the step size μ satisfies

0<μ32R20/33K[220+270K]2,

we reach the conclusion

k=1Kηk3β˜k+ηk*3β˜k*22(132μK2R83)k=1Kηk3βkηk*3βk*22+2C3μ2R83ηmin*43σ2K2s log pn, (A.66)

with probability at least 1 − 4Ks/n.

M. Proof of Lemma 14

Let us consider k-th component first. Without loss of generality, suppose F ⊂ {1,2,…,Ks}. For j = Ks + 1,…,p,

βkjL(βk)=2ni=1n(k=1Kηk(xiβk)3yi)ηk(xiβk)2xij, (A.67)

and it’s not hard to see the independence between {xiβk,yi} and xij. Applying standard Hoeffding’s inequality, we have with probability at least 11n2p2,

|βkjL(βk)|4 log(np)n*i=1n(k=1Kηk(xiβk)3yi)2(ηk(xiβk))2=h(βk).

Equipped with union bound, with probability at least 11n2p,

maxKs+1jp|βkjL(βk)|h(βk).

Therefore, according to the definition of thresholding function φ(x), we obtain the following equivalence,

φμϕh(βk)(βkμϕβkL(βk))=φμϕh(βk)(βkμϕβkL(βk)F), (A.68)

holds for every k ∈ [K] with probability at least 1 − 1/(n²p). (A.68) also shows that supp(βk+) ⊂ F for every k ∈ [K], which further implies F+ ⊂ F. This ends the proof. ■

N. Proof of Lemma 15

First, we consider the symmetric case. According to the definition of {yi}i=1n in the symmetric tensor estimation model (III.1), we separate out the random noise ϵi via the following expansion:

1ni=1nyi2=1ni=1n[k=1Kηk*(xiβk*)3+ϵi]2=1ni=1n(k=1Kηk*(xiβk*)3)2I1+2ni=1nϵik=1Kηk*(xiβk*)3I2+1ni=1nϵi2I3. (A.69)

Bounding I1. We expand i-th component of I1 as follows.

(k=1Kηk*(xiβk*)3)2=k=1Kηk*(xiβk*)6+2ki<kjηki*ηkj*(xiβki*)3(xiβkj*)3. (A.70)

As shown in Lemma A.2, the expectations of the above two parts take the forms

E( (x_i⊤β_{k_i}*)³ (x_i⊤β_{k_j}*)³ ) = 6(β_{k_i}*⊤β_{k_j}*)³ + 9(β_{k_i}*⊤β_{k_j}*)‖β_{k_i}*‖_2²‖β_{k_j}*‖_2²,   E( (x_i⊤β_k*)⁶ ) = 15‖β_k*‖_2⁶.

Recall that βk*2=1 for any k ∈ [K] and Condition 3 implies for any kikj, |βki*βkj*|Γ, where Γ is the incoherence parameter. Thus, E(xiβki*)3(xiβkj*)3 is upper bounded by

|E(xiβki*)3(xiβkj*)3|6Γ3+9Γ, for any kikj. (A.71)

By using the concentration result in Lemma 1, we have with probability at least 1 − 1/n

|1ni=1n(xiβk*)6E(1ni=1n(xiβk*)6)|C1(log n)3n,|1ni=1n(xiβki*)3(xiβkj*)3E(1ni=1n(xiβki*)3(xiβkj*)3)|C1(log n)3n. (A.72)

Putting (A.70),(A.71) and (A.72) together, this essentially provides an upper bound for I1, namely

1ni=1n(k=1Kηk*(xiβk*)3)2(15+6Γ3+9Γ+2C1(log n)3n)(k=1Kηk*)2, (A.73)

with probability at least 1 − K2/n.

Bounding I2. Since the random noise {ϵi}i=1n is of mean zero and independent of {xi}, we have

E(ϵik=1Kηk*(xiβk*)3)=0.

By using the independence and Corollary 1, we have

(1ni=1nϵi(xiβk*)3C2(log n)32nnσ)(1ni=1nϵi(xiβk*)3C2σ(log n)32n|ϵ2C0σn)+(ϵ2C0nσ)1n+3n=4n.

This further implies that

1ni=1nk=1Kηk*(xiβk*)3ϵi(k=1Kηk*)C2(log n)32nσ, (A.74)

with probability at least 1 − 4K/n.

Bounding I3. As shown in Lemma 23, the random noise ϵi with sub-exponential tail satisfies

1ni=1nϵi2C3σ2. (A.75)

with probability at least 1 − 3/n.

Overall, putting (A.73), (A.74) and (A.75) together, we have with probability at least 1 − (K2 + 4K + 3)/n,

1ni=1nyi2(k=1Kηk*)215+6Γ3+9Γ+2C1(log n)3n+2C2σ(k=1Kηk*)(log n)32n+C3σ2(k=1Kηk*)2.

Under Conditions 4 & 5, the above bound reduces to

1ni=1nyi2(16+6Γ3+9Γ)(k=1Kηk*)2

with probability at least 1 − (K2 + 4K + 3)/n. The proof of lower bound is similar, and hence is omitted here.

Similar results hold for the non-symmetric tensor estimation model; the only difference throughout the proof is that

E(uiβ1k*)2(viβ2k*)2(wiβ3k*)2=1.

O. Non-symmetric Tensor Estimation

1). Conditions and Algorithm:

In this subsection, we provide several essential conditions for Theorem 5 and the detailed algorithm for non-symmetric tensor estimation.

Condition 6 (Uniqueness of CP-decomposition). The CP-decomposition form (VI.2) is unique in the sense that if there exists another CP-decomposition J*=k=1Kηk*β1k*β2k*β3k*, it must have K = K′ and be invariant up to a permutation of {1,…,K}.

Condition 7 (Parameter space). The CP-decomposition of J*=k=1Kηk*β1k*β2k*β3k* satisfies

J*opC1ηmax*,     K=O(s),     and     R=ηmax*/ηmin*C2

for some absolute constants C1,C2.

Condition 8 (Parameter incoherence). The true tensor components are incoherent such that

Γ:=maxkikj{|β1ki*,β1kj*|,|β2ki*,β2kj*|,|β3ki*,β3kj*|}Cmin{K34R1,s12}.

Condition 9 (Random noise). We assume the random noise {ϵi}i=1n follows a sub-exponential tail with parameter σ satisfying 0<σ<Ck=1Kηk*.

(Algorithm display for non-symmetric sparse and low-rank tensor estimation; graphic not reproduced.)

2). Proof of Theorem 5:

The main distinguishing part of the proof for the non-symmetric update is Lemma 16 (the one-step oracle estimator), which parallels Lemma 11. For brevity, we restrict attention to the rank-one case and only provide the theoretical development for the one-step oracle estimator in this subsection. The generalization to the general-rank case follows exactly the same idea as in the proof for the symmetric update, by incorporating the incoherence Condition 8.

For rank-one non-symmetric tensor estimation, the model (VI.1) reduces to

yi=η*β1*β2*β3*,uiviwi+ϵi, for i=1,,n.

Suppose |supp(β1*)|=s1, |supp(β2*)|=s2, |supp(β3*)|=s3 and denote s = max{s1,s2,s3}. Define Fj(t)=supp(βj*)supp(βj(t)), F(t)=j=13Fj(t) and the oracle estimator as

β˜1(t+1)=φμϕh(β1(t))(βj(t)μϕ1L(β1(t),β2(t),β3(t))F(t)),

where h(β1(t)) has the form of

4log npn(i=1n(η(uiβ1(t))(viβ2(t))(wiβ3(t))yi)2*η23(viβ2(t))2(wiβ3(t))2)1/2, (A.76)

The definitions of β˜2(t+1) and β˜3(t+1) are similar.

Lemma 16. Let t ≥ 0 be an integer. Suppose Conditions 6–9 hold and {βj(t),η} satisfies the following upper bound

maxj=1,2,3η3βj(t)η*3βj*2η*3ε0,|ηη*|ε0 (A.77)

with probability at least 1 − CO(1/n). Assume the step size μ satisfies 0 < μ < μ0 for some small absolute constant μ0 and sdCs. Then {β˜j(t+1)} can be upper bounded as

maxj=1,2,3η3 β˜j(t+1)η*3βj*2(1μ12)maxj=1,2,3η3βj(t)η*3βj*2+μ3σ(η*3)23s log pn,

with probability at least 1 − 12s/n.

Proof. We focus on j = 1 first. To simplify the notation, we drop the superscript for the iteration index t and denote the iteration index t+1 by +. Moreover, denote β̄_j = η^{1/3}β_j, β̄_j^+ = η^{1/3}β_j^+, and β̄_j* = (η*)^{1/3}β_j* for j = 1, 2, 3. Then the gradient function is rewritten as

∇_1L(β̄_1, β̄_2, β̄_3) = η^{1/3} · (2/n) ∑_{i=1}^n ( (u_i⊤β̄_1)(v_i⊤β̄_2)(w_i⊤β̄_3) − y_i ) (v_i⊤β̄_2)(w_i⊤β̄_3) u_i.

According to the definition of thresholded function, β˜1+ can be explicitly written by

β˜1+=φμϕ(β¯1)(β1μϕ1L(β¯1,β¯2,β¯3)F)=β1μϕ1L(β¯1,β¯2,β¯3)F+μϕh(β¯1)γ,

where γ ∈ ℝp, supp(γ) ⊂ F, and ‖γ‖∞ ≤ 1. Then the oracle estimation error ‖η^{1/3}β̃_1^+ − (η*)^{1/3}β_1*‖_2 can be decomposed into the gradient update effect and the thresholding effect:

η3β˜1+η*3β1*2=β¯1β¯1*μη3ϕ1L(β¯1,β¯2,β¯3)F2gradient update effect +μη3ϕ|h(β¯1)|3sthresholded effect . (A.78)

Using the tri-convex structure of L(β̄_1, β̄_2, β̄_3), we borrow the analysis tools for vanilla gradient descent [55], given a sufficiently good initialization. Following this strategy, we decompose the gradient update effect in (A.78) into three parts,

η3β˜1+η*3β1*2β¯1β¯1*μη3ϕ1L˜(β¯1,β¯2*,β¯3*)F2I1+μη3ϕ1L˜(β¯1,β¯2*,β¯3*)F1L˜(β¯1,β¯2,β¯3)F2I2+μη3ϕ1L˜(β¯1,β¯2,β¯3)F1L(β¯1,β¯2,β¯3)F2I3+μη3ϕ|h(β¯1)|3s,I4

where ∇_1L̃ is the noiseless gradient defined as in (A.42). We will bound I1, I2, I3, I4 successively in the following four subsections. For simplicity, we drop the index subscript F throughout, as in Section L. By Lemma 15, ϕ = (1/n)∑_{i=1}^n y_i² approximates η*² up to constants.

3). Bounding I1:

In this section, let us denote

η^{1/3} L̃(β̄_1, β̄_2*, β̄_3*)/ϕ = f(β̄_1),   η^{1/3} ∇_1L̃(β̄_1, β̄_2*, β̄_3*)/ϕ = ∇f(β̄_1), (A.79)

where supp(∇f(β̄_1)) ⊂ F. When β_2 and β_3 are fixed, the update can be treated as a vanilla gradient descent update. The proof proceeds in three steps: the first two show that f(β̄_1) is Lipschitz differentiable and strongly convex on the constraint set F, and the last applies classical convex optimization analysis.

Step One:

Verify that f(β̄_1) is L-Lipschitz differentiable. For any β̄_1^{(1)} and β̄_1^{(2)} whose supports belong to F,

f(β¯1(1))f(β¯1(2))=(η3)2ϕ2ni=1n(ui(β¯1(1)β¯1(2))(viβ¯2*)2(wiβ¯3*)2)ui.

Then, there exist πSs1 such that

f(β¯1(1))f(β¯1(2))2=(η3)2ϕ|1ni=1n(ui(β¯1(1)β¯1(2))(viβ¯2*)2(wiβ¯3*)2)uiπ|.

Applying Lemma 2 with the test tensor (β̄_1^{(1)} − β̄_1^{(2)}) ⊗ β̄_2* ⊗ β̄_3*, we obtain

|i=1n[(ui(β¯1(1)β¯1(2))(uiπ)(viβ¯2*)2(wiβ¯3*)2)]|(1+δn,p,s)β¯1(1)β¯1(2)2η*43,

with probability at least 1 − 10/n3, where δn,p,s is defined in (IV.7). Under Condition (5) with some constant adjustments, we obtain

f(β¯1(1))f(β¯1(2))25716β¯1(1)β¯1(2)2. (A.80)

with probability at least 1 − 10/n3. Therefore, f(β¯1) is Lipschitz differentiable with Lipschitz constant L=578.

Step Two:

Verify that f(β̄_1) is α-strongly convex. It is equivalent to proving that ∇²f(β̄_1) ⪰ mI_p for some constant m > 0 on the constraint set. Based on inequality (3.3.19) in [63], we have

λmin(2(f(β¯1)))λmin(E(2f(β¯1)))λmax(2f(β¯1)E(2f(β¯1)). (A.81)

The lower bound on λmin(∇²f(β̄_1)) breaks into two parts: a lower bound on λmin(E(∇²f(β̄_1))) and an upper bound on λmax(∇²f(β̄_1) − E(∇²f(β̄_1))). The Hessian matrix of f(β̄_1) is given by

2f(β¯1)=(η3)2ϕ2ni=1n(v1β¯2*)2(wiβ¯3*)2uiui.

Since ui,vi,wi are independent with each other, we have E(2f(β¯1))=2I, which implies λmin(E(2f(β¯1)))2. On the other hand,

λmax(2f(β¯1)E(2f(β¯1)))=2f(β¯1)E(2f(β¯1))2a(2f(β¯1)E(2f(β¯1)))b=2ni=1n(viβ¯2*)2(wiβ¯3*)2(uia)(uib)E(i=1n(viβ¯2*)2(wiβ¯3*)2(uia)(uib))η*43.

where a,bSs1. Equipped with Lemma 2, it yields that with probability at least 1 − 10/n3,

λmax(2f(β¯1)E(2f(β¯1)))2δn,s,p.

Together with the lower bound of λmin(E(2f(β¯1))), we have

λmin(2f(β¯1))22δn,p,s,

Under Condition 5, the minimum eigenvalue of Hessian matrix 2f(β¯1) is lower bounded by 1910 with probability at least 1−10/n3. This guarantees that f(β¯1) is strongly-convex with α=1910.

Step Three:

Combining the Lipschitz condition, the strong convexity, and Lemma 3.11 in [55], we have

(f(β¯1)f(β¯1*))(β¯1β¯*)αLα+Lβ¯1β¯1*22+1α+Lf(β¯1)f(β¯1*)22.

Since the gradient vanishes at the optimal point, the above inequality times 2μ simplifies to

2μf(β¯1)(β¯1β¯1*)2μαLα+Lβ¯1β¯1*222μα+Lf(β¯1)22. (A.82)

Now it’s sufficient to bound β¯1β¯1*μf(β¯1)2 as follows

β¯1β¯1*μf(β¯1)22=β¯1tβ¯1*22+μ2f(β¯1)222μf(β¯1)(β¯1β¯*)(12μαLα+L)β¯1β¯1*22+μ(μ2α+L)f(β¯1)22.

where L and α are the Lipschitz constant and the strong convexity parameter, respectively. If μ < 80/361, the last term can be dropped and we obtain the desired upper bound,

β¯1β¯1*μη3ϕ1L˜(β¯1,β¯2*,β¯3*)2(13μ)β¯1β¯1*2, (A.83)

with probability 1 − 20/n3. This ends the proof. ■

4). Bounding I2:

For simplicity, write z_1 = β̄_1 − β̄_1*, z_2 = β̄_2 − β̄_2*, z_3 = β̄_3 − β̄_3*. By the definition of the noiseless gradient, I2 can be decomposed as

η131L˜(β¯1,β¯2*,β¯3*)1L˜(β¯1,β¯2,β¯3)2        1ni=1n(uiβ¯1)(viβ¯2)(viz2)(wiβ¯3)(wiz3)ui2+1ni=1n(uiβ¯1)(viβ¯2)(viz2)(wiβ¯3)(wiβ¯3*)ui2+1ni=1n(uiβ¯1)(viβ¯2*)(viβ¯2)(wiβ¯3)(wiz3)ui2+1ni=1n(uiz1)(viβ¯2*)(viz2)(wiβ¯3*)(wiz3)ui2+1ni=1n(uiz1)(viβ¯2*)(viz2)(wiβ¯3*)2ui2+1ni=1n(uiz1)(viβ¯2*)2(wiβ¯3*)(wiz3)ui2.

Repeatedly using Lemma 2, we obtain

η131L˜(β¯1,β¯2*,β¯3*)1L˜(β¯1,β¯2,β¯3)2(1+δn,p,s)[(1+ε0)3ε0+(1+ε0)3+(1+ε0)3+ε02+2ε0]η*43maxjzj252(1+δn,p,s)η*4352(1+δn,p,s)η*43maxjzj2,

for sufficiently small ε0 with probability at least 1 − 60/n3. Under Condition 5, it suffices to get

η3ϕ1L˜(β¯1,β¯2,β¯3)1L˜(β¯1,β¯2*,β¯3*)283maxjβ¯jβ¯j*2, (A.84)

with probability at least 1 − 6/n.

5). Bounding I3:

I3 quantifies the statistical error. By the definition of noiseless gradient and noisy gradient, we have

η3ϕ1L˜(β¯1,β¯2,β¯3)1L(β¯1,β¯2,β¯3)2=(η3)2ϕ2ni=1nϵi(viβ¯2)(wiβ¯3)ui2.

The proof of this part essentially coincides with the proof for symmetric tensor estimation. Combining Lemmas 1 and 23, we have

|2ni=1nϵi(viβ¯2)(wiβ¯3)uij|C(1+ε0)2η*23σ(logn)32n,

with probability at least 1 − 4/n. Applying union bound over 3s coordinates, it suffices to get

(maxj[3s]|1ni=1nϵi(viβ¯2)(wiβ¯3)uij|C(1+ε0)2η*23σ(logn)32n)12sn.

Therefore, we reach

η3ϕ1L˜(β¯1,β¯2,β¯3)1L(β¯1,β¯2,β¯3)22Cη*23σ3s(log n)3n,

with probability at least 1 − 12s/n.

6). Bounding I4:

According to the definition of thresholding level h(β1) in (A.76), we can bound the square as follows,

(η3)2ϕ2h2(β¯1)=(η3)4ϕ24 log npn2i=1n((uiβ¯1)(viβ¯2)(wiβ¯3)(uiβ¯1*)(viβ¯2*)(wiβ¯3*)ϵi)2(viβ¯2)2(wiβ¯3)2

Based on the basic inequality (a + b)2 ≤ 2(a2 + b2), we have

((uiβ¯1)(viβ¯2)(wiβ¯3)(uiβ¯1*)(viβ¯2*)(wiβ¯3*)ϵi)22((uiβ¯1)(viβ¯2)(wiβ¯3)(uiβ¯1*)(viβ¯2*)(wiβ¯3*))2+2ϵi2.

Denote by I1 and I2 the terms corresponding to the optimization error and the statistical error, respectively:

I1=(η3)4ϕ24 log npn2i=1n((uiβ¯1)(viβ¯2)(wiβ¯3)(uiβ¯1*)(viβ¯2*)(wiβ¯3*))I2=(η3)4ϕ24 log npn2i=1nϵi2(viβ¯2)2(wiβ¯3)2.

Next, I1 is decomposed by some high-order polynomials as follows

I1=(η3)4ϕ24 log npn2(i=1n(uiz1)2(viz2)2(wiz3)2(viβ¯2)2(wiβ¯3)2+i=1n(uiz1)2(viz2)2(wiβ¯3*)2(viβ¯2)2(wiβ¯3)2+i=1n(uiz1)2(viβ¯2*)2(wiz3)2(viβ¯2)2(wiβ¯3)2+i=1n(uiz1)2(viβ¯2*)2(wiβ¯3*)2(viβ¯2)2(wiβ¯3)2+i=1n(uiβ¯1*)2(viβ¯2*)2(wiz3)2(viβ¯2)2(wiβ¯3)2+i=1n(uiβ¯1*)2(viz2)2(wiβ¯3*)2(viβ¯2)2(wiβ¯3)2+i=1n(uiβ¯1*)2(viz2)2(wiz3)2(viβ¯2)2(wiβ¯3)2). (A.85)

Each term contains products of Gaussian linear forms of total degree up to ten. For the first term, by Lemma 1,

1ni=1n(uiz1)2(viz2)2(wiz3)2(viβ¯2)2(wiβ¯3)2(1+ε0)4ε04(1+C(log n)5n)η*83maxj=1,2,3zj22,

with probability at least 1 − 1/n. Similar bounds hold for the other terms. As long as n ≥ C log^{10} n, we have, with probability at least 1 − 7/n,

I17 log pnmaxj=1,2,3β¯jβ¯j*22. (A.86)

Now we turn to bound I2. For fixed {ϵi}, we have,

|i=1nϵi2(viβ¯2)2(wiβ¯3)2i=1nϵi2β¯222β¯322|C(log n)2ϵ22β¯222β¯322.

with probability at least 1 − n−1. Combining with Lemma 23,

I24σ2η*43log pn. (A.87)

Putting (A.86) and (A.87) together, the thresholding effect can be bounded by

η3ϕ|h(β1)|7log npnmaxj=1,2,3β¯jβ¯j*2+2σ(η*3)2log npn, (A.88)

with probability at least 1 − 8/n, provided n ≳ (logn)10. ■

7). Summary:

Putting the upper bounds (A.83), (A.84) and (A.88) together, we obtain that if step size μ satisfies 0 < μ < μ0 for some small μ0,

η3β˜1+η*3β1*2(1μ12)maxj=1,2,3β¯jβ¯j*2+μ3σ(η*3)23s log pn,

with probability at least 1−12s/n. This finishes our proof. ■

P. Matrix Form Gradient and Stochastic Gradient descent

1). Matrix Formulation of Gradient:

In this subsection, we provide detailed derivations of (III.7) and (VI.5).

Lemma A.1. Let η=(η1,,ηK)K×1,X=(x1,,xn)p×n and B=(β1,,βK)p×K. The gradient of symmetric tensor estimation empirical risk function (III.5) can be written in a matrix form as follows

BL(B,η)=6n[((BX))3ηy][(((BX))2η)X].

Proof. First, consider the gradient with respect to the k-th component,

∇L_k(β_k) = (6/n) ∑_{i=1}^n ( ∑_{k′=1}^K η_{k′}(x_i⊤β_{k′})³ − y_i ) η_k (x_i⊤β_k)² x_i ∈ ℝ^{p×1},

for k = 1,…,K. Correspondingly, each part can be written in matrix form:

((BXK×n))3ηyn×1(((BX))2η)XpK×n.

This implies that [((BX))3ηy][(((BX))2η)X]1×pK. Note that BL(B,η)=(L1(β1),,LK(βK))1×pK. The conclusion can be easily derived. ■
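For concreteness, the following numpy sketch implements the matrix-form gradient of Lemma A.1 and checks it against the per-component expression displayed in this proof; the 6/n factor and all shapes follow the formulas reproduced above, while the data are synthetic placeholders.

```python
import numpy as np

def grad_matrix_form(B, eta, X, y):
    """Matrix-form gradient for the symmetric model: columns are grad_k L(beta_k)."""
    BX = B.T @ X                                   # K x n, entries x_i' beta_k
    resid = (BX ** 3).T @ eta - y                  # n,  sum_k eta_k (x_i' beta_k)^3 - y_i
    W = (BX ** 2) * eta[:, None]                   # K x n, eta_k * (x_i' beta_k)^2
    return (X @ (W * resid[None, :]).T) * (6 / y.size)   # p x K

# Self-consistency check against the per-component formula above (synthetic data).
rng = np.random.default_rng(0)
p, K, n = 6, 2, 50
X = rng.standard_normal((p, n)); B = rng.standard_normal((p, K))
eta = rng.uniform(1, 2, K); y = rng.standard_normal(n)
G = grad_matrix_form(B, eta, X, y)

g0 = np.zeros(p)
for i in range(n):
    xi = X[:, i]
    res = sum(eta[k] * (xi @ B[:, k]) ** 3 for k in range(K)) - y[i]
    g0 += 6 / n * res * eta[0] * (xi @ B[:, 0]) ** 2 * xi
print(np.abs(G[:, 0] - g0).max())   # ~ 0: matrix form matches the loop form
```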

Lemma 17. Let η=(η1,,ηK)K×1, U=(u1,,un)p1×n. The gradient of non-symmetric tensor estimation empirical risk function (VI.3) can be written in a matrix form as follows

B1L(B1,B2,B3,η)=D(C1U),

where the matrices D and C1 are defined in the proof below.

Proof. Recall that * and ⊙ denote the Hadamard product and the Khatri–Rao product, respectively. The dimensions of D, C1, and C1⊙U can be calculated as follows:

D=(B1U)n×K*(B2V)n×K*(B3W)n×Kηyn×1,C1=(B2V)*(B3W)ηn×K,C1UKp1×n.

Therefore,

B1L(B1,B2,B3,η)=D(C1U)=(1L(β1),,KL(βK)).

2). Stochastic Gradient descent:

Stochastic thresholded gradient descent is a stochastic approximation of the (batch) thresholded gradient descent method. Note that the empirical risk function (III.5) can be written as a sum of differentiable functions. Following (III.7), the gradient of (III.5) evaluated at the i-th sketching {yi, xi} can be written as

BLi(B,η)=[((Bxi))3ηyi]*[(((Bxi))2η)xi]1×pK,

Thus, the overall gradient ∇BL(B,η) defined in (III.7) can be expressed as an average of the per-sample gradients ∇BLi(B,η):

BLi(B,η)=1ni=1nBLi(B,η).

The thresholding step remains the same as Step 3 in Algorithm 1. Then one iteration of the symmetric stochastic thresholded gradient descent update is summarized by

vec(B(t+1))=φμSGDϕh(B(t))(vec(B(t))μSGDϕBLi(B(t))).
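Below is a minimal, self-contained sketch of one stochastic thresholded-gradient update, assuming a hard-thresholding operator for φ (its exact form is given in (III.8)); the step size, the threshold level, and the data are illustrative placeholders rather than the paper's tuned quantities.

```python
import numpy as np

def hard_threshold(v, tau):
    return np.where(np.abs(v) > tau, v, 0.0)

def sgd_step(B, eta, x_i, y_i, mu_sgd, phi, h_level):
    """One stochastic update using the per-sample gradient of the symmetric risk."""
    s = B.T @ x_i                                  # (K,), entries x_i' beta_k
    resid = eta @ (s ** 3) - y_i                   # residual at sketching (y_i, x_i)
    G_i = 6 * resid * np.outer(x_i, eta * s ** 2)  # p x K, column k = grad_k at sample i
    B_new = B - (mu_sgd / phi) * G_i
    return hard_threshold(B_new, (mu_sgd / phi) * h_level)

rng = np.random.default_rng(0)
p, K = 8, 2
B = rng.standard_normal((p, K)); eta = np.ones(K)
x_i, y_i = rng.standard_normal(p), 0.3
B = sgd_step(B, eta, x_i, y_i, mu_sgd=0.01, phi=1.0, h_level=0.05)
```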

Q. Technical Lemmas

Lemma 18. Suppose xp is a standard Gaussian random vector. For any non-random vector a,b,cp, we have the following tensor expectation calculation,

E( (a⊤x)(b⊤x)(c⊤x) · x⊗x⊗x ) = (a⊗b⊗c + a⊗c⊗b + b⊗a⊗c + b⊗c⊗a + c⊗a⊗b + c⊗b⊗a) + ∑_{m=1}^{p} [ (b⊤c)(a⊗e_m⊗e_m + e_m⊗a⊗e_m + e_m⊗e_m⊗a) + (a⊤c)(b⊗e_m⊗e_m + e_m⊗b⊗e_m + e_m⊗e_m⊗b) + (a⊤b)(c⊗e_m⊗e_m + e_m⊗c⊗e_m + e_m⊗e_m⊗c) ], (A.89)

where e_m is the m-th canonical basis vector in ℝ^p.

Proof. Recall that for a standard Gaussian random variable x, its odd moments vanish and its relevant even moments are E(x⁶) = 15 and E(x⁴) = 3. Expanding the left-hand side of (A.89) entrywise and comparing it with the right-hand side yields the conclusion. Details are omitted here. ■
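As a sanity check of the identity (A.89) displayed above, the following Monte Carlo sketch (hypothetical dimension p, unit-norm vectors) compares the empirical weighted third-moment tensor with the stated closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 1_000_000
a, b, c = rng.standard_normal((3, p))
a, b, c = a/np.linalg.norm(a), b/np.linalg.norm(b), c/np.linalg.norm(c)
x = rng.standard_normal((n, p))

y = (x @ a) * (x @ b) * (x @ c)
lhs = np.einsum('i,ij,ik,il->jkl', y, x, x, x) / n        # E[(a'x)(b'x)(c'x) x⊗x⊗x]

I = np.eye(p)
perm = sum(np.einsum('j,k,l->jkl', u, v, w)               # six permutation terms
           for (u, v, w) in [(a, b, c), (a, c, b), (b, a, c), (b, c, a), (c, a, b), (c, b, a)])
inner = ((b @ c) * (np.einsum('j,kl->jkl', a, I) + np.einsum('k,jl->jkl', a, I) + np.einsum('l,jk->jkl', a, I))
         + (a @ c) * (np.einsum('j,kl->jkl', b, I) + np.einsum('k,jl->jkl', b, I) + np.einsum('l,jk->jkl', b, I))
         + (a @ b) * (np.einsum('j,kl->jkl', c, I) + np.einsum('k,jl->jkl', c, I) + np.einsum('l,jk->jkl', c, I)))
print(np.abs(lhs - (perm + inner)).max())                 # shrinks like 1/sqrt(n)
```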

Lemma 19. Suppose up1, vp2, wp3 are independent standard Gaussian random vectors. For any non-random vector ap1, bp2, cp3, we have the following tensor expectation calculation

E((au)(bv)(cw)uvw)=abc. (A.90)

Proof. Due to the independence among u,v,w, the conclusion is easy to obtain by using the moment of standard Gaussian random variable. ■

Note that the left-hand side of (A.89) is the expectation of a random rank-one tensor. Taking the inner product of both sides with a non-random rank-one tensor of the same dimensions, say a1⊗b1⊗c1, allows us to compute expectations of products of Gaussian linear forms; see the next lemma for details.

Lemma A.2. Suppose xp is a standard Gaussian random vector. For any non-random vector a,b,c,dp, we have the following expectation calculation

E(xa)6=15a26,E(xa)5(xb)=15a24(ab),E(xa)4(xb)2=12a22(ab)2+3a24b22,E(xa)3(xb)3=6(ab)3+9(ab)a22b22,E(xa)3(xb)2(xc)=6(ab)2(ac)+6(ab)(bc)(aa)+3(ac)(bb)(aa),E(xa)2(xb)(xc)2(xd)=2(ac)2(bd)+4(ac)(bc)(ad)+6(ac)(ab)(cd)+3(cx)(bd)(aa).

Proof. Note that E((x⊤a)³(x⊤b)³) = E⟨(x⊤a)³ x⊗x⊗x, b⊗b⊗b⟩. Then we can apply the general result in Lemma 18 and compare both sides to obtain the conclusion. The other parts follow the same strategy. ■

The next lemma provides a concentration bound for non-symmetric rank-one tensors under the sparse tensor spectral norm.

Lemma 20. Suppose X = (x_1,…,x_n), Y = (y_1,…,y_n), Z = (z_1,…,z_n) are three n × p random matrices whose entries have bounded ψ2-norms, ‖X_{ij}‖_{ψ2} ≤ K_x, ‖Y_{ij}‖_{ψ2} ≤ K_y, ‖Z_{ij}‖_{ψ2} ≤ K_z, and whose rows are independent. Then there exists an absolute constant C such that

(1ni=1n[xiyiziE(xiyizi)]sCKxKyKzδn,p,s)p1.(1ni=1n[xixixiE(xixixi)]sCKx3δn,p,s)p1.

Here, ‖ · ‖s is the sparse tensor spectral norm defined in (II.3) and δn,p,s=s log(ep/s)/n+s3 log(ep/s)3/n2.

Proof. Bounding a spectral norm typically relies on an ϵ-net construction. Since we bound a sparse tensor spectral norm, our strategy is to decompose the sparse set and construct an ϵ-net on each piece. Define the sparse set B_0 = {x ∈ ℝ^p : ‖x‖_2 = 1, ‖x‖_0 ≤ s}, and let B_{0,s} be the s-dimensional unit sphere B_{0,s} = {x ∈ ℝ^s : ‖x‖_2 = 1}. Note that B_0, the set of s-sparse unit vectors, can be expressed as a union of copies of B_{0,s} obtained by padding zeros, and there are at most (p choose s) such subsets.

Recalling the definition of sparse tensor spectral norm in (II.3), we have

A=1ni=1n[xiyiziE(xiyizi)]ssupχ1,χ2,χ3B0|1ni=1n[xi,χ1yi,χ2zi,χ3E(xi,χ1yi,χ2zi,χ3)]|.

Instead of constructing an ϵ-net on B_0 directly, we construct an ϵ-net for each copy of B_{0,s}. Define N_{B_{0,s}} as a 1/2-net of B_{0,s}. From Lemma 3.18 in [64], the cardinality of N_{B_{0,s}} is bounded by 5^s. By Lemma 21, we obtain

 sup χ1,χ2,χ3B0,s|1ni=1n[xi,χ1yi,χ2zi,χ3E(xi,χ1yi,χ2zi,χ3)]|23supχ1,χ2,χ3NB0,a|1ni=1n[xi,χ1yi,χ2zi,χ3E(xi,χ1yi,χ2zi,χ3)]|. (A.91)

By rotation invariance of sub-Gaussian random variable, xi,χ1, yi,χ2, zi,χ3 are still sub-Gaussian random variables with ψ2-norm bounded by Kx,Ky,Kz, respectively. Applying Lemma 1 and union bound over NB0,s, the right hand side of (A.91) can be bounded by

RHS8KxKyKzC(logδ1n+(logδ1)3n2),

where the bound fails with probability at most (5^s)³ δ, for any 0 < δ < 1.

Lastly, taking the union bound over all possible subsets B0,s yields that

(A8KxKyKzC(logδ1n+(logδ1)3n2))(eps)s(5s)3δ=(125eps)sδ.

Setting δ such that (125ep/s)^s δ = p^{−1}, we obtain, with probability at least 1 − 1/p,

ACKxKyKz(s log(p/s)n+s3log3(p/s)n2),

after some adjustment of the constant C. The proof of the symmetric case is similar to the non-symmetric case and is omitted here. ■

Lemma 21 (Tensor Covering Number, Lemma 4 in [65]). Let N be an ϵ-net for a set B with respect to a norm ‖·‖. Then the spectral norm of a d-mode tensor A is bounded by

supx1,,xd1BA×1x1×d1xd12(11ε)d1supx1xd1A×1x1×d1xd12.

This immediately implies that the spectral norm of a d-mode tensor A is bounded by

A2(11ϵ)d1supx1xd1NA×1x1×d1xd12,

where N is the ϵ-net for the unit sphere S^{n−1} in ℝ^n.

Lemma 22 (Sub-Gaussianess of the Product of Random Variables). Suppose X1 is a bounded random variable with |X1| ≤ K1 almost surely for some K1 and X2 is a sub-Gaussian random variable with Orlicz norm ‖X2ψ2K2. Then X1X2 is still a sub-Gaussian random variable with Orlicz norm ‖X1X2ψ2 = K1K2.

Proof: Following the definition of sub-Gaussian random variable, we have

(|X1X2|>t)=(|X2|>t|X1|)(|X2|>t|K1|)exp(1t2/K12K22),

holds for all t ≥ 0. This ends the proof. ■

Lemma 23 (Tail Probability for the Sum of Sub-exponential Random Variables (Lemma A.7 in [48])). Suppose ϵ1,…, ϵn are independent centered sub-exponential random variables with

$\sigma:=\max_{1\le i\le n}\|\epsilon_i\|_{\psi_1}.$

Then with probability at least 1 − 3/n, we have

\[
\Big|\frac{1}{n}\sum_{i=1}^n\epsilon_i\Big|\le C_0\sigma\sqrt{\frac{\log n}{n}},\quad\|\epsilon\|_\infty\le C_0\sigma\log n,\quad\Big|\frac{1}{n}\sum_{i=1}^n\epsilon_i^2\Big|\le C_0\sigma^2,\quad\Big|\frac{1}{n}\sum_{i=1}^n\epsilon_i^4\Big|\le C_0\sigma^4,
\]

for some constant C0.
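An illustrative simulation (added here; the choice of centered Exp(1) variables and the sample size are assumptions for this sketch, and the unspecified constant C0 is only checked qualitatively):

import numpy as np

# Centered Exp(1) variables are sub-exponential with psi_1-norm of order one,
# so the four averages below should be within a modest constant of the stated scales.
rng = np.random.default_rng(6)
n = 10_000
eps = rng.exponential(1.0, size=n) - 1.0
print(abs(eps.mean()), np.sqrt(np.log(n) / n))  # ~ C0 * sigma * sqrt(log n / n)
print(np.abs(eps).max(), np.log(n))             # ~ C0 * sigma * log n
print(np.mean(eps**2), 1.0)                     # ~ C0 * sigma^2
print(np.mean(eps**4), 1.0)                     # ~ C0 * sigma^4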

Lemma 24 (Tail Probability for the Sum of Weibull Distributions (Lemma 3.6 in [34])). Let α ∈ [1,2] and Y1,…,Yn be independent symmetric random variables satisfying ℙ(|Yi| ≥ t) = exp(−t^α). Then for every vector $a=(a_1,\ldots,a_n)\in\mathbb{R}^n$ and every t ≥ 0,

\[
\mathbb{P}\bigg(\Big|\sum_{i=1}^na_iY_i\Big|\ge t\bigg)\le 2\exp\bigg(-c\min\Big(\frac{t^2}{\|a\|_2^2},\frac{t^\alpha}{\|a\|_{\alpha^*}^\alpha}\Big)\bigg).
\]

Proof. It follows by combining Corollaries 2.9 and 2.10 in [58]. ■

Lemma 25 (Moments for the Sum of Weibull Distributions (Corollary 1.2 in [66])). Let X1, X2,…,Xn be a sequence of independent symmetric random variables satisfying ℙ(|Xi| ≥ t) = exp(−t^α), where 0 < α < 1. Then, for p ≥ 2 and some constant C(α) which depends only on α,

\[
\Big\|\sum_{i=1}^na_iX_i\Big\|_p\le C(\alpha)\big(\sqrt{p}\,\|a\|_2+p^{1/\alpha}\|a\|_\infty\big).
\]
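A generation sketch for the Weibull tail condition shared by Lemmas 24 and 25 (added for illustration; α and the sample size are arbitrary choices): since ℙ(E^{1/α} ≥ t) = exp(−t^α) for E ~ Exp(1), symmetric variables with this tail can be simulated as a random sign times E^{1/α}.

import numpy as np

# Draw symmetric random variables with P(|Y| >= t) = exp(-t^alpha) and check the tail.
rng = np.random.default_rng(3)
alpha, n = 1.5, 1_000_000
Y = rng.choice([-1.0, 1.0], size=n) * rng.exponential(1.0, size=n) ** (1.0 / alpha)
t = 2.0
print(np.mean(np.abs(Y) >= t), np.exp(-t**alpha))  # these should roughly match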

Lemma 26 (Stein’s Lemma [56]). Let $x\in\mathbb{R}^d$ be a random vector with joint density function p(x). Suppose the score function ∇x log p(x) exists. Consider any continuously differentiable function G(x) defined on ℝ^d. Then, we have

\[
\mathbb{E}\big[G(x)\otimes\nabla_x\log p(x)\big]=-\mathbb{E}\big[\nabla_xG(x)\big].
\]
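A one-dimensional numerical check (added for illustration; the choices G(x) = x³ and x ~ N(0,1) are assumptions for this example): with the score ∇x log p(x) = −x, both sides equal −3.

import numpy as np

# For x ~ N(0,1) and G(x) = x^3: E[G(x) * d/dx log p(x)] = -E[x^4] = -3,
# while -E[G'(x)] = -E[3 x^2] = -3.
rng = np.random.default_rng(4)
x = rng.standard_normal(5_000_000)
lhs = np.mean(x**3 * (-x))   # E[G(x) * score(x)]
rhs = -np.mean(3 * x**2)     # -E[G'(x)]
print(lhs, rhs)              # both should be close to -3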

Lemma 27 (Khinchin-Kahane Inequality (Theorem 1.3.1 in [67])). Let $\{a_i\}_{i=1}^n$ be a finite non-random sequence, $\{\varepsilon_i\}_{i=1}^n$ be a sequence of independent Rademacher variables, and 1 < p < q < ∞. Then

\[
\Big\|\sum_{i=1}^n\varepsilon_ia_i\Big\|_q\le\Big(\frac{q-1}{p-1}\Big)^{1/2}\Big\|\sum_{i=1}^n\varepsilon_ia_i\Big\|_p.
\]
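An exact finite check (added for illustration; the coefficient vector and the choice p = 2, q = 4 are arbitrary): enumerating all 2^n sign patterns computes both L_p norms exactly.

import itertools
import numpy as np

# Exact check of the Khinchin-Kahane inequality for Rademacher sums with p = 2, q = 4.
a = np.array([1.0, -2.0, 0.5, 3.0, 1.5, -0.7, 2.2, 0.9])
n = len(a)
sums = np.array([np.dot(s, a) for s in itertools.product((-1.0, 1.0), repeat=n)])
p, q = 2, 4
norm_p = np.mean(np.abs(sums) ** p) ** (1 / p)
norm_q = np.mean(np.abs(sums) ** q) ** (1 / q)
print(norm_q, np.sqrt((q - 1) / (p - 1)) * norm_p)  # left value <= right value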

Lemma 28. Suppose each non-zero element of $\{x_k\}_{k=1}^K$ is drawn from the standard Gaussian distribution and ‖xk‖0 ≤ s for k ∈ [K]. Then, for any 0 < δ ≤ 1,

\[
\mathbb{P}\Big(\max_{1\le k_1<k_2\le K}\big|\langle x_{k_1},x_{k_2}\rangle\big|\le C\sqrt{s\big(\log K+\log(1/\delta)\big)}\Big)\ge 1-\delta,
\]

where C is some constant.

Proof. Denote by $S_{k_1k_2}\subseteq\{1,2,\ldots,p\}$ the set of indices j such that both $x_{k_1,j}\neq 0$ and $x_{k_2,j}\neq 0$. By the definition of $S_{k_1k_2}$, we have $|S_{k_1k_2}|\le s$ and $\langle x_{k_1},x_{k_2}\rangle=\sum_{j=1}^px_{k_1,j}x_{k_2,j}=\sum_{j\in S_{k_1k_2}}x_{k_1,j}x_{k_2,j}$. Applying a standard Hoeffding-type concentration inequality,

\[
\mathbb{P}\big(|\langle x_{k_1},x_{k_2}\rangle|\ge t\big)=\mathbb{P}\bigg(\Big|\sum_{j\in S_{k_1k_2}}x_{k_1,j}x_{k_2,j}\Big|\ge t\bigg)\le e\exp\Big(-\frac{ct^2}{s}\Big).
\]

Letting $ct^2/s=\log(K^2/\delta)$ and taking a union bound over all pairs $(k_1,k_2)$, we reach the conclusion. ■
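A simulation sketch of this bound (added for illustration; p, s, K, and δ are arbitrary choices):

import numpy as np

# K sparse vectors with s nonzero standard Gaussian entries each; the largest
# pairwise inner product should stay below a constant times sqrt(s (log K + log(1/delta))).
rng = np.random.default_rng(5)
p, s, K, delta = 100, 20, 50, 0.05
X = np.zeros((K, p))
for k in range(K):
    idx = rng.choice(p, size=s, replace=False)
    X[k, idx] = rng.standard_normal(s)
G = X @ X.T
np.fill_diagonal(G, 0.0)
print(np.abs(G).max(), np.sqrt(s * (np.log(K) + np.log(1 / delta))))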

Contributor Information

Botao Hao, Department of Electrical Engineering, Princeton University, Princeton, NJ 08540.

Anru Zhang, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706.

Guang Cheng, Department of Statistics, Purdue University, West Lafayette, IN 47906.

References

  • [1]. Kroonenberg PM, Applied Multiway Data Analysis. Wiley Series in Probability and Statistics, 2008.
  • [2]. Kolda T and Bader B, “Tensor decompositions and applications,” SIAM Review, vol. 51, pp. 455–500, 2009.
  • [3]. Zhou H, Li L, and Zhu H, “Tensor regression with applications in neuroimaging data analysis,” Journal of the American Statistical Association, vol. 108, pp. 540–552, 2013.
  • [4]. Li X, Xu D, Zhou H, and Li L, “Tucker tensor regression and neuroimaging analysis,” Statistics in Biosciences, vol. 10, no. 3, pp. 520–545, 2018.
  • [5]. Sun WW and Li L, “Store: sparse tensor response regression and neuroimaging analysis,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 4908–4944, 2017.
  • [6]. Caiafa CF and Cichocki A, “Multidimensional compressed sensing and their applications,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 3, no. 6, pp. 355–380, 2013.
  • [7]. Friedland S, Li Q, and Schonfeld D, “Compressive sensing of sparse tensors,” IEEE Transactions on Image Processing, vol. 23, no. 10, pp. 4438–4447, 2014.
  • [8]. Liu J, Musialski P, Wonka P, and Ye J, “Tensor completion for estimating missing values in visual data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 208–220, 2013.
  • [9]. Yuan M and Zhang C-H, “On tensor completion via nuclear norm minimization,” Foundations of Computational Mathematics, vol. 16, no. 4, pp. 1031–1068, 2016.
  • [10]. Yuan M and Zhang C-H, “Incoherent tensor norms and their applications in higher order tensor completion,” IEEE Transactions on Information Theory, vol. 63, no. 10, pp. 6753–6766, 2017.
  • [11]. Zhang A, “Cross: Efficient low-rank tensor completion,” The Annals of Statistics, vol. 47, no. 2, pp. 936–964, 2019.
  • [12]. Montanari A and Sun N, “Spectral algorithms for tensor completion,” Communications on Pure and Applied Mathematics, vol. 71, no. 11, pp. 2381–2425, 2018.
  • [13]. Ghadermarzy N, Plan Y, and Yilmaz Ö, “Near-optimal sample complexity for convex tensor completion,” Information and Inference: A Journal of the IMA, vol. 8, no. 3, pp. 577–619, 2018.
  • [14]. Zhang Z and Aeron S, “Exact tensor completion using t-svd,” IEEE Transactions on Signal Processing, vol. 65, no. 6, pp. 1511–1526, 2016.
  • [15]. Bengua JA, Phien HN, Tuan HD, and Do MN, “Efficient tensor completion for color image and video recovery: Low-rank tensor train,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2466–2479, 2017.
  • [16]. Raskutti G, Yuan M, Chen H, et al., “Convex regularization for high-dimensional multiresponse tensor regression,” The Annals of Statistics, vol. 47, no. 3, pp. 1554–1584, 2019.
  • [17]. Chen H, Raskutti G, and Yuan M, “Non-convex projected gradient descent for generalized low-rank tensor regression,” The Journal of Machine Learning Research, vol. 20, no. 1, pp. 172–208, 2019.
  • [18]. Li L and Zhang X, “Parsimonious tensor response regression,” Journal of the American Statistical Association, pp. 1–16, 2017.
  • [19]. Zhang A, Luo Y, Raskutti G, and Yuan M, “Islet: Fast and optimal low-rank tensor regression via importance sketching,” arXiv preprint arXiv:1911.03804, 2019.
  • [20]. Romera-Paredes B, Aung MH, Bianchi-Berthouze N, and Pontil M, “Multilinear multitask learning,” in Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, pp. III–1444–III–1452, JMLR.org, 2013.
  • [21]. Bien J, Taylor J, Tibshirani R, et al., “A lasso for hierarchical interactions,” The Annals of Statistics, vol. 41, no. 3, pp. 1111–1141, 2013.
  • [22]. Hao N and Zhang HH, “Interaction screening for ultrahigh-dimensional data,” Journal of the American Statistical Association, vol. 109, no. 507, pp. 1285–1301, 2014.
  • [23]. Fan Y, Kong Y, Li D, and Lv J, “Interaction pursuit with feature screening and selection,” arXiv preprint arXiv:1605.08933, 2016.
  • [24]. Basu S, Kumbier K, Brown JB, and Yu B, “Iterative random forests to discover predictive and stable high-order interactions,” Proceedings of the National Academy of Sciences, p. 201711236, 2018.
  • [25]. Li N and Li B, “Tensor completion for on-board compression of hyperspectral images,” in 2010 IEEE International Conference on Image Processing, pp. 517–520, IEEE, 2010.
  • [26]. Vasilescu MAO and Terzopoulos D, “Multilinear subspace analysis of image ensembles,” in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, vol. 2, pp. II–93, IEEE, 2003.
  • [27]. Sun WW, Lu J, Liu H, and Cheng G, “Provable sparse tensor decomposition,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 79, no. 3, pp. 899–916, 2017.
  • [28]. Rauhut H, Schneider R, and Stojanac Ž, “Low rank tensor recovery via iterative hard thresholding,” Linear Algebra and its Applications, vol. 523, pp. 220–262, 2017.
  • [29]. Li X, Haupt J, and Woodruff D, “Near optimal sketching of low-rank tensor regression,” in Advances in Neural Information Processing Systems, pp. 3466–3476, 2017.
  • [30]. Wang Z, Liu H, and Zhang T, “Optimal computational and statistical rates of convergence for sparse nonconvex learning problems,” The Annals of Statistics, vol. 42, no. 6, p. 2164, 2014.
  • [31]. Loh P-L and Wainwright MJ, “Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima,” Journal of Machine Learning Research, vol. 16, pp. 559–616, 2015.
  • [32]. Cai TT and Zhang A, “Rop: Matrix recovery via rank-one projections,” The Annals of Statistics, vol. 43, no. 1, pp. 102–138, 2015.
  • [33]. Chen Y, Chi Y, and Goldsmith AJ, “Exact and stable covariance estimation from quadratic sampling via convex programming,” IEEE Transactions on Information Theory, vol. 61, no. 7, pp. 4034–4059, 2015.
  • [34]. Adamczak R, Litvak AE, Pajor A, and Tomczak-Jaegermann N, “Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling,” Constructive Approximation, vol. 34, no. 1, pp. 61–88, 2011.
  • [35]. Candès EJ and Recht B, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, p. 717, 2009.
  • [36]. Keshavan RH, Montanari A, and Oh S, “Matrix completion from a few entries,” IEEE Transactions on Information Theory, vol. 56, no. 6, pp. 2980–2998, 2010.
  • [37]. Koltchinskii V, Lounici K, and Tsybakov AB, “Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion,” The Annals of Statistics, pp. 2302–2329, 2011.
  • [38]. Richard E and Montanari A, “A statistical model for tensor pca,” in Advances in Neural Information Processing Systems 27 (Ghahramani Z, Welling M, Cortes C, Lawrence ND, and Weinberger KQ, eds.), pp. 2897–2905, Curran Associates, Inc., 2014.
  • [39]. Zhang A and Xia D, “Tensor SVD: Statistical and computational limits,” IEEE Transactions on Information Theory, vol. 64, pp. 7311–7338, November 2018.
  • [40]. Mu C, Huang B, Wright J, and Goldfarb D, “Square deal: Lower bounds and improved relaxations for tensor recovery,” in Proceedings of the 31st International Conference on Machine Learning (Xing EP and Jebara T, eds.), vol. 32 of Proceedings of Machine Learning Research, (Bejing, China), pp. 73–81, PMLR, 22–24 June 2014.
  • [41]. Friedland S and Lim L-H, “Nuclear norm of higher-order tensors,” Mathematics of Computation, vol. 87, no. 311, pp. 1255–1281, 2018.
  • [42]. Hillar CJ and Lim L-H, “Most tensor problems are np-hard,” Journal of the ACM (JACM), vol. 60, no. 6, p. 45, 2013.
  • [43]. Anandkumar A, Ge R, Hsu D, Kakade SM, and Telgarsky M, “Tensor decompositions for learning latent variable models,” Journal of Machine Learning Research, vol. 15, pp. 2773–2832, 2014.
  • [44]. Donoho DL, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
  • [45]. Chandrasekaran V, Sanghavi S, Parrilo PA, and Willsky AS, “Rank-sparsity incoherence for matrix decomposition,” SIAM Journal on Optimization, vol. 21, no. 2, pp. 572–596, 2011.
  • [46]. Arora S, Ge R, and Moitra A, “New algorithms for learning incoherent and overcomplete dictionaries,” in Proceedings of The 27th Conference on Learning Theory (Balcan MF, Feldman V, and Szepesvári C, eds.), vol. 35 of Proceedings of Machine Learning Research, (Barcelona, Spain), pp. 779–806, PMLR, 13–15 June 2014.
  • [47]. Zhang Y, Chen X, Zhou D, and Jordan MI, “Spectral methods meet em: A provably optimal algorithm for crowd-sourcing,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 3537–3580, 2016.
  • [48]. Cai TT, Li X, and Ma Z, “Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow,” The Annals of Statistics, vol. 44, no. 5, pp. 2221–2251, 2016.
  • [49]. Janzamin M, Sedghi H, and Anandkumar A, “Score function features for discriminative learning: matrix and tensor framework,” arXiv preprint arXiv:1412.2863, 2014.
  • [50]. Anandkumar A, Ge R, and Janzamin M, “Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates,” arXiv preprint arXiv:1402.5180, 2014.
  • [51]. Cai C, Li G, Poor HV, and Chen Y, “Nonconvex low-rank symmetric tensor completion from noisy data,” arXiv preprint arXiv:1911.04436, 2019.
  • [52]. Candès EJ, Li X, and Soltanolkotabi M, “Phase retrieval via wirtinger flow: Theory and algorithms,” IEEE Transactions on Information Theory, vol. 61, pp. 1985–2007, April 2015.
  • [53]. Hung H, Lin Y-T, Chen P, Wang C-C, Huang S-Y, and Tzeng J-Y, “Detection of gene–gene interactions using multistage sparse and low-rank regression,” Biometrics, vol. 72, no. 1, pp. 85–94, 2016.
  • [54]. Sidiropoulos ND and Kyrillidis A, “Multi-way compressed sensing for sparse low-rank tensors,” IEEE Signal Processing Letters, vol. 19, no. 11, pp. 757–760, 2012.
  • [55]. Bubeck S, Foundations and Trends in Machine Learning, ch. Convex Optimization: Algorithms and Complexity, pp. 231–357. 2015.
  • [56]. Stein C, Diaconis P, Holmes S, Reinert G, et al., “Use of exchangeable pairs in the analysis of simulations,” in Stein’s Method, pp. 1–25, Institute of Mathematical Statistics, 2004.
  • [57]. Hitczenko P, Montgomery-Smith S, and Oleszkiewicz K, “Moment inequalities for sums of certain independent symmetric random variables,” Studia Math, vol. 123, no. 1, pp. 15–42, 1997.
  • [58]. Talagrand M, “The supremum of some canonical processes,” American Journal of Mathematics, vol. 116, no. 2, pp. 283–325, 1994.
  • [59]. Vershynin R, Compressed Sensing, ch. Introduction to the non-asymptotic analysis of random matrices, pp. 210–268. Cambridge Univ. Press, 2012.
  • [60]. Yu B, “Assouad, Fano, and Le Cam,” Festschrift for Lucien Le Cam, vol. 423, p. 435, 1997.
  • [61]. Ledoux M and Talagrand M, Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.
  • [62]. Tu S, Boczar R, Simchowitz M, Soltanolkotabi M, and Recht B, “Low-rank solutions of linear matrix equations via procrustes flow,” in Proceedings of The 33rd International Conference on Machine Learning (Balcan MF and Weinberger KQ, eds.), vol. 48 of Proceedings of Machine Learning Research, (New York, New York, USA), pp. 964–973, PMLR, 20–22 June 2016.
  • [63]. Horn RA and Johnson CR, Matrix Analysis. New York: Cambridge Univ. Press, 1988.
  • [64]. Ledoux M, The Concentration of Measure Phenomenon. No. 89, American Mathematical Soc., 2005.
  • [65]. Nguyen NH, Drineas P, and Tran TD, “Tensor sparsification via a bound on the spectral norm of random tensors,” Information and Inference: A Journal of the IMA, vol. 4, no. 3, pp. 195–229, 2015.
  • [66]. Bogucki R, “Suprema of canonical weibull processes,” Statistics & Probability Letters, vol. 107, pp. 253–263, 2015.
  • [67]. De la Peña V and Giné E, Decoupling: From Dependence to Independence. Springer Science & Business Media, 2012.
