Abstract
In this paper, we propose a general framework for sparse and low-rank tensor estimation from cubic sketchings. A two-stage non-convex implementation is developed based on sparse tensor decomposition and thresholded gradient descent, which ensures exact recovery in the noiseless case and stable recovery in the noisy case with high probability. The non-asymptotic analysis sheds light on an interplay between optimization error and statistical error. The proposed procedure is shown to be rate-optimal under certain conditions. As a technical by-product, novel high-order concentration inequalities are derived for studying high-moment sub-Gaussian tensors. An interesting tensor formulation illustrates the potential application to high-order interaction pursuit in high-dimensional linear regression.
I. Introduction
The rapid advance in modern scientific technology gives rise to a wide range of high-dimensional tensor data [1, 2]. Accurate estimation and fast communication/processing of tensor-valued parameters are crucially important in practice. For example, a tensor-valued predictor which characterizes the association between brain diseases and scientific measurements becomes the point of interest [3, 4, 5]. Another example is the tensor-valued image acquisition algorithm that can considerably reduce the number of required samples by exploiting the compressibility property of signals [6, 7].
The following tensor estimation model is widely considered in the recent literature,
yi = ⟨𝒜i, 𝒯⟩ + ϵi, i = 1, …, n. | (I.1) |
Here, 𝒜i and ϵi are the measurement tensor and the noise, respectively. The goal is to estimate the unknown tensor 𝒯 from the measurements {(yi, 𝒜i) : i = 1, …, n}. A number of specific settings with varying forms of 𝒜i have been studied, e.g., tensor completion [8, 9, 10, 11, 12, 13, 14, 15], tensor regression [5, 3, 4, 16, 17, 18, 19], multi-task learning [20], etc.
In this paper, we focus on the case where the measurement tensor can be written in a cubic sketching form, i.e., 𝒜i = xi ∘ xi ∘ xi or 𝒜i = ui ∘ vi ∘ wi, depending on whether 𝒯 is symmetric or not. The cubic sketching form of 𝒜i is motivated by a number of applications.
Interaction effect estimation: High-dimensional high-order interaction models have been considered under a variety of settings [21, 22, 23, 24]. By writing , we find that the interaction model has an interesting tensor representation (see left panel of Figure 1) which allows us to estimate high-order interaction terms using tensor techniques. This is in contrast with the existing literature that mostly focused on pair-wise interactions due to the model complexity and computational difficulties. More detailed discussions will be provided in Section V.
High-order imaging/video compression: High-order image/video compression is an important task in modern digital imaging with various applications (see the right panel of Figure 1), such as hyper-spectral imaging analysis [25] and facial image recognition [26]. One could use Gaussian ensembles for compression, with each entry of 𝒜i generated i.i.d. at random [3, 16, 17]. In contrast, the non-symmetric cubic sketchings, i.e., 𝒜i = ui ∘ vi ∘ wi, reduce the memory storage from O(np1p2p3) to O(n(p1 + p2 + p3)) (n is the sample size and (p1, p2, p3) is the tensor dimension) while still preserving the optimal statistical rate. More detailed discussions will be provided in Section VI.
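To make the storage comparison concrete, here is a minimal numpy sketch (variable names and sizes are ours, purely illustrative): it stores only the sketching vectors ui, vi, wi and evaluates yi = ⟨ui ∘ vi ∘ wi, 𝒯⟩ by contraction, so an n × p1 × p2 × p3 Gaussian ensemble is never materialized.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, p3, K = 100, 20, 30, 40, 2

# A low-rank target tensor T = sum_k eta_k * b1k o b2k o b3k (illustrative only).
eta = np.array([3.0, 1.5])
B1, B2, B3 = (rng.standard_normal((p, K)) for p in (p1, p2, p3))
T = np.einsum('k,ik,jk,lk->ijl', eta, B1, B2, B3)

# Cubic sketchings: only the sketching vectors are stored, O(n(p1+p2+p3)) numbers.
U = rng.standard_normal((n, p1))
V = rng.standard_normal((n, p2))
W = rng.standard_normal((n, p3))

# y_i = <u_i o v_i o w_i, T>, computed without forming any p1 x p2 x p3 measurement tensor.
y = np.einsum('ijl,ni,nj,nl->n', T, U, V, W)

print(y.shape)                                   # (100,)
print(n * (p1 + p2 + p3), n * p1 * p2 * p3)      # 9,000 stored sketching values vs 2,400,000
```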
In practice, the total number of measurements n is considerably smaller than the number of parameters in the unknown tensor 𝒯, due to restrictions such as time and storage. Fortunately, many high-dimensional tensor data possess intrinsic structures, such as low-rankness [2] and sparsity [27]. These structures can greatly reduce the effective dimension of the parameter and make accurate estimation possible. Please refer to (III.2) and (VI.2) for the low-rankness and sparsity assumptions.
Fig. 1. Illustration of the interaction reformulation and tensor image/video compression.
In this paper, we propose a computationally efficient non-convex optimization approach for sparse and low-rank tensor estimation via cubic-sketchings. Our procedure is two-stage:
obtain an initial estimate via the method of tensor moment (motivated by high-order Stein’s identity), and then apply sparse tensor decomposition to the initial estimate to output a warm start;
use a thresholded gradient descent to iteratively refine the warm start in each tensor mode until convergence.
Theoretically, we carefully characterize the optimization and statistical errors at each iteration. The output estimate is shown to converge at a geometric rate to an estimator whose statistical error (in terms of the tensor Frobenius norm) is minimax optimal. In particular, after a logarithmic number of iterations and under the sample complexity requirement of Condition 5, the proposed estimator achieves
‖𝒯̂ − 𝒯‖F2 ≤ C1 K s σ2 log p / n | (I.2) |
with high probability, where s, K, p, and σ2 are the sparsity, rank, dimension, and noise level, respectively. We further establish the matching minimax lower bound to show that (I.2) is indeed optimal over a large class of sparse low-rank tensors. Our optimality result can be further extended to the non-sparse case (such as tensor regression [3, 17, 28, 29]) – to the best of our knowledge, this is the first statistical rate optimality result in both sparse and non-sparse low-rank tensor regressions.
The above theoretical analysis is non-trivial due to the non-convexity of the empirical risk function and the need to develop new high-order sub-Gaussian concentration inequalities. Specifically, the empirical risk function under consideration satisfies neither the restricted strong convexity (RSC) condition nor the sparse eigenvalue (SE) condition in general. Thus, many previous results, such as those based on local optima analysis [30, 31, 17], are not directly applicable. Moreover, the structure of the cubic-sketching tensor leads to high-order products of sub-Gaussian random variables, so matrix analyses based on Hoeffding-type or Bernstein-type concentration inequalities [32, 33] lead to sub-optimal statistical rates and sample complexities. This motivates us to develop new high-order concentration inequalities and a sparse tensor-spectral-type bound, i.e., Lemmas 1 and 2 in Section IV-C. These new technical results are obtained through careful partial truncation of high-order products of sub-Gaussian random variables and a bounded ψα-norm argument [34], and may be of independent interest.
The literature on low-rank matrix estimation, e.g., the spectral method and nuclear norm minimization [35, 36, 37], is also related to this work. However, our cubic sketching model is by no means a simple extension of matrix estimation problems. In general, many related concepts or methods for matrix data, such as the singular value decomposition, are problematic to apply in the tensor framework [38, 39]. It has also been found that simply unfolding or matricizing tensors may lead to suboptimal results due to the loss of structural information [40]. Technically, the tensor nuclear norm is NP-hard even to approximate [9, 10, 41], and thus the methods for handling tensor low-rankness are distinct from the matrix case.
The rest of the paper is organized as follows. Section II provides preliminaries on notation and basic knowledge of tensor. A two-stage method for symmetric tensor estimation is proposed in Section III, with the corresponding theoretical analysis given in Section IV. A concrete application to high-order interaction effect models is described in Section V. The non-symmetric tensor estimation model is introduced and discussed in Section VI. Numerical analysis is provided in Section VII to support the proposed procedure and theoretical results of this paper. Section VIII discusses extensions to higher-order tensors. The proofs of technical results are given in supplementary materials.
II. Preliminary
Throughout the paper, vectors, matrices, and tensors are denoted by boldface lower-case letters (e.g., x, y), boldface upper-case letters (e.g., X, Y), and script letters (e.g., 𝒳, 𝒴), respectively. For any set A, let |A| be its cardinality. diag(x) denotes the diagonal matrix generated by x. For two vectors x and y, x ∘ y is their outer product. Define ‖x‖q ≔ (|x1|q + ⋯ + |xp|q)1/q. We also define the l0 quasi-norm ‖x‖0 = #{j : xj ≠ 0} and the l∞ norm ‖x‖∞ = max1≤j≤p |xj|. Denote the set {1,2,…,n} by [n]. Let ej be the j-th canonical vector, whose j-th entry equals 1 and all other entries equal zero. For any two sequences {an}, {bn}, we say an ≲ bn if there exist a positive constant C0 and a sufficiently large n0 such that |an| ≤ C0bn for all n ≥ n0. We also write an ≍ bn if there exist C, c > 0 such that can ≤ bn ≤ Can for all n ≥ 1. Additionally, C1, C2, …, c1, c2, … are generic constants, whose actual values may differ from line to line.
We next introduce notation and operations for matrices. For matrices A ∈ ℝI×J and B ∈ ℝK×L, their Kronecker product is the (IK)-by-(JL) matrix A ⊗ B = [a1 ⊗ B ⋯ aJ ⊗ B], where aj ⊗ B = (aj1B⊤,…,ajIB⊤)⊤. If A and B have the same number of columns, J = L, the Khatri-Rao product is defined as A ⊙ B = [a1 ⊗ b1 ⋯ aJ ⊗ bJ]. If A and B are of the same dimension, the Hadamard product A ∗ B is their element-wise matrix product, (A∗B)ij = Aij · Bij. For a matrix A, vec(A) denotes its vectorization, and column-wise norms are defined analogously to the vector norms above.
Finally, we introduce tensor notation and relevant operations; interested readers are referred to [2] for more details. Suppose 𝒯 ∈ ℝp1×p2×p3 is an order-3 tensor, whose (i,j,k)-th element is denoted by 𝒯ijk. The successive tensor-vector multiplication with vectors u ∈ ℝp1, v ∈ ℝp2, w ∈ ℝp3 is denoted by 𝒯 ×1 u ×2 v ×3 w = Σi,j,k 𝒯ijk ui vj wk. We say 𝒯 is rank-one if it can be written as the outer product of three vectors, i.e., 𝒯 = a ∘ b ∘ c, or 𝒯ijk = ai bj ck for all i, j, k. Here "∘" represents the vector outer product. We say 𝒯 is symmetric if 𝒯ijk is invariant under any permutation of the indices i, j, k. Then 𝒯 is rank-one and symmetric if and only if it can be decomposed as 𝒯 = x ∘ x ∘ x for some vector x.
More generally, we may decompose a tensor as the sum of rank one tensors as follows,
𝒯 = Σk=1K ηk β1k ∘ β2k ∘ β3k, | (II.1) |
where ηk > 0, β1k ∈ ℝp1, β2k ∈ ℝp2, β3k ∈ ℝp3, and ‖β1k‖2 = ‖β2k‖2 = ‖β3k‖2 = 1. This is the so-called CANDECOMP/PARAFAC, or CP, decomposition [2], with the CP-rank defined as the minimum number K such that (II.1) holds. The vectors {β1k}, {β2k}, {β3k} are called the factors along the first, second, and third modes, respectively. Note that the factors are normalized as unit vectors to guarantee the uniqueness of the decomposition, and η = {η1,…,ηK} plays a role analogous to the singular values in the matrix singular value decomposition. Several tensor norms also need to be introduced. The tensor Frobenius norm and tensor spectral norm are defined respectively as
‖𝒯‖F = (Σi,j,k 𝒯ijk2)1/2, ‖𝒯‖ = sup 𝒯 ×1 u ×2 v ×3 w, | (II.2) |
where the supremum is taken over unit vectors with ‖u‖2 = ‖v‖2 = ‖w‖2 = 1. Clearly, ‖𝒯‖ ≤ ‖𝒯‖F. We also consider the following sparse tensor spectral norm,
‖𝒯‖s = sup{𝒯 ×1 u ×2 v ×3 w : ‖u‖2 = ‖v‖2 = ‖w‖2 = 1, ‖u‖0, ‖v‖0, ‖w‖0 ≤ s}. | (II.3) |
By definition, ‖𝒯‖s ≤ ‖𝒯‖. Suppose 𝒜 = a1 ∘ a2 ∘ a3 and ℬ = b1 ∘ b2 ∘ b3 are two rank-one tensors. Then it is easy to check that ‖𝒜‖F = ‖a1‖2‖a2‖2‖a3‖2 and ⟨𝒜, ℬ⟩ = ⟨a1, b1⟩⟨a2, b2⟩⟨a3, b3⟩.
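As a quick numerical illustration of this notation (a sketch with arbitrary sizes, not code from the paper), the following builds a symmetric rank-K tensor from CP factors as in (II.1) and checks the rank-one identities stated above.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K = 8, 3

# Unit-norm CP factors and positive weights eta_1 >= ... >= eta_K.
B = rng.standard_normal((p, K))
B /= np.linalg.norm(B, axis=0)
eta = np.array([5.0, 2.0, 1.0])

# T = sum_k eta_k * beta_k o beta_k o beta_k (symmetric CP form).
T = np.einsum('k,ik,jk,lk->ijl', eta, B, B, B)
fro_norm = np.sqrt((T ** 2).sum())               # tensor Frobenius norm

# Rank-one identities: ||a1 o a2 o a3||_F = ||a1|| ||a2|| ||a3|| and
# <a1 o a2 o a3, b1 o b2 o b3> = <a1,b1><a2,b2><a3,b3>.
a1, a2, a3 = rng.standard_normal((3, p))
b1, b2, b3 = rng.standard_normal((3, p))
A = np.einsum('i,j,k->ijk', a1, a2, a3)
C = np.einsum('i,j,k->ijk', b1, b2, b3)
assert np.isclose(np.sqrt((A ** 2).sum()),
                  np.linalg.norm(a1) * np.linalg.norm(a2) * np.linalg.norm(a3))
assert np.isclose((A * C).sum(), (a1 @ b1) * (a2 @ b2) * (a3 @ b3))
```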
III. Symmetric Tensor Estimation Via Cubic Sketchings
In this section, we focus on the estimation of a sparse and low-rank symmetric tensor 𝒯 ∈ ℝp×p×p from cubic sketchings,
yi = ⟨xi ∘ xi ∘ xi, 𝒯⟩ + ϵi, i = 1, …, n, | (III.1) |
where xi ∈ ℝp are random vectors with i.i.d. standard normal entries. As previously discussed, the tensor parameter 𝒯 often satisfies certain low-dimensional structures in practice, among which factor-wise sparsity and low-rankness [16] commonly appear. We thus assume 𝒯 is of CP rank K for K ≪ p and the corresponding factors are sparse,
𝒯 = Σk=1K ηk βk ∘ βk ∘ βk, ‖βk‖2 = 1, ‖βk‖0 ≤ s, k ∈ [K]. | (III.2) |
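A minimal simulation of model (III.1) with the sparse factors in (III.2) (a sketch with made-up sizes, in the spirit of the experiments in Section VII) is given below; it also verifies that the two equivalent forms of the noiseless response, ⟨xi ∘ xi ∘ xi, 𝒯⟩ and Σk ηk (xi⊤βk)3, agree.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K, s, sigma = 2000, 30, 3, 9, 0.5

# Sparse, unit-norm factors beta_k and positive weights eta_k, as in (III.2).
B = np.zeros((p, K))
for k in range(K):
    support = rng.choice(p, size=s, replace=False)
    B[support, k] = rng.standard_normal(s)
B /= np.linalg.norm(B, axis=0)
eta = np.abs(rng.standard_normal(K)) + 1.0

T = np.einsum('k,ik,jk,lk->ijl', eta, B, B, B)

# Symmetric cubic sketchings: y_i = <x_i o x_i o x_i, T> + eps_i.
X = rng.standard_normal((n, p))
y = ((X @ B) ** 3) @ eta + sigma * rng.standard_normal(n)

# Sanity check: <x o x o x, T> = sum_k eta_k (x' beta_k)^3.
y_noiseless = np.einsum('ijl,ni,nj,nl->n', T, X, X, X)
assert np.allclose(y_noiseless, ((X @ B) ** 3) @ eta)
```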
The CP low-rankness has been widely assumed in the literature for its nice scalability and simple formulation [5, 25, 18]. Different from matrix factor analysis, we do not assume that the tensor factors are orthogonal. On the other hand, since low-rank tensor estimation is NP-hard in general [42], we introduce an incoherence condition in the forthcoming Condition 3 to ensure that the correlation among different factors is not too strong. Such a condition has been used in recent literature on tensor data analysis [43], compressed sensing [44], matrix decomposition [45], and dictionary learning [46].
Based on the observations {(yi, xi) : i ∈ [n]}, we propose to estimate 𝒯 by minimizing the empirical squared loss, whose closed-form gradient provides computational convenience,
𝒯̂ ∈ arg min𝒯 Ln(𝒯), | (III.3) |
where
Ln(𝒯) = (2n)−1 Σi=1n (yi − ⟨xi ∘ xi ∘ xi, 𝒯⟩)2. | (III.4) |
Under the factorization (III.2), (III.3) can equivalently be written as
minη,β1,…,βK (2n)−1 Σi=1n (yi − Σk=1K ηk (xi⊤βk)3)2. | (III.5) |
Clearly, (III.5) is a non-convex optimization problem. To solve it, we propose a two-stage method as described in the next two subsections.
A. Initialization
Due to the non-convexity of (III.5), a straightforward implementation of many local search algorithms, such as gradient descent and alternating minimization, may easily get trapped in local optima and result in sub-optimal statistical performance. Inspired by recent advances in spectral methods (e.g., for the EM algorithm [47], phase retrieval [48], and tensor SVD [39]), we propose to compute an initial estimate via the method of moments and sparse tensor decomposition (a variant of the high-order spectral method) in Steps 1 and 2 below, respectively. The pseudo-code is given in Algorithm 1.
Step 1: Unbiased Empirical Moment Estimator.
Construct the empirical moment-based estimator ,
| (III.6) |
where ej denotes the j-th canonical basis vector.
Based on Lemma 4, the estimator in (III.6) is unbiased for 𝒯. Its construction is motivated by the high-order Stein's identity ([49]; see also Theorem 7 for a complete statement). Intuitively speaking, based on the third-order score function of a Gaussian random vector, we can construct an unbiased estimator of 𝒯 by properly choosing a continuously differentiable function in the high-order Stein's identity. See the proof of Lemma 4 for details.
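Since the displayed formula (III.6) is not reproduced above, the sketch below illustrates the construction with the ingredients we can state confidently: for a standard Gaussian x, the third-order score is S3(x) = x ∘ x ∘ x − Σj (x ∘ ej ∘ ej + ej ∘ x ∘ ej + ej ∘ ej ∘ x), and the third-order Stein identity gives E[yi S3(xi)] = 6𝒯 under model (III.1), so averaging yi S3(xi)/6 yields an unbiased estimate of 𝒯 (the paper's (III.6) may arrange the same correction differently, e.g. through the empirical first-order moment as in Lemma 4).

```python
import numpy as np

def score3(x):
    """Third-order score S3(x) of a standard Gaussian vector x (the order-3 Hermite tensor)."""
    p = x.size
    I = np.eye(p)
    xxx = np.einsum('i,j,k->ijk', x, x, x)
    correction = (np.einsum('i,jk->ijk', x, I)
                  + np.einsum('j,ik->ijk', x, I)
                  + np.einsum('k,ij->ijk', x, I))
    return xxx - correction

rng = np.random.default_rng(3)
n, p = 20000, 6
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)
T = np.einsum('i,j,k->ijk', beta, beta, beta)        # rank-1 symmetric target

X = rng.standard_normal((n, p))
y = (X @ beta) ** 3 + 0.1 * rng.standard_normal(n)   # model (III.1) with K = 1

# Method-of-moments estimate: E[y * S3(x)] = 6 * T, so divide the average by 6.
T_tilde = sum(y[i] * score3(X[i]) for i in range(n)) / (6 * n)
print(np.linalg.norm(T_tilde - T))   # the error shrinks as n grows (unbiased estimator)
```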
Step 2: Sparse Tensor Decomposition.
Based on the moment estimator obtained in Step 1, we further obtain a good initialization for the factors via truncation and alternating rank-1 power iterations [27, 50].
Note that the tensor power iterations recover one rank-1 component at a time. To identify all rank-1 components, we generate a large number of different initialization vectors, implement a clustering step, and take the cluster centroids as the initialization-stage estimates. This scheme originally appeared in the tensor decomposition literature [43, 50], although our problem setting and proof techniques are very different. The procedure also differs from the matrix setting: the rank-1 components in the singular value decomposition are mutually orthogonal, whereas we do not enforce exact orthogonality on the βk here.
More specifically, we first choose a large integer M ≫ K and generate M starting vectors through sparse SVD as described in Algorithm 3. Then, for each starting vector, we apply the following truncated power updates for l = 0, 1, …,
where ×2, ×3 are the tensor multiplication operators defined in Section II and Td(x) is a truncation operator that sets all but the d largest entries (in absolute value) of a vector x to zero. It is noteworthy that the symmetry of the moment estimator implies
so the multiplications along different modes coincide. We run the power iterations until convergence and denote the outcome by bm. Finally, we apply K-means to partition {b1, …, bM} into K clusters, take the centroids of the output clusters as the initial factor estimates, and compute the corresponding initial weights.
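For concreteness, here is a sketch of the truncated rank-1 power update with hypothetical helper names (the M random restarts from Algorithm 3 and the K-means clustering step are omitted): Td keeps the d largest-magnitude entries, and each iteration multiplies the tensor along two modes by the current iterate, truncates, and renormalizes.

```python
import numpy as np

def truncate(x, d):
    """T_d(x): keep the d largest entries of x in absolute value, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-d:]
    out[idx] = x[idx]
    return out

def truncated_power_iteration(T, b0, d, n_iter=50):
    """Rank-1 truncated power updates b <- T_d(T x2 b x3 b) / ||.||_2 on a symmetric tensor T."""
    b = b0 / np.linalg.norm(b0)
    for _ in range(n_iter):
        v = np.einsum('ijk,j,k->i', T, b, b)   # multiply along modes 2 and 3
        v = truncate(v, d)
        b = v / np.linalg.norm(v)
    return b

# Toy check on a noiseless sparse rank-1 symmetric tensor.
rng = np.random.default_rng(4)
p, d = 30, 9
beta = np.zeros(p)
beta[:d] = rng.standard_normal(d)
beta /= np.linalg.norm(beta)
T = np.einsum('i,j,k->ijk', beta, beta, beta)
b_hat = truncated_power_iteration(T, rng.standard_normal(p), d)
print(abs(b_hat @ beta))   # close to 1 (recovery up to sign)
```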
B. Thresholded Gradient Descent
After obtaining a warm start in the first stage, we apply thresholded gradient descent to iteratively refine the solution of the non-convex optimization problem (III.5). Specifically, absorb each weight ηk into its factor by rescaling βk by the cube root of ηk, collect the rescaled factors as the columns of B ∈ ℝp×K, and write X = (x1, …, xn) and y = (y1, …, yn)⊤. We let ∇Ln(B) denote the gradient of the empirical loss with respect to B. Based on the detailed calculation in Lemma A.1, ∇Ln(B) can be written as
| (III.7) |
where {(B⊤X)⊤}3 and {(B⊤X)⊤}2 are entry-wise cubic and squared matrices of (B⊤X)⊤. Define φh(x) as the thresholding function with a level h that satisfies the following minimal assumptions:
φh(x) = 0 for all |x| ≤ h, and |φh(x) − x| ≤ h for all x ∈ ℝ. | (III.8) |
Many widely used thresholding schemes, such as hard thresholding Hh(x) = x·I(|x| > h) and soft thresholding Sh(x) = sign(x)·max(|x| − h, 0), satisfy (III.8). With a slight abuse of notation, we define the vector thresholding function φh(x) by applying φh to each entry of x.
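For reference, the two thresholding rules mentioned above can be written as follows (a straightforward sketch; in Algorithm 2 the scalar rule is applied entry-wise to the matrix iterate).

```python
import numpy as np

def hard_threshold(x, h):
    """H_h(x) = x * 1{|x| > h}."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > h, x, 0.0)

def soft_threshold(x, h):
    """S_h(x) = sign(x) * max(|x| - h, 0)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - h, 0.0)

x = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
print(hard_threshold(x, 0.5))   # -> [-2, 0, 0, 0.8, 3]
print(soft_threshold(x, 0.5))   # -> [-1.5, 0, 0, 0.3, 2.5]
```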
The initial estimates η(0) and B(0) are updated by thresholded gradient descent in two steps, summarized in Algorithm 2. Note that only B is updated in Step 3, while η is updated in Step 4 after the updates of B have finished.
Step 3: Updating B via Thresholded Gradient descent.
We update B(t) via thresholded gradient descent,
| (III.9) |
Here,
μ is the step size and serves as an approximation for (see Lemma 15);
- is the thresholding level defined as
Step 4: Updating η via Normalization.
We normalize each column of B(T) and estimate the weight parameter as
| (III.10) |
The final estimator of 𝒯 is 𝒯̂ = Σk=1K η̂k β̂k ∘ β̂k ∘ β̂k.
Remark 1 (Stochastic Thresholded Gradient Descent). Evaluating the gradient (III.7) requires a pass over all n measurements at each iteration and can be computationally intensive for large n or p. To economize the computational cost, a stochastic version of the thresholded gradient descent algorithm can easily be carried out by sampling a subset of the summand functions in (III.7) at each iteration. This accelerates the procedure, especially in large-scale settings. See Section P2 for details.
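The sketch below shows one (mini-batch) thresholded gradient step. It uses the gradient of the squared loss (2n)−1 Σi (yi − Σk (bk⊤xi)3)2 with respect to a factor matrix B whose k-th column absorbs the weight ηk into βk, derived directly from that loss (our normalization; it matches the structure described after (III.7) up to scaling), with soft thresholding as the map φh. Sampling a mini-batch of summands implements the stochastic variant of Remark 1; taking the batch equal to the full sample recovers the update of Step 3.

```python
import numpy as np

def soft_threshold(M, h):
    return np.sign(M) * np.maximum(np.abs(M) - h, 0.0)

def loss_grad(B, X, y):
    """Gradient of (2n)^-1 * sum_i (y_i - sum_k (b_k' x_i)^3)^2 with respect to B (p x K)."""
    n = X.shape[0]
    Z = X @ B                                   # Z[i, k] = b_k' x_i
    r = y - (Z ** 3).sum(axis=1)                # residuals
    # Column k of the gradient: -(3/n) * sum_i r_i * (b_k' x_i)^2 * x_i.
    return -(3.0 / n) * X.T @ (r[:, None] * Z ** 2)

def thresholded_sgd_step(B, X, y, mu, h, batch, rng):
    """One stochastic thresholded gradient update on a random mini-batch (Remark 1)."""
    idx = rng.choice(X.shape[0], size=batch, replace=False)
    G = loss_grad(B, X[idx], y[idx])
    return soft_threshold(B - mu * G, h)
```

Here X stacks the sketching vectors as rows; the step size mu and threshold h stand in for the schedule of Algorithm 2, which is not reproduced here.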
IV. Theoretical Analysis
In this section, we establish the geometric convergence rate in optimization error and minimax optimal rate in statistical error of the proposed symmetric tensor estimator.
A. Assumptions
We first introduce the assumptions for theoretical analysis. Conditions 1–3 are on the true tensor parameter and Conditions 4–5 are on the measurement scheme. Specifically, the first condition ensures the model identifiability for CP-decomposition.
Condition 1 (Uniqueness of CP-decomposition). The CP-decomposition in (III.2) is unique in the sense that if there exists another CP-decomposition of 𝒯, it must have K = K′ components and coincide with (III.2) up to a permutation of {1,…,K}.
For technical purposes, we introduce the following conditions to regularize the CP-decomposition of . Similar assumptions were imposed in recent tensor literature, e.g., [3, 27] and Assumption 1.1 (A4) [51].
Condition 2 (Parameter space). The CP-decomposition satisfies
| (IV.1) |
for some absolute constants C, C′, where ηmax ≔ maxk∈[K] ηk and ηmin ≔ mink∈[K] ηk. Recall that s is the sparsity of the factors βk.
Remark 2. In Condition 2, R plays a role similar to a "condition number." This assumption means that the tensor 𝒯 is "well-conditioned," i.e., each rank-1 component is of roughly the same size.
As shown in the seminal work of [42], the estimation of low-rank tensors can be NP-hard in general. Hence, we impose the following incoherence condition.
Condition 3 (Parameter incoherence). The true tensor components are incoherent such that
where R is the singular value ratio defined in (IV.1) and C″ is some small constant.
Remark 3. The preceding incoherence condition has been widely used in various scenarios in recent high-dimensional research, such as tensor decomposition [27, 50], compressed sensing [44], matrix decomposition [45], and dictionary learning [46]. It can also be viewed as a relaxation of orthogonality: if the βk are mutually orthogonal, Γ equals zero. We show both theoretically (Lemma 28 in the supplementary materials) and by simulation (Section VII) that the low-rank tensor induced by (III.2) satisfies the incoherence condition with high probability if the component vectors are randomly generated, say from a Gaussian distribution.
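Since the displayed formula of Condition 3 is not reproduced above, the sketch below takes Γ to be the largest absolute inner product between distinct factors, which is our reading of the condition and is consistent with Remark 3 (Γ = 0 for orthogonal factors); it then checks empirically that randomly generated sparse unit factors are nearly incoherent, as in the left panel of Figure 5.

```python
import numpy as np

def incoherence(B):
    """Gamma = max_{j != k} |<beta_j, beta_k>| for unit-norm columns of B."""
    G = np.abs(B.T @ B)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(5)
p, K, s = 200, 3, 60
B = np.zeros((p, K))
for k in range(K):
    supp = rng.choice(p, size=s, replace=False)
    B[supp, k] = rng.standard_normal(s)
B /= np.linalg.norm(B, axis=0)

print(incoherence(B))   # small for random sparse factors; see the simulation in Section VII
```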
We also introduce the following conditions on noise distribution.
Condition 4 (Sub-exponential noise). The noise terms ϵ1, …, ϵn are i.i.d. with mean 0 and variance σ2; each ϵi is sub-exponentially distributed, i.e., there exists a constant Cϵ > 0 such that ‖ϵi‖ψ1 ≤ Cϵ, and ϵi is independent of xi.
The sample complexity condition is crucial for our algorithm especially in the initialization stage. Ignoring any polylog factors, Condition 5 is even weaker than the sparse matrix estimation case (n ≳ s2) in [48].
Condition 5 (Sample complexity). The sample size satisfies n ≥ C(log n)3(s log p)3/2.
B. Main Theoretical Results
Our main Theorem 1 shows that based on a proper initializer, the output of the proposed procedure can achieve optimal estimation error rate after a sufficient number of iterations. Here, we define the contraction parameter
and also denote and for some C0 > 0.
Theorem 1 (Statistical and Optimization Errors). Suppose Conditions 3–5 hold, , and the initial estimator satisfy
| (IV.2) |
with probability at least . Assume the step size μ ≤ μ0, where μ0 is defined in (A.14). Then, the output of the thresholded gradient descent update in (III.9) satisfies:
- For any t = 0, 1, 2, …, the factor-wise estimator satisfies
| (IV.3) |
with probability at least .
- When the total number of iterations is no smaller than
| (IV.4) |
there exists a constant C1 (independent of K, s, p, n, σ2) such that the final estimator satisfies
| (IV.5) |
with probability at least .
Remark 4. The error bound (IV.3) can be decomposed into an optimization error (which decays at a geometric rate over iterations) and a statistical error (which does not decay over iterations). In the special case σ = 0, the proposed procedure exactly recovers 𝒯 with high probability.
The next theorem shows that Steps 1 and 2 of Algorithm 1 provide a good initializer as required in Theorem 1.
Theorem 2 (Initialization Error). Recall . Suppose the number of initializations , where γ is a constant defined in (A.11). Given that Conditions 1–4 hold, the initial estimator obtained from Steps 1–2 with a truncation level s ≤ d ≤ Cs satisfies
| (IV.6) |
and
with probability at least 1 − 5/n, where
| (IV.7) |
Moreover, if the sample complexity condition 5 holds, then the above bound satisfies (IV.2).
Remark 5 (Interpretation of initialization error). The upper bound in (IV.6) consists of two terms, which correspond to the approximation error of the moment estimator and the incoherence among the βk's, respectively. In particular, the former converges to zero as n grows, while the latter does not.
The proofs of Theorems 1 and 2 are postponed to Sections C–D in the supplementary materials. Combining Theorems 1 and 2 immediately yields the following upper bound for the final estimator, which is one of the main results of this paper.
Theorem 3 (Upper Bound). Suppose Conditions 1 – 5 hold, s ≤ d ≤ Cs. After T* iterations, there exists a constant C1 not depending on K,s,p,n,σ2, such that the proposed procedure yields
| (IV.8) |
with probability at least , where T* is defined in (IV.4).
The above upper bound turns out to match the minimax lower bound for a large class of sparse and low-rank tensors.
Theorem 4 (Lower Bound). Consider the following class of sparse and low-rank tensors,
| (IV.9) |
Suppose that the sketching vectors are i.i.d. standard normal cubic sketchings with i.i.d. N(0, σ2) noise in (III.1), p ≥ 20s, and s ≥ 4. We have the following lower bound result,
The proof of Theorem 4 is deferred to Section E in the supplementary materials. Combining Theorems 3 and 4, we immediately obtain the following minimax-optimal rate for sparse and low-rank tensor estimation with cubic sketchings when logp ≍ log(p/s):
| (IV.10) |
The rate in (IV.10) sheds light upon the effect of dimension p, noise level σ2, sparsity s, sample size n and rank K to the estimation performance.
Remark 6. Recently, Li, Haupt, and Woodruff [29] studied optimal sketching for low-rank tensor regression and gave a near-optimal sketching complexity with a sharp (1 + ε) worst-case error bound. Different from the framework of [29], which focuses on a deterministic setting, we study a probabilistic model with random observation noise, propose a new algorithm, and establish the minimax optimal rate for the estimation error. In addition, [5, 16, 17] considered different types of convex/non-convex algorithms for low-rank tensor regression under statistical assumptions. To the best of our knowledge, we are the first to achieve an optimal estimation error rate with a polynomial-time algorithm for the tensor regression problem.
Remark 7 (Non-sparse low-rank tensor estimation via cubic-sketchings). When the low-rank tensor is not necessarily sparse, i.e.,
we can apply the proposed procedure with all the truncation/thresholding steps removed. If , we can use similar arguments of Theorems 1–3 to show that the estimator satisfies
| (IV.11) |
for any with high probability. Furthermore, similar arguments of Theorem 4 imply that the rate in (IV.11) is minimax optimal.
Remark 8 (Comparison with existing matrix results). Our cubic sketching tensor results are far more than extensions of existing matrix ones. For example, [32, 33] studied low-rank matrix recovery via rank-one projections and proposed convex nuclear norm minimization methods. The theoretical properties of their estimators are analyzed under a Restricted Isometry Property (RIP) or Restricted Uniform Boundedness (RUB) condition. However, the tensor nuclear norm is computationally infeasible, and one can check that our cubic sketching framework does not satisfy the RIP or RUB conditions in general, following the arguments in [48, 52]. Thus, these previous results cannot be directly applied.
In addition, the analysis of the gradient updates in the tensor case is significantly more complicated than in the matrix case. First, it requires high-order concentration inequalities, since the cubic-sketching tensor leads to high-order products of sub-Gaussian random variables (see Section IV-C for details). The necessity of high-order expansions in the analysis of the gradient updates also significantly increases the hardness of the problem. To ensure geometric convergence, we need a much more subtle analysis compared with the matrix case [52].
C. Key Lemmas: High-order Concentration Inequalities
As mentioned earlier, one major challenge in the theoretical analysis of cubic sketching is to handle the heavy tails of high-order Gaussian moments: directly applying Hoeffding's or Bernstein's concentration inequalities can only handle up to second moments of sub-Gaussian random variables. Therefore, we develop the following high-order concentration inequalities as technical tools: Lemma 1 characterizes tail bounds for sums of sub-Gaussian products, and Lemma 2 provides concentration inequalities for Gaussian cubic sketchings. The proofs of Lemmas 1 and 2 are given in Section B.
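A quick Monte Carlo comparison (illustrative only) shows why the products arising from cubic sketchings fall outside the sub-Gaussian/sub-exponential regime: the tails of a product of three independent standard Gaussians are much heavier than those of a single Gaussian, which is exactly the ψ2/3-type behavior that Lemma 1 controls after summation.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
g1, g2, g3 = rng.standard_normal((3, n))

single = np.abs(g1)
product = np.abs(g1 * g2 * g3)

for t in (3.0, 5.0, 8.0):
    print(t, (single > t).mean(), (product > t).mean())
# The product exceeds large thresholds far more often than a single Gaussian,
# i.e., its tails are heavier than sub-Gaussian (bounded psi_{2/3}-norm rather than psi_2).
```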
Lemma 1 (Concentration inequality for sums of sub-Gaussian products). Suppose X1, …, Xn are n i.i.d. random matrices and that xij, the j-th row of Xi, is an isotropic sub-Gaussian vector, i.e., 𝔼xij = 0 and Cov(xij) = I. Then for any fixed vectors and any 0 < δ < 1, we have
with probability at least 1 – δ for some constant C.
Note that in Lemma 1, each Xi need not have independent entries, even though X1, …, Xn are independent matrices. Building on Lemma 1, Lemma 2 provides a generic spectral-type concentration inequality that can be used to quantify the approximation error of the moment estimator introduced in Step 1 of the proposed procedure.
Lemma 2 (Concentration inequality for Gaussian cubic sketchings). Suppose , , , , , are fixed vectors.
- Define . Then and
with probability at least 1 − 10/n3 − 1/p. - Define . Then and
with probability at least 1 − 10/n3 − 1/p.
Here, C is an absolute constant and ‖ · ‖s is the sparse tensor spectral norm defined in (II.3).
V. Application To High-Order Interaction Effect Models
In this section, we study the high-order interaction effect model in the cubic sketching framework. Specifically, we consider the following three-way interaction model
yl = μ + Σj ξj zlj + Σj≤k γjk zlj zlk + Σj≤k≤m ηjkm zlj zlk zlm + ϵl | (V.1) |
for l = 1, …, n. Here ξ, γ, and η are the coefficients of the main effects, pairwise interactions, and triple-wise interactions, respectively. More importantly, (V.1) can be reformulated in the following tensor form (see also the left panel of Figure 1),
| (V.2) |
where xl = (1, zl⊤)⊤ and 𝒯 is a tensor parameter corresponding to the coefficients in the following way:
| (V.3) |
We provide the following justification for assuming that the tensorized coefficient 𝒯 is low-rank and sparse. First, in modern applications such as biomedical research [53], the response is often driven by a small portion of the coefficients and a small number of factors, leading to a highly entry-wise sparse and low-rank 𝒯. Second, [54] suggested that it is suitable to model entry-wise sparse tensors of low enough rank as arising from sparse loadings. Therefore, we assume 𝒯 is of CP rank K with s-sparse factors βk ∈ ℝp+1,
where K, s ≪ p. Then the number of parameters in (V.4), K(p + 1), is significantly smaller than (p + 1)3, the total number of parameters in the original three-way interaction effect model (V.1), which makes consistent estimation of 𝒯 possible in the high-dimensional case. In this case, (V.2) can be written as
yl = Σk=1K ηk (βk⊤xl)3 + ϵl, | (V.4) |
where l ∈ [n], ‖βk‖2 = 1, ‖βk‖0 ≤ s,k ∈ [K].
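The reformulation can be sanity-checked numerically: with xl = (1, zl⊤)⊤, the cubic form ηk(βk⊤xl)3 expands into an intercept, main effects, pairwise interactions, and triple-wise interactions of zl. The sketch below verifies this expansion for a single rank-1 component (the explicit coefficient mapping (V.3) is not reproduced here, so only the polynomial expansion is checked).

```python
import numpy as np

rng = np.random.default_rng(7)
p = 10
beta = rng.standard_normal(p + 1)
beta /= np.linalg.norm(beta)            # unit-norm factor; the first entry multiplies the intercept
eta = 2.0

z = rng.standard_normal(p)
x = np.concatenate(([1.0], z))          # x_l = (1, z_l')'

# Cubic-sketching form of the response (one rank-1 component, no noise).
y_tensor = eta * (x @ beta) ** 3

# The same response written out by expanding the cube: intercept, main, pairwise, triple-wise.
b0, b = beta[0], beta[1:]
u = z @ b
y_interaction = eta * (b0 ** 3 + 3 * b0 ** 2 * u + 3 * b0 * u ** 2 + u ** 3)

assert np.isclose(y_tensor, y_interaction)
```

The terms 3·b0²·u, 3·b0·u², and u³ collect the main-effect, pairwise, and triple-wise contributions of z, respectively.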
By assuming that z1, …, zn are i.i.d. standard Gaussian vectors, the high-order interaction effect model (V.2) reduces to the symmetric tensor estimation model (III.1), except for one slight difference: the first coordinate of xl, i.e., the intercept, is always 1. To accommodate this difference, we only need to adjust the initial unbiased estimate in the two-step procedure above. Let
| (V.5) |
where . Then we construct the empirical moment-based initial tensor Ts′ as
For i,j,k ≠ 0, , , , and .
For i ≠ 0, .
.
Lemma 5 shows that 𝒯s′ is an unbiased estimator of 𝒯.
The theoretical results in Section IV imply the following upper and lower bounds for the three-way interaction effect estimation.
Corollary 1. Suppose z1, …, zn are i.i.d. standard Gaussian random vectors and 𝒯 satisfies Conditions 1, 2, and 3. The output 𝒯̂ of the proposed Algorithms 1 and 2, applied with the initializer 𝒯s′, satisfies
| (V.6) |
with high probability. On the other hand, consider the following class of tensorized interaction coefficients,
Then the following lower bound holds,
VI. Non-Symmetric Tensor Estimation Model
In this section, we extend the previous results to the non-symmetric tensor case. Specifically, suppose 𝒯 ∈ ℝp1×p2×p3 and
yi = ⟨ui ∘ vi ∘ wi, 𝒯⟩ + ϵi, i = 1, …, n, | (VI.1) |
where ui ∈ ℝp1, vi ∈ ℝp2, wi ∈ ℝp3 are random vectors with i.i.d. standard normal entries. Again, we assume 𝒯 is sparse and low-rank, in the sense that
𝒯 = Σk=1K ηk β1k ∘ β2k ∘ β3k, ‖βjk‖2 = 1, ‖βjk‖0 ≤ s, j = 1, 2, 3, k ∈ [K]. | (VI.2) |
Denote
B1 = (β11, ⋯, β1K), B2 = (β21, ⋯, β2K), B3 = (β31, ⋯, β3K),
U = (u1,…,un), V = (v1,…,vn), W = (w1,…,wn), η = (η1,…,ηK)⊤, y = (y1,…,yn)⊤.
Then, the empirical risk function can be written compactly as
| (VI.3) |
Since (VI.3) is non-convex but fortunately tri-convex in terms of B1, B2, and B3, we develop a block-wise thresholded gradient descent algorithm as detailed below. The complete algorithm is deferred to Section O1 in the supplementary materials.
Step 1: (Method of Tensor Moments)
Construct the empirical moment-based estimator
| (VI.4) |
to which sparse tensor decomposition is applied for initialization.
Step 2: (Block-wise Gradient Descent)
Lemma 17 shows that the gradient function for (VI.3) with respect to B1 can be written as
| (VI.5) |
where and . For t = 1, …, T, we fix , and update via block-wise thresholded gradient descent,
where , μ is the step size, and . The updates of B2,B3 are similar.
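Since the display (VI.5) is not reproduced above, the sketch below writes the mode-1 gradient for the squared loss (2n)−1 Σi (yi − Σk (ui⊤b1k)(vi⊤b2k)(wi⊤b3k))2 derived directly from that loss (our normalization; the exact scaling in Lemma 17 may differ), followed by one thresholded block update on B1 with B2 and B3 held fixed. The updates of the other two blocks are symmetric in the roles of (B1, B2, B3).

```python
import numpy as np

def grad_B1(B1, B2, B3, U, V, W, y):
    """Gradient of (2n)^-1 sum_i (y_i - sum_k (u_i'b1k)(v_i'b2k)(w_i'b3k))^2 w.r.t. B1."""
    n = U.shape[0]
    A, C, D = U @ B1, V @ B2, W @ B3            # n x K factor projections
    r = y - (A * C * D).sum(axis=1)             # residuals
    # Column k of the gradient: -(1/n) * sum_i r_i * (v_i'b2k)(w_i'b3k) * u_i.
    return -(1.0 / n) * U.T @ (r[:, None] * C * D)

def block_update_B1(B1, B2, B3, U, V, W, y, mu, h):
    """One thresholded gradient step on B1 with B2, B3 held fixed."""
    Bnew = B1 - mu * grad_B1(B1, B2, B3, U, V, W, y)
    return np.sign(Bnew) * np.maximum(np.abs(Bnew) - h, 0.0)   # soft threshold
```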
The theoretical analysis for the non-symmetric case differs from the symmetric one in two respects. First, the non-symmetric cubic sketching tensor is formed by three Gaussian vectors rather than one, which leads to many differences in the calculation of high-order moments. Second, the CP-decomposition of the non-symmetric tensor (VI.2) yields a tri-convex optimization problem, so the standard convex analysis for vanilla gradient descent [55] can be applied given a proper initialization.
With the regularity conditions detailed in Section O1, we present the theoretical results for non-symmetric tensor estimation as follows.
Theorem 5 (Upper Bound). Suppose Conditions 6 – 9 hold and n ≳ (slog(p0/s))3/2, where p0 = max{p1,p2,p3}. For any t = 0,1,2,…, the output of Algorithm O1 satisfies
for some 0 < κ < 1. When the total number of iterations is no smaller than , the final estimator satisfies
Theorem 6 (Lower Bound). Consider the class of incoherent sparse and low-rank tensors. If the cubic sketchings are i.i.d. standard normal with i.i.d. N(0, σ2) noise, min{p1, p2, p3} ≥ 20s, and s ≥ 4, we have
| (VI.6) |
Theorems 5 and 6 imply that the proposed algorithm achieves a minimax-optimal rate of estimation error in the class of as long as log(p0) ≍ log(p0/s).
VII. Numerical Results
In this section, we investigate the effect of noise level, CP-rank, sample size, dimension, and sparsity on the estimation performance by simulation studies. We also investigate the numerical performance of the proposed algorithm when the incoherence assumption required in the theoretical analysis fails to hold.
In each setting, we generate 𝒯 = Σk=1K ηk βk ∘ βk ∘ βk, where the support of each βk is uniformly selected from {1, …, p} and the nonzero entries of βk are drawn from the standard normal distribution; each βk is then normalized to a unit vector. The cubic sketchings are generated from xi with i.i.d. standard normal entries, and the noise is Gaussian with variance σ2 (σ = 0 in the noiseless case). Additionally, we adopt the following stopping rules: (1) the initialization iteration (Step 2 in Algorithm 1) is stopped once successive iterates change by less than a fixed tolerance; (2) the gradient update iteration (Step 3 in Algorithm 2) is stopped if ‖B(T+1) − B(T)‖F ≤ 10−6. The numerical results are based on 200 repetitions unless otherwise specified. The code was written in R and implemented on an Intel Xeon-E5 processor with 64 GB of RAM.
First, we consider the percentage of successful recoveries in the noiseless case. Let K = 3, s/p = 0.3, and p = 30 or 50, so that the total number of unknown parameters in 𝒯 is 2.7 × 104 or 1.25 × 105. The sample size n ranges from 500 to 6000. A recovery is called "successful" if the relative error ‖𝒯̂ − 𝒯‖F/‖𝒯‖F falls below a small threshold. We report the average successful recovery rate in Figure 2, from which we can see that the empirical relation among successful recovery, dimension, and sample size is consistent with the theoretical results in Section IV.
We then move to the noisy case, with K = 3 and p ∈ {30, 50}. We consider two scenarios: (1) sample size n = 6000, 8000, or 10000, s/p = 0.3, with the noise level σ varying from 0 to 200; (2) noise level σ = 200, sample size n varying from 4000 to 10000, p = 30, and s/p = 0.1, 0.3, 0.5. The estimation errors in these two scenarios are plotted in Figures 3 and 4, respectively. These results show that the proposed procedure performs well: Algorithms 1 and 2 yield more accurate estimates for smaller noise variance σ2 and/or larger sample size n.
Fig. 3. Estimation error under different noise levels. Left panel: p = 30; right panel: p = 50.
Fig. 4. Estimation error under different dimension/sample ratios (n/p3). Left panel: initial estimation error; right panel: final estimation error.
Next, we demonstrate that a low-rank tensor parameter with randomly generated factors satisfies the incoherence Condition 3 with high probability. Set the CP-rank K = 3 and the sparsity level s/p = 0.3, with the dimension p ranging from 10 to 2000. We compute the incoherence parameter Γ defined in Condition 3. The left panel of Figure 5 shows that Γ decays at a polynomial rate as s grows, which matches the bound in Condition 3. A theoretical justification of this point is also provided in Lemma 28.
Fig. 5. Left panel: incoherence parameter Γ with varying sparsity (the red line corresponds to the rate required in the theoretical analysis). Right panel: average relative estimation error for tensors with varying incoherence.
We further examine the performance of the proposed algorithm when the incoherence condition required in the theoretical analysis fails to hold. Specifically, we set the CP-rank K = 3, p = 30, and the sparsity level s/p = 0.3. We construct a large number of copies of the tensor parameter with i.i.d. standard normal factor vectors. For each copy, we calculate the incoherence Γj defined in Condition 3 and then manually pick 40 tensors such that
In this way, we obtain a set of tensor parameters with incoherence varying uniformly from 0 to 0.4. The right panel of Figure 5 plots the relative error for estimating each such tensor from observations of its cubic sketchings, based on 1000 repetitions. We can see that the proposed algorithm achieves small relative errors even when the true factors are highly coherent.
Moreover, we consider a setting with Laplace noise, i.e., the noise follows a Laplace distribution with noise level σ. With n = 3000, p = 30, and varying values of σ, the average estimation error and its comparison with the Gaussian noise setting are provided in Figure 6. We note that the estimation errors under Laplace noise are slightly higher than those under Gaussian noise.
Fig. 6. Comparison of estimation errors under Laplace and Gaussian noise.
We also compare the estimation errors of the initial and final estimators for different ranks and sample sizes. Set K = 3, p = 30, s/p = 0.3 and consider the noiseless setting. It is clear from Figure 7 that the initialization error decays substantially but does not converge to zero as the sample size n grows. This matches our theoretical findings in Theorem 2: as discussed in Remark 5, the initialization stage may yield an inconsistent estimator due to the incoherence among the βk's. From the right panel of Figure 7, we can see that the final estimator is more stable and accurate than the initial one, which illustrates the merit of the thresholded gradient descent step of the proposed procedure.
Fig. 7. Log relative estimation error of the initial estimator (left panel) and of the initial/final estimators (right panel).
Finally, we compare the performance of the proposed method with the alternating least squares (ALS)-based tensor regression method [3]. We consider two schemes for the initialization of ALS: (a) i.i.d. standard Gaussian initial factors (cold start), and (b) initial factors generated from the proposed Algorithm 1 (warm start). Setting K = 2, s/p = 0.2, and p = 30, we apply both the proposed procedure and the ALS-based algorithm and record the average estimation errors with standard deviations for both the initial and final estimators. From the results in Table I, one can see that the proposed algorithm significantly outperforms ALS under both the cold and warm start schemes. The main reason is pointed out in Remark 8: the cubic sketching setting possesses distinct aspects compared with the i.i.d. random Gaussian sketching setting, so the method proposed by [3] does not exactly fit here.
VIII. Discussions
This paper focuses on the third order tensor estimation via cubic sketchings. Moreover, all results can be extended to the higher-order case via high-order sketchings. To be specific, suppose
where is an order-m, sparse, and low-rank tensor. In order to estimate based on , one can first construct the order-m moment-based estimator using a generalized version of Theorem 7 and the fact that the score functions for the density function p(x) satisfy a nice recursive equation:
Then, one can similarly perform high-order sparse tensor decomposition and thresholded gradient descent to estimate . On the theoretical side, we can show if mild conditions hold and n ≥ C(logn)m(slogp)m/2, the proposed procedure achieves
with high probability. The minimax optimality can be shown similarly.
Fig. 2. Successful recovery rate with varying sample size.
TABLE I.
Estimation Error and Standard Deviation (in Parentheses) of the Proposed Method and ALS-Based Method
| Sample size | ours | warm start | cold start | initial |
|---|---|---|---|---|
| n = 4000 | 4.02 (0.13) | 32.82 (1.79) | 37.78 (1.23) | 38.03 (1.74) |
| n = 5000 | 4.02 (0.13) | 32.34 (2.34) | 36.96 (2.10) | 33.71 (1.78) |
| n = 6000 | 1.77 (0.09) | 22.22 (1.21) | 59.97 (3.40) | 25.57 (1.48) |
Acknowledgment
Guang Cheng would like to acknowledge support by NSF DMS-1712907, DMS-1811812, DMS-1821183, and Office of Naval Research (ONR N00014- 18-2759). While completing this work, Guang Cheng was a member of Institute for Advanced Study, Princeton and visiting Fellow of SAMSI for the Deep Learning Program in the Fall of 2019; he would like to thank both Institutes for their hospitality. Anru Zhang would like to acknowledge support by NSF CAREER-1944904, NSF DMS-1811868, and NIH R01 GM131399.
Biography
Botao Hao received B.S. degree from the School of Mathematics, Nankai University, China, in 2014, and Ph.D. from the Department of Statistics, Purdue University, USA, 2019. He is currently a postdoctoral researcher in Department of Electrical Engineering, Princeton University, USA.
Anru Zhang received the Ph.D. degree from University of Pennsylvania, Philadelphia, PA, in 2015, and B.S. degree from Peking University, Beijing, China, in 2010. He is currently an assistant professor in Statistics at the University of Wisconsin-Madison, Madison, WI. His current research interests include high-dimensional statistical inference, tensor data analysis, statistical learning theory, dimension reduction, and convex/non-convex optimization.
Guang Cheng received BA degree in Economics from Tsinghua University, China, in 2002, and PhD degree from University of Wisconsin–Madison in 2006. He then joined Dept of Statistics at Duke University as Visiting Assistant Professor and Postdoc Fellow in SAMSI. He is currently Professor in Statistics at Purdue University, directing Big Data Theory research group, whose main goal is to develop computationally efficient inferential tools for big data with statistical guarantees.
Appendix
This appendix contains five parts: (1) Sections A–B provide detailed proofs for the empirical moment estimator and concentration results; (2) Sections C–N provide additional proofs for the main theoretical results of this paper; (3) Section O covers the pseudo-code, conditions, and main proofs of non-symmetric tensor estimation; (4) Section P discusses the matrix form of the gradient function and stochastic gradient descent; (5) Section Q provides several technical lemmas and their proofs.
A. Moment Calculation
We first introduce three lemmas showing that the empirical moment-based tensors (III.6), (V.5), and (VI.4) are all unbiased estimators of the target low-rank tensor in the corresponding scenarios. Detailed proofs of the three lemmas are postponed to Sections G1, G2, and G3 in the supplementary materials.
Lemma 3 (Unbiasedness of moment estimator under non-symmetric sketchings). For non-symmetric tensor estimation model (VI.1) & (VI.2), define the empirical moment-based tensor by
Then is an unbiased estimator for , i.e.,
The extension to the symmetric case is non-trivial due to the dependency among the three identical sketching vectors. We borrow the idea of the high-order Stein's identity, which was originally proposed in [49]. To fix ideas, we present only the third-order result for simplicity; the extension to higher orders is straightforward.
Theorem 7 (Third-order Stein’s Identity, [49]). Let be a random vector with joint density function p(x). Define the third order score function as . Then for continuously differentiable function , we have
| (A.1) |
In general, the order-m high-order score function is defined as
Interestingly, the high-order score function has a recursive differential representation
| (A.2) |
with S1(x) = −∇x log p(x). This recursive form is helpful for constructing an unbiased tensor estimator under symmetric cubic sketchings. Note that the first-order score function is the same as the score function in Lemma 26 (Stein's lemma [56]). The proof of Theorem 7 relies on iteratively applying the recursive representation of the score function (A.2) and the first-order Stein's lemma (Lemma 26). We provide the detailed proof in Section F for the sake of completeness.
In particular, if x is a standard Gaussian vector, the score function of each order can be calculated from (A.2) as follows,
S1(x) = x, S2(x) = x ∘ x − Ip, S3(x) = x ∘ x ∘ x − Σj=1p (x ∘ ej ∘ ej + ej ∘ x ∘ ej + ej ∘ ej ∘ x). | (A.3) |
Interestingly, if we let , then
| (A.4) |
which is exactly . Connecting this fact with (A.1), we are able to construct the unbiased estimator in the following lemma through high-order Stein’s identity.
Lemma 4 (Unbiasedness of moment estimator under symmetric sketchings). Consider the symmetric tensor estimation model (III.1) & (IV.9). Define the empirical first-order moment . If we further define an empirical third-order-moment-based tensor by
then
Proof. Note that yi = G(xi) + ϵi. Then we have
where is defined in (A.3). By using the conclusion in Theorem 7 and the fact (A.4), we obtain
since ϵi is independent of xi. This ends the proof. ■
Although the interaction effect model (V.1) is still based on symmetric sketchings, we need much more careful construction for the moment-based estimator, since the first coordinate of the sketching vector is always constant 1. We give such an estimator in the following lemma.
Lemma 5 (Unbiasedness of moment estimator in interaction model). For the interaction effect model (V.1), construct the empirical moment-based tensor as follows:
For i, j, k ≠ 0, . And , , .
For i ≠ 0, .
Then 𝒯s′ is an unbiased estimator of 𝒯, i.e.,
B. Proofs of Lemmas 1 and 2: Concentration Inequalities
We aim to prove Lemmas 1 and 2 in this subsection. These two lemmas provide the key concentration inequalities for the theoretical analysis of the main results. Before going into technical details, we introduce a quasi-norm called the ψα-norm.
Definition 1 (ψα-norm [34]). The ψα-norm of any random variable X and α > 0 is defined as
In particular, a random variable with bounded ψ2-norm or bounded ψ1-norm is called sub-Gaussian or sub-exponential, respectively. The next lemma provides an upper bound for the p-th moment of a sum of random variables with bounded ψα-norm.
Lemma 6. Suppose X1,…,Xn are n independent random variables satisfying with α > 0, then for all and p ≥ 2,
| (A.5) |
where 1/α* + 1/α = 1, C1(α),C2(α) are some absolute constants only depending on α.
If 0 < α < 1, (A.5) is a combination of Theorem 6.2 in [57] and the fact that the p-th moment of a Weibull variable with parameter α is of order p1/α. If α ≥ 1, (A.5) follows from a combination of Corollaries 2.9 and 2.10 in [58]. Continuing with standard symmetrization arguments, we reach the conclusion for general random variables. When α = 1 or 2, (A.5) coincides with standard moment bounds for a sum of sub-Gaussian and sub-exponential random variables in [59]. The detailed proof of Lemma 6 is postponed to Section H.
When 0 < α < 1, by Chebyshev’s inequality, one can obtain the following exponential tail bound for the sum of random variables with bounded ψα-norm. This lemma generalizes the Hoeffding-type concentration inequality for sub-Gaussian random variables (see, e.g. Proposition 5.10 in [59]), and Bernstein-type concentration inequality for sub-exponential random variables (see, e.g. Proposition 5.16 in [59]).
Lemma 7. Suppose 0 < α < 1, X1,…,Xn are independent random variables satisfying . Then there exists absolute constant C(α) only depending on α such that for any and 0 < δ < 1/e2,
with probability at least 1 − δ.
Proof. For any t > 0, by Markov’s inequality,
where the last inequality is from Lemma 6. We set t such that . Then for p ≥ 2,
holds with probability at least 1 − exp(−p). Letting δ = exp(−p), we have that for any 0 < δ < 1/e2,
holds with probability at least 1 − δ. This ends the proof. ■
The next lemma provides an upper bound for the product of random variables in ψα-norm.
Lemma 8 (ψα for product of random variables). Suppose X1,…,Xm are m random variables (not necessarily independent) with ψα-norm bounded by . Then the ψα/m-norm of is bounded as
Proof. For any and α > 0, by using the inequality of arithmetic and geometric means we have
Since exponential function is a monotone increasing function, it shows that
| (A.6) |
From the definition of ψα-norm, for j = 1,2,…,m, each individual Xj has
| (A.7) |
Putting (A.6) and (A.7) together, we obtain
Therefore, we conclude that the ψα/m-norm of is bounded by .
Proof of Lemma 1. Note that for any j = 1,2,…,m, the ψ2-norm of is bounded by ‖βj‖2 [59]. According to Lemma 8, the ψ2/m-norm of is bounded by . Directly applying Lemma 7, we reach the conclusion.
Proof of Lemma 2. We first focus on the non-symmetric version and the proof follows three steps:
Truncate the first coordinate of x1i, x2i, x3i by a carefully chosen truncation level;
Utilize the high-order concentration inequality in Lemma 20 at order three;
Show that the bias caused by truncation is negligible.
With a slight abuse of notation, we denote by a, x, y, etc. the first coordinates of a, x, y, etc. Without loss of generality, we assume p ≔ max{p1, p2, p3}. By unitary invariance, we may assume β1 = β2 = β3 = e1, where e1 = (1, 0, …, 0)⊤.
Then, it is equivalent to prove
Suppose and are n independent samples of {x1, x2, x3}. And define a bounded event for the first coordinate and its corresponding population version,
where M is a large constant to be specified later. The quantity of interest can be upper bounded by M1 + M2, where
and
We will show that M2 is negligible relative to the convergence rate of M1.
Bounding M1. For simplicity, we define , , and are n independent samples of . According to the law of total probability, we have
where
According to Lemma 22, the entry of are sub-Gaussian random variable with ψ2-norm M2. Applying Lemma 20, we obtain
where δn,s = ((slog(p/s))3/n2)1/2 + (slog(p/s)/n)1/2.
On the other hand,
Putting the above bounds together, we obtain
By setting , the bound of M1 reduces to
| (A.8) |
Bounding M2. From the definitions of M2 and sparse spectral norm,
where
Since x1j is independent of x1k for any j ≠ k, . Similar results hold for x2,x3. Then we have
By the basic property of Gaussian random variable, we can show
Plugging them into M2, we have
where the last inequality holds for a large M > 0. By the choice of , we have for some constant C2. When n is large, this rate is negligible comparing with (A.8)
Bounding M: We put the upper bounds of M1 and M2 together. After some adjustments for absolute constant, it suffices to obtain
with probability at least 1 − 10/n3 − 1/p. This concludes the proof of non-symmetric part. The proof of symmetric part remains similar and thus is omitted here. ■
C. Proof of Theorem 2: Initialization Effect
Theorem 2 gives an upper bound on the approximation error of the sparse-tensor-decomposition-based initial estimator. In Step 1 of Section III-A, the original problem can be reformulated as a version of tensor denoising:
| (A.9) |
The key difference between our model (A.9) and the recent works [50, 27] is that E arises from the empirical moment approximation, rather than from the random observation noise considered in [50] and [27]. The next lemma gives an upper bound for this approximation error. The proof of Lemma 9 is deferred to Section I.
Lemma 9 (Approximation error of ). Recall that , where is defined in (III.6). Suppose Condition 4 is satisfied and s ≤ d ≤ Cs. Then
| (A.10) |
with probability at least 1 − 5/n for some uniform constant C1.
Next we denote the following quantity for simplicity,
| (A.11) |
where R is the singular value ratio, K is the CP-rank, s is the sparsity parameter, Γ is the incoherence parameter and C2 is uniform constant.
Next lemma provides theoretical guarantees for sparse tensor decomposition method.
Lemma 10. Suppose that the symmetric tensor denoising model (A.9) satisfies Conditions 1, 2 and 3 (i.e., the identifiability, parameter space and incoherence). Assume the number of initializations and the number of iterations for constants C3,C4, the truncation parameter s ≤ d ≤ Cs. Then the sparse-tensor-decomposition-based initialization satisfies
| (A.12) |
for any k ∈ [K]
The proof of Lemma 10 essentially follows Theorem 3.9 in [27]; we thus omit the details here. The upper bound in (A.12) contains two terms, which are due to the empirical moment approximation and the incoherence among the different βk, respectively.
Although the sparse tensor decomposition is not statistically rate-optimal, it does offer a reasonable initial estimate provided there are enough samples. Equipped with (A.10) and Condition 2, the right-hand side of (A.12) reduces to
with probability at least 1−5/n. Denote C0 = 4·2160·C1C4. Using Conditions 3 and 5, we reach the conclusion that
with probability at least 1 − 5/n.
D. Proof of Theorem 1: Gradient Update
We first introduce the following lemma to illustrate the improvement of one thresholded gradient update under suitable conditions. The error bound includes two parts: the optimization error, which describes the effect of one gradient update, and the statistical error, which reflects the effect of the random noise. The proof of Lemma 11 is given in Section J. For notational simplicity, we drop the superscript in the following proof.
Lemma 11. Let t ≥ 0 be an integer. Suppose Conditions 1–5 hold and satisfies the following upper bound
| (A.13) |
with probability at least , where . As long as the step size μ satisfies
| (A.14) |
then can be upper bounded as
with probability at least .
In order to apply Lemma 11, we prove by induction that the required condition (A.13) holds at every iteration step t. When t = 0, by (IV.2) and Condition 2,
holds with probability at least . Since the initial estimator output by the first stage is normalized, i.e., , by the triangle inequality we have
Note that
This implies
with probability at least . Taking the summation over k ∈ [K], we have
with probability at least , which means (A.13) holds for t = 0.
Suppose (A.13) holds at the iteration step t − 1, which implies
Since Condition 5 automatically implies
for a sufficiently large C0, we can obtain
By induction, (A.13) holds at each iteration step.
Now we are able to use Lemma 11 recursively to complete the proof. Repeatedly using Lemma 11, we have for t = 1, 2, …,
with probability at least . This concludes the first part of Theorem 1.
When the total number of iterations is no smaller than
the statistical error will dominate the whole error bound in the sense that
| (A.15) |
with probability at least .
The next lemma shows that the Frobenius norm distance between two tensors can be bounded by the distances between each factors in their CP decomposition. The proof of this lemma is provided in Section K.
Lemma 12. Suppose and have CP-decomposition and . If , then
Denote . Combining (A.15) and Lemma 12, we have
with probability at least . By setting C1 = 9C2/4, we complete the proof of Theorem 1.
E. Proofs of Theorems 4 and 6: Minimax Lower Bounds
We first consider the proof of Theorem 6 on non-symmetric tensor estimation. Without loss of generality, assume p = max{p1, p2, p3}. We uniformly randomly generate MK support sets as subsets of {1, …, p} with cardinality s. Here M > 0 is a large integer to be specified later. Then we construct the candidate components as
λ > 0 will also be specified shortly. Clearly, for any 1 ≤ k ≤ K and 1 ≤ m1, m2 ≤ M, the overlap of the corresponding supports follows a hyper-geometric distribution: .
Let
| (A.16) |
then for any s/2 ≤ t ≤ s,
Thus, if η > 0, the moment generating function of satisfies
Here, (*) is due to η > 0 and ⌊s/2⌋ + 1 ≥ s/2. By setting η = log((p − s + 1)/(8s)), we have
| (A.17) |
Since p ≥ 20s and s ≥ 4, we have
Combining the two inequalities above, we have.
for c0 = 1/20.
Next we choose M = ⌊exp(c0/2 · sK log(p/s))⌋. Note that
then we further have
which means that, with positive probability, there exist candidates satisfying
| (A.18) |
For the rest of the proof, we fix to be the set of vectors satisfying (A.18).
Next, recall the canonical basis . Define
For each tensor and n i.i.d. Gaussian sketches , we denote the response
where . Clearly, y (y(m), u, v, w) follows a joint distribution, which may vary based on different values of m.
In this step, we analyze the Kullback-Leibler divergence between different distribution pairs:
Note that conditioning on fixed values of u, v, w,
By the KL-divergence formula for Gaussian distribution,
Therefore, for any m1 ≠ m2,
Meanwhile, for any 1 ≤ m1 < m2 ≤ M,
By generalized Fano’s Lemma (see, e.g., [60]),
Finally we set for some small constant c > 0, then
which finishes the proof of Theorem 6.
For the proof of Theorem 4, without loss of generality we assume K is a multiple of 3. We first partition {1, …, p} into two subintervals, I1 = {1, …, p − K/3} and I2 = {p − K/3 + 1, …, p}, randomly generate (MK/3) support subsets of {1, …, p − K/3}, and construct the candidate factors on these supports.
With M = exp(csK log(p/s)) and similar techniques as in the previous proof, one can show that, with positive probability,
We then construct the following candidate symmetric tensors by blockwise design,
Then we can see for any ,
The rest of the proof essentially follows from the proof of Theorem 6. ■
F. Proof of Theorem 7: High-order Stein’s Lemma
The proof of this theorem follows that of Theorem 6 in [49]. For the sake of completeness, we restate the details here. Applying the recursive representation of the score function (A.2), we have
Then, we apply the first-order Stein’s lemma (see Lemma 26) on function and obtain
Repeating the above argument two more times, we reach the conclusion. ■
G. Proofs of Lemmas 3, 4, and 5: Moment Calculation
In this subsection, we present the detail proofs of moment calculation, including non-symmetric case, symmetric case, and interaction model.
1). Proof of Lemma 3:
By the definition of {yi} in (VI.1) & (VI.2), we have
| (A.19) |
First, we observe due to the independence between ϵi and {ui,vi,wi}. Then, we consider a single component from a single observation
For notation simplicity, we drop the subscript i for i-th observation and k for k-th component such that
| (A.20) |
Each entry of M can be calculated as follows
which implies M = β1 ∘ β2 ∘ β3. Combining with n observations and K components, we can obtain
This finishes our proof. ■
2). Proof of Lemma 4:
In this subsection, we provide an alternative and more direct proof of Lemma 4. We consider a single component similar to (A.20) but with a symmetric structure, namely, . Based on the symmetry of both the underlying tensor and the sketchings, we verify the following three cases:
- When i = j = k, then
The last equation is due to ‖β*‖2 = 1.
- When i, j, k are mutually distinct, then
- When i = j ≠ k, then
Therefore, it is sufficient to calculate Ms by
The first term is a bias term due to correlations among the symmetric sketchings. Denote and note that . Therefore, the empirical first-order moment M1 can be used to remove the bias term as follows:
This finishes our proof. ■
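For the reader's convenience, the three cases above can be packaged into a single formula via Isserlis' (Wick's) theorem: for x ~ N(0, I_p) and a unit vector β,

\[
\mathbb{E}\big[(\beta^\top x)^3\, x_i x_j x_k\big]
= 6\,\beta_i\beta_j\beta_k + 3\big(\beta_i\delta_{jk}+\beta_j\delta_{ik}+\beta_k\delta_{ij}\big),
\]

so the rank-one signal appears with a constant factor, while the δ-terms constitute the bias that the empirical first-order moment M1 removes.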
3). Proof of Lemma 5:
As before, consider a single component first. For notational simplicity, we drop the subscript l for the l-th observation and k for the k-th component. Since each component is normalized, the entry-wise expectation of (β⊤x)³ x ∘ x ∘ x can be calculated as
Due to the symmetric structure and the non-randomness of the first coordinate, a bias appears in each entry. For i, j, k ≠ 0, we can use to remove the bias, as shown in the proof of Lemma 4 above. For subscripts involving 0, the following two calculations remove the bias:
This ends the proof. ■
H. Proof of Lemma 6
Recall that ‖X‖ψα is defined in Definition 1. Without loss of generality, we assume ‖X‖ψα = 1 and throughout this proof. Let β = (log 2)^{1/α} and Zi = (|Xi| − β)+, where (x)+ = x if x ≥ 0 and (x)+ = 0 otherwise. For notational simplicity, we define for a random variable X. The next step is to estimate the moments of linear combinations of the variables .
According to the symmetrization inequality (e.g., Proposition 6.3 of [61]), we have
| (A.21) |
where are independent Rademacher random variables, and we note that εiXi and εi|Xi| are identically distributed. Moreover, if |Xi| ≥ β, the definition of Zi implies that |Xi| = Zi + β, and if |Xi| < β, we have Zi = 0. Thus |Xi| ≤ Zi + β always holds, which leads to
| (A.22) |
By the triangle inequality,
| (A.23) |
Next, we bound the second term on the RHS of (A.23). In particular, we utilize the Khinchin–Kahane inequality, whose formal statement is included in Lemma 27 for the sake of completeness. From Lemma 27 we have
| (A.24) |
Since are independent Rademacher random variables, a simple calculation implies
| (A.25) |
| (A.26) |
Combining inequalities (A.22)–(A.25),
| (A.27) |
Let be independent symmetric random variables satisfying for all t ≥ 0. Then we have
which implies
| (A.28) |
since εiYi and Yi have the same distribution due to symmetry. Combining (A.27) and (A.28) together, we reach
| (A.29) |
For 0 < α < 1, it follows from Lemma 25 that
| (A.30) |
where C1(α) is some absolute constant only depending on α.
For α ≥ 1, we combine Lemma 24 with integration by parts to pass from a tail bound to a moment bound. Recall that for every non-negative random variable X, integration by parts yields the identity
Applying this to and making the change of variable t ↦ t^p, we have
| (A.31) |
where the inequality is from Lemma 24 for all p ≥ 2 and 1/α + 1/α* = 1. In the following, we bound the integral in three steps:
- If , (A.31) reduces to
Letting , we have
where the second equality follows from the density of a Gamma random variable. Thus,
| (A.32) |
Since 0 < β < 1, the conclusion follows by combining (A.29), (A.30), and (A.34). ■
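For completeness, the integration-by-parts identity invoked in the α ≥ 1 case above is the standard tail-to-moment formula: for any non-negative random variable Y and any p ≥ 1,

\[
\mathbb{E}[Y]=\int_0^\infty \mathbb{P}(Y>t)\,dt,
\qquad
\mathbb{E}[Y^p]=p\int_0^\infty t^{p-1}\,\mathbb{P}(Y>t)\,dt,
\]

where the second identity follows from the first via the change of variable t ↦ t^p.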
I. Proof of Lemma 9
Firstly, let us consider the non-symmetric perturbation error analysis. According to Lemma 3, the exact form of is given by
We decompose it into a concentration term and a noise term as follows:
| (A.35) |
where
Bounding : For the k-th component of , we denote
Define
By using Lemma 2 and s ≤ d ≤ Cs, it suffices to have for some absolute constant C11,
with probability at least 1 − 10/n3, where ‖ · ‖s+d is the sparse tensor spectral norm defined in (II.3). Equipped with the triangle inequality, the sparse tensor spectral norm for can be bounded by
| (A.36) |
with probability at least 1 − 10K/n3.
Bounding : Note that the random noise is independent of the sketching vectors {ui,vi,wi}. For fixed , applying Lemma 20, we have, for some absolute constant C12,
with probability at least 1−1/p. According to Lemma 23, we have
| (A.37) |
Bounding : Putting (A.36) and (A.37) together, we obtain
with probability at least 1 − 5/n. Under Condition 9, we have
with probability at least 1 − 5/n.
The perturbation error analysis for the symmetric tensor estimation model and the interaction effect model is similar since the empirical first-order moment converges much faster than the empirical third-order moment. So we omit the detailed proof here. ■
J. Proof of Lemma 11
Lemma 11 quantifies one step of the thresholded gradient update. The proof consists of two parts.
First, we evaluate an oracle estimator with known support information, which is defined as
| (A.38) |
Here,
is the k-th component of h(B(t)) defined in (III-B)
, where .
For a vector and a subset A ⊂ {1,…,p}, we denote by the vector obtained by keeping the coordinates of x with indices in A unchanged and setting all other coordinates to zero (see the short sketch below).
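As a minimal illustration of this restriction operator (a hypothetical helper, not from the paper):

```python
import numpy as np

def restrict(x, A):
    """Keep the coordinates of x indexed by A and set all others to zero."""
    out = np.zeros_like(x)
    idx = list(A)
    out[idx] = x[idx]
    return out
```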
We will show that converges at a geometric rate in the optimization error and at the optimal rate in the statistical error. See Lemma 13 for details.
Second, we aim to prove that and are almost equivalent with high probability. See Lemma 14 for details. For simplicity, we drop the superscript of , F(t) in the following proof, and denote , and F(t+1) by , and F+ respectively.
Lemma 13. Suppose Conditions 1–5 hold. Assume (A.13) is satisfied and |F| ≲ Ks. As long as the step size μ ≤ 32R−20/3/(3K[220 + 270K]2), we obtain the upper bound for ,
| (A.39) |
with probability at least 1 − (21K2 + 11K + 4Ks)/n.
The proof of Lemma 13 is postponed to Section L. The next lemma guarantees that, with high probability, is equivalent to the oracle update .
Lemma 14. Recall that the truncation level h(βk) is defined as
| (A.40) |
If |F| ≲ Ks, we have for any k ∈ [K] with probability at least 1 − (n2p)−1 and F+ ⊂ F.
The proof of Lemma 14 is postponed to Section M. By using Lemma 14 and induction, we have
This implies that |F(t)| ≲ Ks for every t. Combining Lemmas 13 and 14, we obtain, with probability at least 1 − (21K2 + 11K + 4Ks)/n,
| (A.41) |
This ends the proof. ■
K. Proof of Lemma 12
Based on the CP low-rank structure of the true tensor parameter , we can explicitly write down the distance between and under the tensor Frobenius norm as follows:
For notational simplicity, denote . Then
Since (a + b + c)2 ≤ 3(a2 + b2 + c2), we have
Equipped with the Cauchy–Schwarz inequality, the RHS can be further bounded by
At the same time, using for k ∈ [K],
For the non-symmetric tensor estimation model, we have
Following the same strategy above, we obtain
This ends the proof. ■
L. Proof of Lemma 13
First of all, we state a lemma to illustrate the effect of the weight ϕ. The proof of Lemma 15 is deferred to Section N.
Lemma 15. Suppose come from either the non-symmetric tensor estimation model (VI.1) or the symmetric tensor estimation model (III.1), and that Conditions 3–5 hold. Then is upper and lower bounded by
with probability at least 1 − (K2 + K + 3)/n, where Γ is the incoherence parameter defined in Definition 3.
According to Lemma 15, approximates up to constants with high probability. Moreover, from (A.13) we know that for some small ε0. Based on these two facts, we replace ηk by and ϕ by for simplicity; this change only affects constant factors in the final results. A similar simplification was used in the matrix recovery scenario [62]. Therefore, we define the weighted estimator and the weighted true parameter as , . Now, . Recall that · is the loss function defined in (III.4). Correspondingly, with a slight abuse of notation, define the gradient function on F as
and its noiseless version as
| (A.42) |
According to the definition of thresholding function (III.8), can be written as
where satisfies and is defined as
| (A.43) |
Moreover, we denote . With a slight abuse of notation, we also drop the subscript F in this section for simplicity.
We expand and decompose the sum of squared errors into three parts as follows:
| (A.44) |
In the following, we bound the three parts sequentially.
1). Bounding gradient update effect:
In order to separate the optimization error and statistical error, we use the noiseless gradient as a bridge such that A can be decomposed as
| (A.45) |
where A1 and A2 quantify the optimization error, A3 quantifies the statistical error, and A4 is a cross term that is negligible compared with the rate of the statistical error. The lower bound for A1 and the upper bound for A2 together correspond to the verification of regularity conditions in the matrix recovery case [52].
Step One: Lower bound for A1.
Plugging in , we have
| (A.46) |
According to the definition of the noiseless gradient and zk, A1 can be expanded and decomposed into nine terms,
| (A.47) |
| (A.48) |
where A11 is the main term according to the order of , while A12 to A19 are remainder terms. The proofs of the lower bounds for A11 to A19 follow two steps:
- Calculate and lower bound the expectation of each term via Lemma A.2 (high-order Gaussian moments);
- Argue that the empirical version concentrates around its expectation with high probability via Lemma 1 (high-order concentration inequality).
Bounding A11. Note that A11 involves products of dependent Gaussian vectors. This brings difficulties in both the calculation of expectations and the use of concentration inequalities. According to the high-order Gaussian moment results in Lemma A.2, the expectation of A11 can be calculated explicitly as
| (A.49) |
Note that I1 to I4 each involve a summation of K2 terms. To use the incoherence Condition 3, we isolate the K terms with k = k′. Then I1 to I4 can be lower bounded as
where Γ is the incoherence parameter. Putting the above four bounds together yields
| (A.50) |
On the other hand, repeatedly using Lemma 1, we obtain that with probability at least 1 − 1/n,
Summing over k, k′ ∈ [K], this further implies that, for some absolute constant C,
| (A.51) |
with probability at least 1 − K2/n. Combining (A.50) and (A.51), we obtain, with probability at least 1 − K2/n,
| (A.52) |
where . Here, we use the facts that Γ ≤ 1 and .
Bounding A12 to A19: For remainder terms, we follow the same proof strategy. According to Lemma A.2, the expectation of A12 can be calculated as
Let us analyze I1 first. Under (A.13), , it suffices to show that
This immediately implies a lower bound for once I2, I3, and I4 are bounded similarly,
| (A.53) |
By Lemma 1, we obtain for some absolute constant C,
| (A.54) |
with probability at least 1−K2/n. The detailed derivation is the same as in (A.52), so we omit it here.
Similarly, the lower bounds of A13 to A19 can be derived as follows
| (A.55) |
Putting (A.52), (A.54) and (A.55) together, we have with probability at least 1 − 9K2/n,
For the above bound,
- When the sample size satisfies
we have
- When ε0 ≤ K−1R−2/2160, we have
- When the incoherence parameter satisfies Γ ≤ K−1/2/216, we have
Note that the above conditions are fulfilled by Conditions 3, 5, and (A.13). Thus, we can simplify A1 as
| (A.56) |
with probability at least 1 − 9K2/n.
Step Two: Upper bound for A2.
We observe the fact that
| (A.57) |
where is the unit sphere. Equivalently, it suffices to show that is upper bounded for any . According to the definition of the noiseless gradient (A.42), is written explicitly as
Following (A.46) and (A.48), a similar decomposition can be made for as follows, where the only difference is that we replace one by .
Let us bound first. Using the same technique as in the calculation of (A.49), we derive an upper bound for .
Equipped with Lemma 2 and the definition of the sparse tensor spectral norm (II.3), it suffices to bound by
with probability at least 1−10K2/n3, where δn,p,s is defined in (IV.7).
The upper bounds for to take similar forms. Combining them, we can derive an upper bound for as follows:
with probability at least 1 − 90K2/n3, where the second inequality utilizes Condition 5. Therefore, the upper bound of A2 is given as follows
| (A.58) |
with probability at least 1 − 90K2/n3.
Step Three: Upper bound for A3.
By the definitions of the noisy and noiseless gradients, A3 is written explicitly as
where the second inequality comes from (A.46). For fixed , applying Lemma 1, we have
with probability at least 1 − 1/n. Together with Lemma 23, we obtain for any j ∈ [Ks],
with probability at least 1 − 4/n, where σ is the noise level. According to (A.13),
which further implies . Equipped with union bound over j ∈ [Ks],
with probability at least 1 − 4Ks/n. Letting ,
| (A.59) |
with probability at least 1 − 4Ks/n.
Step Four: Upper bound for A4.
This cross term can be written as
To bound this term, we take the same approach as in Step Three, fixing the noise term first. Similarly, we obtain, with probability at least 1 − 4K/n,
| (A.60) |
This term is of negligible order compared with (A.59).
Summary. Putting the bounds (A.56), (A.58), (A.59), and (A.60) together, we obtain an upper bound for the gradient update effect as follows:
| (A.61) |
with probability at least 1 − (18K2 + 4K + 4Ks)/n. ■
2). Bounding thresholding effect:
The thresholding effect term in (A.44) can also be decomposed into optimization error and statistical error. Recall that B can be explicitly written as
where supp(γk) ⊂ Fk and ‖γk‖∞ ≤ 1. By using (a + b)2 ≤ 2(a2 + b2), we have
where
Bounding B1. This optimization error term shares a similar structure with (A.57) but is of higher order. Therefore, we follow the same idea as in bounding (A.57). By (A.46) and some basic expansions and inequalities,
The main term is , according to the order of . We bound it first. Note that there exists some large positive constant C such that
Together with Lemma 1 and (A.13), we have
with probability at least 1 − 3K2/n. Overall, the upper bound of B1 takes the form
| (A.62) |
with probability at least 1 − 3K2/n.
Bounding B2. We rewrite B2 by
For fixed , according to Lemma 1, we have
Note that . This reduces to
From Lemma 23, with probability at least 1 − 3/n,
Combining the above two inequalities, we obtain
| (A.63) |
with probability at least. Plugging in the definition of ϕ and (A.13), B2 is upper bounded by
| (A.64) |
with probability at least 1 − 7K/n.
Summary. Putting the bounds (A.62) and (A.64) together, we obtain a similar upper bound for the thresholding effect:
| (A.65) |
with probability at least 1 − (3K2 + 7K)/n. ■
3). Ensemble:
From the definition of γk, it is not hard to see that the cross term C is in fact equal to zero. Combining the upper bounds for the gradient update effect (A.61) and the thresholding effect (A.65), we obtain
As long as the step size μ satisfies
we reach the conclusion
| (A.66) |
with probability at least 1 − 4Ks/n.
M. Proof of Lemma 14
Let us consider the k-th component first. Without loss of generality, suppose F ⊂ {1,2,…,Ks}. For j = Ks + 1,…,p,
| (A.67) |
and it is not hard to see that and xij are independent. Applying the standard Hoeffding inequality, we have, with probability at least ,
Equipped with union bound, with probability at least ,
Therefore, according to the definition of thresholding function φ(x), we obtain the following equivalence,
| (A.68) |
holds for all k ∈ [K] with probability at least . Moreover, (A.68) gives for every k ∈ [K], which further implies F+ ⊂ F. This ends the proof. ■
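For reference, the proof invokes the standard Hoeffding inequality; in its sub-Gaussian form it reads as follows (a reminder only; c denotes an unspecified absolute constant): if Z1, …, Zn are independent, mean-zero, sub-Gaussian random variables with ‖Zi‖ψ2 ≤ K, then for every t ≥ 0,

\[
\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n Z_i\Big|\ge t\Big)\ \le\ 2\exp\Big(-\frac{c\,n\,t^2}{K^2}\Big).
\]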
N. Proof of Lemma 15
First, we consider the symmetric case. According to the definition of from the symmetric tensor estimation model (III.1), we separate the random noise ϵi via the following expansion:
| (A.69) |
Bounding I1. We expand the i-th component of I1 as follows.
| (A.70) |
As shown in Corollary A.2, the expectations of the above two parts take the forms
Recall that for any k ∈ [K], and that Condition 3 implies for any ki ≠ kj, where Γ is the incoherence parameter. Thus, is upper bounded by
| (A.71) |
By using the concentration result in Lemma 1, we have, with probability at least 1 − 1/n,
| (A.72) |
Putting (A.70), (A.71), and (A.72) together, we obtain an upper bound for I1, namely
| (A.73) |
with probability at least 1 − K2/n.
Bounding I2. Since the random noise has mean zero and is independent of {xi}, we have
By using the independence and Corollary 1, we have
This further implies that
| (A.74) |
with probability at least 1 − 4K/n.
Bounding I3. As shown in Lemma 23, the random noise ϵi with sub-exponential tail satisfies
| (A.75) |
with probability at least 1 − 3/n.
Overall, putting (A.73), (A.74) and (A.75) together, we have with probability at least 1 − (K2 + 4K + 3)/n,
Under Conditions 4 & 5, the above bound reduces to
with probability at least 1 − (K2 + 4K + 3)/n. The proof of the lower bound is similar and hence omitted.
Similar results also hold for the non-symmetric tensor estimation model. Throughout the proof, the only difference is that
O. Non-symmetric Tensor Estimation
1). Conditions and Algorithm:
In this subsection, we provide several essential conditions for Theorem 5 and the detailed algorithm for non-symmetric tensor estimation.
Condition 6 (Uniqueness of CP-decomposition). The CP-decomposition form (VI.2) is unique in the sense that if there exists another CP-decomposition , it must have K = K′ and be invariant up to a permutation of {1,…,K}.
Condition 7 (Parameter space). The CP-decomposition of satisfies
for some absolute constants C1,C2.
Condition 8 (Parameter incoherence). The true tensor components are incoherent such that
Condition 9 (Random noise). We assume the random noise follows a sub-exponential tail with parameter σ satisfying .
2). Proof of Theorem 5:
The main distinguishing part of the proof for the non-symmetric update is Lemma 16 (one-step oracle estimator), which parallels Lemma 11. For brevity, we restrict our attention to the rank-one case and only provide the theoretical development for the one-step oracle estimator in this subsection. The generalization to the general-rank case follows exactly the same idea as in the proof of the symmetric update, by incorporating the incoherence Condition 8.
For rank-one non-symmetric tensor estimation, the model (VI.1) reduces to
Suppose , , and denote s = max{s1,s2,s3}. Define , and the oracle estimator as
where has the form of
| (A.76) |
The definitions of and are similar.
Lemma 16. Let t ≥ 0 be an integer. Suppose Conditions 6–9 hold and satisfies the following upper bound
| (A.77) |
with probability at least 1 − CO(1/n). Assume the step size μ satisfies 0 < μ < μ0 for some small absolute constant μ0 and s ≤ d ≤ Cs. Then can be upper bounded as
with probability at least 1 − 12s/n.
Proof. We focus on j = 1 first. To simplify the notation, we drop the superscript for the iteration index t and denote the iteration index t + 1 by +. Moreover, denote , for j = 1, 2, 3. Then the gradient function is rewritten as
According to the definition of the thresholding function, can be written explicitly as
where , supp(γ) ⊂ F, and ‖γ‖∞ ≤ 1. Then the oracle estimation error can be decomposed into the gradient update effect and the thresholding effect,
| (A.78) |
By using the tri-convex structure of , we borrow the analysis tools for vanilla gradient descent [55] given a sufficiently good initialization. Following this proof strategy, we decompose the gradient update effect in (A.78) into three parts,
where is the noiseless gradient defined in (A.42). We will bound I1, I2, I3, and I4 successively in the following four subsections. For simplicity, in the following proof we drop the index subscript F, as we did in Section L. Moreover, approximates η*2 up to a constant by Lemma 15.
3). Bounding I1:
In this section, let us denote
| (A.79) |
where . When β2 and β3 are fixed, the update can be treated as a vanilla gradient descent update. The proof proceeds in three steps: the first two show that is Lipschitz differentiable and strongly convex on the constraint set F, and the last utilizes the classical convex gradient analysis.
Step One:
Verify that is L-Lipschitz differentiable. For any and whose supports belong to F,
Then, there exist such that
Applying Lemma 2 and multiplying by , we obtain
with probability at least 1 − 10/n3, where δn,p,s is defined in (IV.7). Under Condition (5) with some constant adjustments, we obtain
| (A.80) |
with probability at least 1 − 10/n3. Therefore, is Lipschitz differentiable with Lipschitz constant .
Step Two:
Verify that is α-strongly convex. It is equivalent to proving that . Based on inequality (3.3.19) in [63], we have
| (A.81) |
The lower bound of breaks into two parts: a lower bound for and an upper bound for . The Hessian matrix of is given by
Since ui, vi, wi are mutually independent, we have , which implies . On the other hand,
where . Equipped with Lemma 2, we obtain, with probability at least 1 − 10/n3,
Together with the lower bound of , we have
Under Condition 5, the minimum eigenvalue of the Hessian matrix is lower bounded by with probability at least 1 − 10/n3. This guarantees that is strongly convex with .
Step Three:
Combining the Lipschitz condition, strong convexity, and Lemma 3.11 in [55], we have
Since the gradient vanishes at the optimal point, multiplying the above inequality by 2μ simplifies it to
| (A.82) |
Now it suffices to bound as follows:
where L and α are the Lipschitz constant and the strong convexity parameter, respectively. If , the last term can be neglected and we obtain the desired upper bound,
| (A.83) |
with probability at least 1 − 20/n3. This completes the bound for I1. ■
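For reference, the classical ingredient from smooth, strongly convex optimization invoked in Step Three (recorded here in its standard textbook form; see [55]) is the co-coercivity bound: if f is α-strongly convex and L-smooth, then for all x, y,

\[
\langle \nabla f(x)-\nabla f(y),\,x-y\rangle\ \ge\ \frac{\alpha L}{\alpha+L}\,\|x-y\|_2^2+\frac{1}{\alpha+L}\,\|\nabla f(x)-\nabla f(y)\|_2^2 .
\]

Applied with y at the minimizer and combined with a step size μ ≤ 2/(α + L), this yields the geometric contraction of the gradient step used in (A.82).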
4). Bounding I2:
For simplicity, we write , , . By the definition of the noiseless gradient, we decompose I2 as
Repeatedly using Lemma 2, we obtain
for sufficiently small ε0, with probability at least 1 − 60/n3. Under Condition 5, we obtain
| (A.84) |
with probability at least 1 − 6/n.
5). Bounding I3:
I3 quantifies the statistical error. By the definitions of the noiseless and noisy gradients, we have
The proof of this part essentially coincides with the proof for symmetric tensor estimation. Combining Lemmas 1 and 23, we have
with probability at least 1 − 4/n. Applying a union bound over the 3s coordinates, we obtain
Therefore, we reach
with probability at least 1 − 12s/n.
6). Bounding I4:
According to the definition of the thresholding level h(β1) in (A.76), we can bound the squared term as follows:
Based on the basic inequality (a + b)2 ≤ 2(a2 + b2), we have
Denote by I1 and I2 the terms corresponding to the optimization error and the statistical error, respectively,
Next, I1 is decomposed into high-order polynomial terms as follows
| (A.85) |
Each term contains products of Gaussian random vectors of order up to ten. For the first term, by Lemma 1,
with probability at least 1 − 1/n. Similar bounds hold for the other terms. As long as n ≥ C(log n)^10, we have, with probability at least 1 − 7/n,
| (A.86) |
Now we turn to bounding I2. For fixed {ϵi}, we have
with probability at least 1 − n−1. Combining with Lemma 23,
| (A.87) |
Putting (A.86) and (A.87) together, the thresholding effect can be bounded by
| (A.88) |
with probability at least 1 − 8/n, provided n ≳ (log n)^10. ■
7). Summary:
Putting the upper bounds (A.83), (A.84), and (A.88) together, we obtain that if the step size μ satisfies 0 < μ < μ0 for some small μ0,
with probability at least 1−12s/n. This finishes our proof. ■
P. Matrix-Form Gradient and Stochastic Gradient Descent
1). Matrix Formulation of Gradient:
In this section, we provide detailed derivations of (III.7) and (VI.5).
Lemma A.1. Let and . The gradient of symmetric tensor estimation empirical risk function (III.5) can be written in a matrix form as follows
Proof. First, consider the gradient for the k-th component,
for k = 1,…,K. Correspondingly, each part can be written in matrix form,
This implies that . Note that . The conclusion can be easily derived. ■
Lemma 17. Let , . The gradient of non-symmetric tensor estimation empirical risk function (VI.3) can be written in a matrix form as follows
where em and .
Proof. Recall that * and ⊙ denote the Hadamard product and the Khatri–Rao product, respectively. The dimensions of D, C1, and C1 ⊙ U can then be calculated as follows:
Therefore,
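To make the matrix-form computation concrete, the following minimal numpy sketch (illustrative only; the dimensions are arbitrary and the helper is not the paper's implementation) spells out the Khatri–Rao product ⊙ as a column-wise Kronecker product, while the Hadamard product * is plain entry-wise multiplication:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (I x K) and (J x K) -> (I*J x K)."""
    I, K = A.shape
    J, K2 = B.shape
    assert K == K2, "both factors need the same number of columns"
    # column k of the result is kron(A[:, k], B[:, k])
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, K)

rng = np.random.default_rng(0)
p2, p3, K = 5, 4, 3
V = rng.normal(size=(p2, K))
W = rng.normal(size=(p3, K))
print(khatri_rao(V, W).shape)   # (p2 * p3, K)
```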
2). Stochastic Gradient Descent:
Stochastic thresholded gradient descent is a stochastic approximation of the gradient descent method. Note that the empirical risk function (III.5) can be written as a sum of differentiable functions. Following (III.7), the gradient of (III.5) evaluated at the i-th sketching {yi, xi} can be written as
Thus, the overall gradient defined in (III.7) can be expressed as a sum of ,
The thresholding step remains the same as Step 3 in Algorithm 1. The symmetric update of stochastic thresholded gradient descent within one iteration is then summarized as
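The following sketch illustrates one such stochastic update for the symmetric model (a minimal sketch only: it assumes the plain least-squares risk with the component weights absorbed into the columns of B, and uses keep-top-s hard thresholding as a stand-in for the paper's thresholding step):

```python
import numpy as np

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-s:]
    out[idx] = v[idx]
    return out

def stochastic_thresholded_step(B, x_i, y_i, mu, s):
    """One stochastic thresholded gradient step, symmetric cubic sketching.

    B   : (p, K) matrix whose columns are the current components.
    x_i : (p,) sketching vector;  y_i : scalar response.
    mu  : step size;  s : sparsity level kept after thresholding.
    """
    proj = B.T @ x_i                                   # beta_k^T x_i for all k
    residual = y_i - np.sum(proj ** 3)                 # y_i - sum_k (beta_k^T x_i)^3
    grad = -3.0 * residual * np.outer(x_i, proj ** 2)  # gradient of 0.5 * residual^2 w.r.t. B
    B_new = B - mu * grad
    return np.column_stack([hard_threshold(B_new[:, k], s) for k in range(B.shape[1])])
```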
Q. Technical Lemmas
Lemma 18. Suppose is a standard Gaussian random vector. For any non-random vector , we have the following tensor expectation calculation,
| (A.89) |
where em is a canonical vector in .
Proof. Recall that for a standard Gaussian random variable x, its odd moments are zero and its even moments are , . Expanding the LHS of (A.89) and comparing the two sides, we reach the conclusion. Details are omitted here. ■
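Explicitly, the standard Gaussian moments used in this expansion are

\[
\mathbb{E}[x^{2m-1}]=0,\qquad \mathbb{E}[x^{2m}]=(2m-1)!!=\frac{(2m)!}{2^m\,m!},\qquad m=1,2,\dots
\]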
Lemma 19. Suppose , , are independent standard Gaussian random vectors. For any non-random vector , , , we have the following tensor expectation calculation
| (A.90) |
Proof. Due to the independence among u,v,w, the conclusion is easy to obtain by using the moment of standard Gaussian random variable. ■
Note that the left-hand side of (A.89) involves the expectation of a rank-one tensor. Multiplying both sides by any non-random rank-one tensor of the same dimensions, i.e., a1 ∘ b1 ∘ c1, facilitates the calculation of expectations of products of Gaussian vectors; see the next lemma for details.
Lemma A.2. Suppose is a standard Gaussian random vector. For any non-random vector , we have the following expectation calculation
Proof. Note that . Then we can apply the general result in Lemma 18. Comparing both sides, we obtain the conclusion. The other parts follow a similar strategy. ■
The next lemma provides a probabilistic concentration bound for non-symmetric rank-one tensors under the tensor spectral norm.
Lemma 20. Suppose , , are three n × p random matrices. The ψ2-norm of each entry is bounded: ‖Xij‖ψ2 = Kx, ‖Yij‖ψ2 = Ky, ‖Zij‖ψ2 = Kz. We assume the rows of X, Y, Z are independent. Then there exists an absolute constant C such that
Here, ‖ · ‖s is the sparse tensor spectral norm defined in (II.3) and .
Proof. Bounding a spectral norm typically relies on the construction of an ϵ-net. Since we bound a sparse tensor spectral norm, our strategy is to discretize the sparse set and construct an ϵ-net on each piece. Let us define a sparse set , and let be the s-dimensional set defined by . Note that corresponds to the set of s-sparse unit vectors, which can be expressed as a union of subsets of dimension s by padding zeros, namely . There are at most such subsets.
Recalling the definition of sparse tensor spectral norm in (II.3), we have
Instead of constructing an ϵ-net on , we construct an ϵ-net for each of the subsets . Define as the 1/2-net of . From Lemma 3.18 in [64], the cardinality of is bounded by 5^s. By Lemma 21, we obtain
| (A.91) |
By the rotation invariance of sub-Gaussian random variables, , , are still sub-Gaussian random variables with ψ2-norms bounded by Kx, Ky, Kz, respectively. Applying Lemma 1 and a union bound over , the right-hand side of (A.91) can be bounded by
with probability smaller than (5^s)^3 δ for any 0 < δ < 1.
Lastly, taking the union bound over all possible subsets yields that
Letting , we obtain with probability at least 1 − 1/p,
with some adjustments of the constant C. The proof for the symmetric case is similar to the non-symmetric case, so we omit it here. ■
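To illustrate the quantity being controlled, the following brute-force sketch (purely illustrative, feasible only for very small p and s; it is not the paper's algorithm) approximates the s-sparse tensor spectral norm of (II.3) by enumerating support triples and running a few higher-order power iterations on each subtensor. The enumeration over support triples mirrors the union bound over subsets taken in the proof above.

```python
import itertools
import numpy as np

def rank1_power(T, iters=50, seed=0):
    """Approximate max over unit u, v, w of |<T, u o v o w>| by power iteration."""
    rng = np.random.default_rng(seed)
    p1, p2, p3 = T.shape
    v = rng.normal(size=p2); v /= np.linalg.norm(v)
    w = rng.normal(size=p3); w /= np.linalg.norm(w)
    for _ in range(iters):
        u = np.einsum('ijk,j,k->i', T, v, w); u /= np.linalg.norm(u)
        v = np.einsum('ijk,i,k->j', T, u, w); v /= np.linalg.norm(v)
        w = np.einsum('ijk,i,j->k', T, u, v); w /= np.linalg.norm(w)
    return abs(np.einsum('ijk,i,j,k->', T, u, v, w))

def sparse_spectral_norm(T, s):
    """Brute-force approximation of the s-sparse tensor spectral norm."""
    p1, p2, p3 = T.shape
    best = 0.0
    for S1 in itertools.combinations(range(p1), s):
        for S2 in itertools.combinations(range(p2), s):
            for S3 in itertools.combinations(range(p3), s):
                best = max(best, rank1_power(T[np.ix_(S1, S2, S3)]))
    return best
```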
Lemma 21 (Tensor Covering Number(Lemma 4 in [65])). Let be an ϵ-net for a set B associated with a norm ‖ · ‖. Then, the spectral norm of a d-mode tensor is bounded by
This immediately implies that the spectral norm of a d-mode tensor is bounded by
where is the ϵ-net for the unit sphere in .
Lemma 22 (Sub-Gaussianity of the Product of Random Variables). Suppose X1 is a bounded random variable with |X1| ≤ K1 almost surely for some K1, and X2 is a sub-Gaussian random variable with Orlicz norm ‖X2‖ψ2 ≤ K2. Then X1X2 is still a sub-Gaussian random variable with Orlicz norm ‖X1X2‖ψ2 ≤ K1K2.
Proof: Following the definition of sub-Gaussian random variable, we have
holds for all t ≥ 0. This ends the proof. ■
Lemma 23 (Tail Probability for the Sum of Sub-exponential Random Variables (Lemma A.7 in [48])). Suppose ϵ1,…, ϵn are independent centered sub-exponential random variables with
Then with probability at least 1 − 3/n, we have
for some constant C0
Lemma 24 (Tail Probability for the Sum of Weibull Distributions (Lemma 3.6 in [34])). Let α ∈ [1,2] and Y1,…,Yn be independent symmetric random variables satisfying. Then for every vector and every t ≥ 0,
Proof. It is a combination of Corollaries 2.9 and 2.10 in [58].
Lemma 25 (Moments for the Sum of Weibull Distributions (Corollary 1.2 in [66])). Let X1,X2,…,Xn be a sequence of independent symmetric random variables satisfying , where 0 < α < 1. Then, for p ≥ 2 and some constant C(α) which depends only on α,
Lemma 26 (Stein’s Lemma [56]). Let be a random vector with joint density function p(x). Suppose the score function ∇x logp(x) exists. Consider any continuously differentiable function . Then, we have
Lemma 27 (Khinchin–Kahane Inequality (Theorem 1.3.1 in [67])). Let be a finite non-random sequence, let be a sequence of independent Rademacher variables, and let 1 < p < q < ∞. Then
Lemma 28. Suppose each non-zero element of is drawn from standard Gaussian distribution and ‖xk‖0 ≤ s for k ∈ [K]. Then we have for any 0 < δ ≤ 1,
where C is some constant.
Proof. Let us denote as an index set such that for any , we have and . From the definition of , we know that and . We apply standard Hoeffding’s concentration inequality,
Letting ct2/s = log(1/δ), we reach the conclusion.
Contributor Information
Botao Hao, Department of Electrical Engineering, Princeton University, Princeton, NJ 08540.
Anru Zhang, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706.
Guang Cheng, Department of Statistics, Purdue University, West Lafayette, IN 47906.
References
- [1].Kroonenberg PM, Applied Multiway Data Analysis. Wiley Series in Probability and Statistics, 2008.
- [2].Kolda T and Bader B, “Tensor decompositions and applications,” SIAM Review, vol. 51, pp. 455–500, 2009.
- [3].Zhou H, Li L, and Zhu H, “Tensor regression with applications in neuroimaging data analysis,” Journal of the American Statistical Association, vol. 108, pp. 540–552, 2013.
- [4].Li X, Xu D, Zhou H, and Li L, “Tucker tensor regression and neuroimaging analysis,” Statistics in Biosciences, vol. 10, no. 3, pp. 520–545, 2018.
- [5].Sun WW and Li L, “Store: sparse tensor response regression and neuroimaging analysis,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 4908–4944, 2017.
- [6].Caiafa CF and Cichocki A, “Multidimensional compressed sensing and their applications,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 3, no. 6, pp. 355–380, 2013.
- [7].Friedland S, Li Q, and Schonfeld D, “Compressive sensing of sparse tensors,” IEEE Transactions on Image Processing, vol. 23, no. 10, pp. 4438–4447, 2014.
- [8].Liu J, Musialski P, Wonka P, and Ye J, “Tensor completion for estimating missing values in visual data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 208–220, 2013.
- [9].Yuan M and Zhang C-H, “On tensor completion via nuclear norm minimization,” Foundations of Computational Mathematics, vol. 16, no. 4, pp. 1031–1068, 2016.
- [10].Yuan M and Zhang C-H, “Incoherent tensor norms and their applications in higher order tensor completion,” IEEE Transactions on Information Theory, vol. 63, no. 10, pp. 6753–6766, 2017.
- [11].Zhang A, “Cross: Efficient low-rank tensor completion,” The Annals of Statistics, vol. 47, no. 2, pp. 936–964, 2019.
- [12].Montanari A and Sun N, “Spectral algorithms for tensor completion,” Communications on Pure and Applied Mathematics, vol. 71, no. 11, pp. 2381–2425, 2018.
- [13].Ghadermarzy N, Plan Y, and Yilmaz Ö, “Near-optimal sample complexity for convex tensor completion,” Information and Inference: A Journal of the IMA, vol. 8, no. 3, pp. 577–619, 2018.
- [14].Zhang Z and Aeron S, “Exact tensor completion using t-svd,” IEEE Transactions on Signal Processing, vol. 65, no. 6, pp. 1511–1526, 2016.
- [15].Bengua JA, Phien HN, Tuan HD, and Do MN, “Efficient tensor completion for color image and video recovery: Low-rank tensor train,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2466–2479, 2017.
- [16].Raskutti G, Yuan M, Chen H, et al., “Convex regularization for high-dimensional multiresponse tensor regression,” The Annals of Statistics, vol. 47, no. 3, pp. 1554–1584, 2019.
- [17].Chen H, Raskutti G, and Yuan M, “Non-convex projected gradient descent for generalized low-rank tensor regression,” The Journal of Machine Learning Research, vol. 20, no. 1, pp. 172–208, 2019.
- [18].Li L and Zhang X, “Parsimonious tensor response regression,” Journal of the American Statistical Association, pp. 1–16, 2017.
- [19].Zhang A, Luo Y, Raskutti G, and Yuan M, “Islet: Fast and optimal low-rank tensor regression via importance sketching,” arXiv preprint arXiv:1911.03804, 2019.
- [20].Romera-Paredes B, Aung MH, Bianchi-Berthouze N, and Pontil M, “Multilinear multitask learning,” in Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, pp. III–1444–III–1452, JMLR.org, 2013.
- [21].Bien J, Taylor J, Tibshirani R, et al., “A lasso for hierarchical interactions,” The Annals of Statistics, vol. 41, no. 3, pp. 1111–1141, 2013.
- [22].Hao N and Zhang HH, “Interaction screening for ultrahigh-dimensional data,” Journal of the American Statistical Association, vol. 109, no. 507, pp. 1285–1301, 2014.
- [23].Fan Y, Kong Y, Li D, and Lv J, “Interaction pursuit with feature screening and selection,” arXiv preprint arXiv:1605.08933, 2016.
- [24].Basu S, Kumbier K, Brown JB, and Yu B, “Iterative random forests to discover predictive and stable high-order interactions,” Proceedings of the National Academy of Sciences, p. 201711236, 2018.
- [25].Li N and Li B, “Tensor completion for on-board compression of hyperspectral images,” in 2010 IEEE International Conference on Image Processing, pp. 517–520, IEEE, 2010.
- [26].Vasilescu MAO and Terzopoulos D, “Multilinear subspace analysis of image ensembles,” in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, vol. 2, pp. II–93, IEEE, 2003.
- [27].Sun WW, Lu J, Liu H, and Cheng G, “Provable sparse tensor decomposition,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 79, no. 3, pp. 899–916, 2017.
- [28].Rauhut H, Schneider R, and Stojanac Ž, “Low rank tensor recovery via iterative hard thresholding,” Linear Algebra and its Applications, vol. 523, pp. 220–262, 2017.
- [29].Li X, Haupt J, and Woodruff D, “Near optimal sketching of low-rank tensor regression,” in Advances in Neural Information Processing Systems, pp. 3466–3476, 2017.
- [30].Wang Z, Liu H, and Zhang T, “Optimal computational and statistical rates of convergence for sparse nonconvex learning problems,” The Annals of Statistics, vol. 42, no. 6, p. 2164, 2014.
- [31].Loh P-L and Wainwright MJ, “Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima,” Journal of Machine Learning Research, vol. 16, pp. 559–616, 2015.
- [32].Cai TT and Zhang A, “Rop: Matrix recovery via rank-one projections,” The Annals of Statistics, vol. 43, no. 1, pp. 102–138, 2015.
- [33].Chen Y, Chi Y, and Goldsmith AJ, “Exact and stable covariance estimation from quadratic sampling via convex programming,” IEEE Transactions on Information Theory, vol. 61, no. 7, pp. 4034–4059, 2015.
- [34].Adamczak R, Litvak AE, Pajor A, and Tomczak-Jaegermann N, “Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling,” Constructive Approximation, vol. 34, no. 1, pp. 61–88, 2011.
- [35].Candès EJ and Recht B, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, p. 717, 2009.
- [36].Keshavan RH, Montanari A, and Oh S, “Matrix completion from a few entries,” IEEE Transactions on Information Theory, vol. 56, no. 6, pp. 2980–2998, 2010.
- [37].Koltchinskii V, Lounici K, and Tsybakov AB, “Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion,” The Annals of Statistics, pp. 2302–2329, 2011.
- [38].Richard E and Montanari A, “A statistical model for tensor pca,” in Advances in Neural Information Processing Systems 27 (Ghahramani Z, Welling M, Cortes C, Lawrence ND, and Weinberger KQ, eds.), pp. 2897–2905, Curran Associates, Inc., 2014.
- [39].Zhang A and Xia D, “Tensor SVD: Statistical and computational limits,” IEEE Transactions on Information Theory, vol. 64, pp. 7311–7338, November 2018.
- [40].Mu C, Huang B, Wright J, and Goldfarb D, “Square deal: Lower bounds and improved relaxations for tensor recovery,” in Proceedings of the 31st International Conference on Machine Learning (Xing EP and Jebara T, eds.), vol. 32 of Proceedings of Machine Learning Research, (Bejing, China), pp. 73–81, PMLR, 22–24 June 2014.
- [41].Friedland S and Lim L-H, “Nuclear norm of higher-order tensors,” Mathematics of Computation, vol. 87, no. 311, pp. 1255–1281, 2018.
- [42].Hillar CJ and Lim L-H, “Most tensor problems are np-hard,” Journal of the ACM (JACM), vol. 60, no. 6, p. 45, 2013.
- [43].Anandkumar A, Ge R, Hsu D, Kakade SM, and Telgarsky M, “Tensor decompositions for learning latent variable models,” Journal of Machine Learning Research, vol. 15, pp. 2773–2832, 2014.
- [44].Donoho DL, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
- [45].Chandrasekaran V, Sanghavi S, Parrilo PA, and Willsky AS, “Rank-sparsity incoherence for matrix decomposition,” SIAM Journal on Optimization, vol. 21, no. 2, pp. 572–596, 2011.
- [46].Arora S, Ge R, and Moitra A, “New algorithms for learning incoherent and overcomplete dictionaries,” in Proceedings of The 27th Conference on Learning Theory (Balcan MF, Feldman V, and Szepesvári C, eds.), vol. 35 of Proceedings of Machine Learning Research, (Barcelona, Spain), pp. 779–806, PMLR, 13–15 June 2014.
- [47].Zhang Y, Chen X, Zhou D, and Jordan MI, “Spectral methods meet em: A provably optimal algorithm for crowd-sourcing,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 3537–3580, 2016.
- [48].Cai TT, Li X, and Ma Z, “Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow,” The Annals of Statistics, vol. 44, no. 5, pp. 2221–2251, 2016.
- [49].Janzamin M, Sedghi H, and Anandkumar A, “Score function features for discriminative learning: matrix and tensor framework,” arXiv preprint arXiv:1412.2863, 2014.
- [50].Anandkumar A, Ge R, and Janzamin M, “Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates,” arXiv preprint arXiv:1402.5180, 2014.
- [51].Cai C, Li G, Poor HV, and Chen Y, “Nonconvex low-rank symmetric tensor completion from noisy data,” arXiv preprint arXiv:1911.04436, 2019.
- [52].Candès EJ, Li X, and Soltanolkotabi M, “Phase retrieval via wirtinger flow: Theory and algorithms,” IEEE Transactions on Information Theory, vol. 61, pp. 1985–2007, April 2015.
- [53].Hung H, Lin Y-T, Chen P, Wang C-C, Huang S-Y, and Tzeng J-Y, “Detection of gene–gene interactions using multistage sparse and low-rank regression,” Biometrics, vol. 72, no. 1, pp. 85–94, 2016.
- [54].Sidiropoulos ND and Kyrillidis A, “Multi-way compressed sensing for sparse low-rank tensors,” IEEE Signal Processing Letters, vol. 19, no. 11, pp. 757–760, 2012.
- [55].Bubeck S, Foundations and Trends in Machine Learning, ch. Convex Optimization: Algorithms and Complexity, pp. 231–357. 2015.
- [56].Stein C, Diaconis P, Holmes S, Reinert G, et al., “Use of exchangeable pairs in the analysis of simulations,” in Stein’s Method, pp. 1–25, Institute of Mathematical Statistics, 2004.
- [57].Hitczenko P, Montgomery-Smith S, and Oleszkiewicz K, “Moment inequalities for sums of certain independent symmetric random variables,” Studia Math, vol. 123, no. 1, pp. 15–42, 1997.
- [58].Talagrand M, “The supremum of some canonical processes,” American Journal of Mathematics, vol. 116, no. 2, pp. 283–325, 1994.
- [59].Vershynin R, Compressed sensing, ch. Introduction to the non-asymptotic analysis of random matrices, pp. 210–268. Cambridge Univ. Press, 2012.
- [60].Yu B, “Assouad, fano, and le cam,” Festschrift for Lucien Le Cam, vol. 423, p. 435, 1997.
- [61].Ledoux M and Talagrand M, Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
- [62].Tu S, Boczar R, Simchowitz M, Soltanolkotabi M, and Recht B, “Low-rank solutions of linear matrix equations via procrustes flow,” in Proceedings of The 33rd International Conference on Machine Learning (Balcan MF and Weinberger KQ, eds.), vol. 48 of Proceedings of Machine Learning Research, (New York, New York, USA), pp. 964–973, PMLR, 20–22 June 2016.
- [63].Horn RA and Johnson CR, Matrix Analysis. New York: Cambridge Univ. Press, 1988.
- [64].Ledoux M, The concentration of measure phenomenon. No. 89, American Mathematical Soc., 2005.
- [65].Nguyen NH, Drineas P, and Tran TD, “Tensor sparsification via a bound on the spectral norm of random tensors,” Information and Inference: A Journal of the IMA, vol. 4, no. 3, pp. 195–229, 2015.
- [66].Bogucki R, “Suprema of canonical weibull processes,” Statistics & Probability Letters, vol. 107, pp. 253–263, 2015.
- [67].De la Peña V and Giné E, Decoupling: from dependence to independence. Springer Science & Business Media, 2012.
