Abstract
In this paper, we propose a general framework for sparse and low-rank tensor estimation from cubic sketchings. A two-stage non-convex implementation is developed based on sparse tensor decomposition and thresholded gradient descent, which ensures exact recovery in the noiseless case and stable recovery in the noisy case with high probability. The non-asymptotic analysis sheds light on an interplay between optimization error and statistical error. The proposed procedure is shown to be rate-optimal under certain conditions. As a technical by-product, novel high-order concentration inequalities are derived for studying high-moment sub-Gaussian tensors. An interesting tensor formulation illustrates the potential application to high-order interaction pursuit in high-dimensional linear regression.
I. Introduction
The rapid advance in modern scientific technology gives rise to a wide range of high-dimensional tensor data [1, 2]. Accurate estimation and fast communication/processing of tensor-valued parameters are crucially important in practice. For example, a tensor-valued predictor which characterizes the association between brain diseases and scientific measurements becomes the point of interest [3, 4, 5]. Another example is the tensor-valued image acquisition algorithm that can considerably reduce the number of required samples by exploiting the compressibility property of signals [6, 7].
The following tensor estimation model is widely considered in the recent literature,
yi = ⟨𝒜i, 𝒯⟩ + ϵi, i = 1, …, n. | (I.1) |
Here, 𝒜i and ϵi are the measurement tensor and the noise, respectively. The goal is to estimate the unknown tensor 𝒯 from the measurements {(yi, 𝒜i) : i = 1, …, n}. A number of specific settings with varying forms of 𝒜i have been studied, e.g., tensor completion [8, 9, 10, 11, 12, 13, 14, 15], tensor regression [5, 3, 4, 16, 17, 18, 19], multi-task learning [20], etc.
In this paper, we focus on the case where the measurement tensor can be written in a cubic sketching form, i.e., 𝒜i = xi ∘ xi ∘ xi or 𝒜i = ui ∘ vi ∘ wi, depending on whether 𝒯 is symmetric or not. The cubic sketching form of 𝒜i is motivated by a number of applications.
Interaction effect estimation: High-dimensional high-order interaction models have been considered under a variety of settings [21, 22, 23, 24]. By writing , we find that the interaction model has an interesting tensor representation (see left panel of Figure 1) which allows us to estimate high-order interaction terms using tensor techniques. This is in contrast with the existing literature that mostly focused on pair-wise interactions due to the model complexity and computational difficulties. More detailed discussions will be provided in Section V.
High-order imaging/video compression: High-order image/video compression is an important task in modern digital imaging with various applications (see the right panel of Figure 1), such as hyper-spectral imaging analysis [25] and facial image recognition [26]. One could use Gaussian ensembles for compression, with each entry of 𝒜i generated i.i.d. at random [3, 16, 17]. In contrast, the non-symmetric cubic sketchings, i.e., 𝒜i = ui ∘ vi ∘ wi, reduce the memory storage from O(np1p2p3) to O(n(p1 + p2 + p3)) (n is the sample size and (p1, p2, p3) is the tensor dimension) while still preserving the optimal statistical rate. More detailed discussions will be provided in Section VI.
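To make the storage comparison concrete, here is a minimal numpy sketch (variable names and sizes are ours, purely illustrative): it stores only the sketching vectors ui, vi, wi and evaluates yi = ⟨ui ∘ vi ∘ wi, 𝒯⟩ by contraction, so an n × p1 × p2 × p3 Gaussian ensemble is never materialized.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, p3, K = 100, 20, 30, 40, 2

# A low-rank target tensor T = sum_k eta_k * b1k o b2k o b3k (illustrative only).
eta = np.array([3.0, 1.5])
B1, B2, B3 = (rng.standard_normal((p, K)) for p in (p1, p2, p3))
T = np.einsum('k,ik,jk,lk->ijl', eta, B1, B2, B3)

# Cubic sketchings: only the sketching vectors are stored, O(n(p1+p2+p3)) numbers.
U = rng.standard_normal((n, p1))
V = rng.standard_normal((n, p2))
W = rng.standard_normal((n, p3))

# y_i = <u_i o v_i o w_i, T>, computed without forming any p1 x p2 x p3 measurement tensor.
y = np.einsum('ijl,ni,nj,nl->n', T, U, V, W)

print(y.shape)                                   # (100,)
print(n * (p1 + p2 + p3), n * p1 * p2 * p3)      # 9,000 stored sketching values vs 2,400,000
```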
In practice, the total number of measurements n is considerably smaller than the number of parameters in the unknown tensor 𝒯, due to restrictions such as time and storage. Fortunately, many high-dimensional tensor data possess intrinsic structures, such as low-rankness [2] and sparsity [27]. These structures can greatly reduce the effective dimension of the parameter and make accurate estimation possible. Please refer to (III.2) and (VI.2) for the low-rankness and sparsity assumptions.
Fig. 1. Illustration of the interaction reformulation and tensor image/video compression.
In this paper, we propose a computationally efficient non-convex optimization approach for sparse and low-rank tensor estimation via cubic-sketchings. Our procedure is two-stage:
obtain an initial estimate via the method of tensor moment (motivated by high-order Stein’s identity), and then apply sparse tensor decomposition to the initial estimate to output a warm start;
use a thresholded gradient descent to iteratively refine the warm start in each tensor mode until convergence.
Theoretically, we carefully characterize the optimization and statistical errors at each iteration. The output estimate is shown to converge at a geometric rate to an estimator whose statistical error (in terms of the tensor Frobenius norm) is minimax optimal. In particular, after a logarithmic number of iterations and under the sample complexity requirement of Condition 5, the proposed estimator achieves
‖𝒯̂ − 𝒯‖F2 ≤ C1 K s σ2 log p / n | (I.2) |
with high probability, where s, K, p, and σ2 are the sparsity, rank, dimension, and noise level, respectively. We further establish the matching minimax lower bound to show that (I.2) is indeed optimal over a large class of sparse low-rank tensors. Our optimality result can be further extended to the non-sparse case (such as tensor regression [3, 17, 28, 29]) – to the best of our knowledge, this is the first statistical rate optimality result in both sparse and non-sparse low-rank tensor regressions.
The above theoretical analysis is non-trivial due to the non-convexity of the empirical risk function and the need to develop new high-order sub-Gaussian concentration inequalities. Specifically, the empirical risk function under consideration satisfies neither the restricted strong convexity (RSC) condition nor the sparse eigenvalue (SE) condition in general. Thus, many previous results, such as those based on local optima analysis [30, 31, 17], are not directly applicable. Moreover, the structure of the cubic-sketching tensor leads to high-order products of sub-Gaussian random variables, so matrix analyses based on Hoeffding-type or Bernstein-type concentration inequalities [32, 33] lead to sub-optimal statistical rates and sample complexities. This motivates us to develop new high-order concentration inequalities and a sparse tensor-spectral-type bound, i.e., Lemmas 1 and 2 in Section IV-C. These new technical results are obtained through careful partial truncation of high-order products of sub-Gaussian random variables and a bounded ψα-norm argument [34], and may be of independent interest.
The literature on low-rank matrix estimation, e.g., the spectral method and nuclear norm minimization [35, 36, 37], is also related to this work. However, our cubic sketching model is by no means a simple extension of matrix estimation problems. In general, many related concepts or methods for matrix data, such as the singular value decomposition, are problematic to apply in the tensor framework [38, 39]. It has also been found that simply unfolding or matricizing tensors may lead to suboptimal results due to the loss of structural information [40]. Technically, the tensor nuclear norm is NP-hard even to approximate [9, 10, 41], and thus the methods for handling tensor low-rankness are distinct from the matrix case.
The rest of the paper is organized as follows. Section II provides preliminaries on notation and basic knowledge of tensor. A two-stage method for symmetric tensor estimation is proposed in Section III, with the corresponding theoretical analysis given in Section IV. A concrete application to high-order interaction effect models is described in Section V. The non-symmetric tensor estimation model is introduced and discussed in Section VI. Numerical analysis is provided in Section VII to support the proposed procedure and theoretical results of this paper. Section VIII discusses extensions to higher-order tensors. The proofs of technical results are given in supplementary materials.
II. Preliminary
Throughout the paper, vectors, matrices, and tensors are denoted by boldface lower-case letters (e.g., x, y), boldface upper-case letters (e.g., X, Y), and script letters (e.g., 𝒳, 𝒴), respectively. For any set A, let |A| be its cardinality. diag(x) denotes the diagonal matrix generated by x. For two vectors x and y, x ∘ y is their outer product. Define ‖x‖q ≔ (|x1|q + ⋯ + |xp|q)1/q. We also define the l0 quasi-norm ‖x‖0 = #{j : xj ≠ 0} and the l∞ norm ‖x‖∞ = max1≤j≤p |xj|. Denote the set {1,2,…,n} by [n]. Let ej be the j-th canonical vector, whose j-th entry equals 1 and all other entries equal zero. For any two sequences {an}, {bn}, we say an ≲ bn if there exist a positive constant C0 and a sufficiently large n0 such that |an| ≤ C0bn for all n ≥ n0. We also write an ≍ bn if there exist C, c > 0 such that can ≤ bn ≤ Can for all n ≥ 1. Additionally, C1, C2, …, c1, c2, … are generic constants, whose actual values may differ from line to line.
We next introduce notation and operations for matrices. For matrices A ∈ ℝI×J and B ∈ ℝK×L, their Kronecker product is the (IK)-by-(JL) matrix A ⊗ B = [a1 ⊗ B ⋯ aJ ⊗ B], where aj ⊗ B = (aj1B⊤,…,ajIB⊤)⊤. If A and B have the same number of columns, J = L, the Khatri-Rao product is defined as A ⊙ B = [a1 ⊗ b1 ⋯ aJ ⊗ bJ]. If A and B are of the same dimension, the Hadamard product A ∗ B is their element-wise matrix product, (A∗B)ij = Aij · Bij. For a matrix A, vec(A) denotes its vectorization, and column-wise norms are defined analogously to the vector norms above.
Finally, we introduce tensor notation and relevant operations; interested readers are referred to [2] for more details. Suppose 𝒯 ∈ ℝp1×p2×p3 is an order-3 tensor, whose (i,j,k)-th element is denoted by 𝒯ijk. The successive tensor-vector multiplication with vectors u ∈ ℝp1, v ∈ ℝp2, w ∈ ℝp3 is denoted by 𝒯 ×1 u ×2 v ×3 w = Σi,j,k 𝒯ijk ui vj wk. We say 𝒯 is rank-one if it can be written as the outer product of three vectors, i.e., 𝒯 = a ∘ b ∘ c, or 𝒯ijk = ai bj ck for all i, j, k. Here "∘" represents the vector outer product. We say 𝒯 is symmetric if 𝒯ijk is invariant under any permutation of the indices i, j, k. Then 𝒯 is rank-one and symmetric if and only if it can be decomposed as 𝒯 = x ∘ x ∘ x for some vector x.
More generally, we may decompose a tensor as the sum of rank one tensors as follows,
𝒯 = Σk=1K ηk β1k ∘ β2k ∘ β3k, | (II.1) |
where ηk > 0, β1k ∈ ℝp1, β2k ∈ ℝp2, β3k ∈ ℝp3, and ‖β1k‖2 = ‖β2k‖2 = ‖β3k‖2 = 1. This is the so-called CANDECOMP/PARAFAC, or CP, decomposition [2], with the CP-rank defined as the minimum number K such that (II.1) holds. The vectors {β1k}, {β2k}, {β3k} are called the factors along the first, second, and third modes, respectively. Note that the factors are normalized as unit vectors to guarantee the uniqueness of the decomposition, and η = {η1,…,ηK} plays a role analogous to the singular values in the matrix singular value decomposition. Several tensor norms also need to be introduced. The tensor Frobenius norm and tensor spectral norm are defined respectively as
‖𝒯‖F = (Σi,j,k 𝒯ijk2)1/2, ‖𝒯‖ = sup 𝒯 ×1 u ×2 v ×3 w, | (II.2) |
where the supremum is taken over unit vectors with ‖u‖2 = ‖v‖2 = ‖w‖2 = 1. Clearly, ‖𝒯‖ ≤ ‖𝒯‖F. We also consider the following sparse tensor spectral norm,
‖𝒯‖s = sup{𝒯 ×1 u ×2 v ×3 w : ‖u‖2 = ‖v‖2 = ‖w‖2 = 1, ‖u‖0, ‖v‖0, ‖w‖0 ≤ s}. | (II.3) |
By definition, ‖𝒯‖s ≤ ‖𝒯‖. Suppose 𝒜 = a1 ∘ a2 ∘ a3 and ℬ = b1 ∘ b2 ∘ b3 are two rank-one tensors. Then it is easy to check that ‖𝒜‖F = ‖a1‖2‖a2‖2‖a3‖2 and ⟨𝒜, ℬ⟩ = ⟨a1, b1⟩⟨a2, b2⟩⟨a3, b3⟩.
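As a quick numerical illustration of this notation (a sketch with arbitrary sizes, not code from the paper), the following builds a symmetric rank-K tensor from CP factors as in (II.1) and checks the rank-one identities stated above.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K = 8, 3

# Unit-norm CP factors and positive weights eta_1 >= ... >= eta_K.
B = rng.standard_normal((p, K))
B /= np.linalg.norm(B, axis=0)
eta = np.array([5.0, 2.0, 1.0])

# T = sum_k eta_k * beta_k o beta_k o beta_k (symmetric CP form).
T = np.einsum('k,ik,jk,lk->ijl', eta, B, B, B)
fro_norm = np.sqrt((T ** 2).sum())               # tensor Frobenius norm

# Rank-one identities: ||a1 o a2 o a3||_F = ||a1|| ||a2|| ||a3|| and
# <a1 o a2 o a3, b1 o b2 o b3> = <a1,b1><a2,b2><a3,b3>.
a1, a2, a3 = rng.standard_normal((3, p))
b1, b2, b3 = rng.standard_normal((3, p))
A = np.einsum('i,j,k->ijk', a1, a2, a3)
C = np.einsum('i,j,k->ijk', b1, b2, b3)
assert np.isclose(np.sqrt((A ** 2).sum()),
                  np.linalg.norm(a1) * np.linalg.norm(a2) * np.linalg.norm(a3))
assert np.isclose((A * C).sum(), (a1 @ b1) * (a2 @ b2) * (a3 @ b3))
```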
III. Symmetric Tensor Estimation Via Cubic Sketchings
In this section, we focus on the estimation of a sparse and low-rank symmetric tensor 𝒯 ∈ ℝp×p×p from cubic sketchings,
yi = ⟨xi ∘ xi ∘ xi, 𝒯⟩ + ϵi, i = 1, …, n, | (III.1) |
where xi ∈ ℝp are random vectors with i.i.d. standard normal entries. As previously discussed, the tensor parameter 𝒯 often satisfies certain low-dimensional structures in practice, among which factor-wise sparsity and low-rankness [16] commonly appear. We thus assume 𝒯 is of CP rank K for K ≪ p and the corresponding factors are sparse,
𝒯 = Σk=1K ηk βk ∘ βk ∘ βk, ‖βk‖2 = 1, ‖βk‖0 ≤ s, k ∈ [K]. | (III.2) |
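A minimal simulation of model (III.1) with the sparse factors in (III.2) (a sketch with made-up sizes, in the spirit of the experiments in Section VII) is given below; it also verifies that the two equivalent forms of the noiseless response, ⟨xi ∘ xi ∘ xi, 𝒯⟩ and Σk ηk (xi⊤βk)3, agree.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K, s, sigma = 2000, 30, 3, 9, 0.5

# Sparse, unit-norm factors beta_k and positive weights eta_k, as in (III.2).
B = np.zeros((p, K))
for k in range(K):
    support = rng.choice(p, size=s, replace=False)
    B[support, k] = rng.standard_normal(s)
B /= np.linalg.norm(B, axis=0)
eta = np.abs(rng.standard_normal(K)) + 1.0

T = np.einsum('k,ik,jk,lk->ijl', eta, B, B, B)

# Symmetric cubic sketchings: y_i = <x_i o x_i o x_i, T> + eps_i.
X = rng.standard_normal((n, p))
y = ((X @ B) ** 3) @ eta + sigma * rng.standard_normal(n)

# Sanity check: <x o x o x, T> = sum_k eta_k (x' beta_k)^3.
y_noiseless = np.einsum('ijl,ni,nj,nl->n', T, X, X, X)
assert np.allclose(y_noiseless, ((X @ B) ** 3) @ eta)
```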
The CP low-rankness has been widely assumed in the literature for its nice scalability and simple formulation [5, 25, 18]. Different from matrix factor analysis, we do not assume that the tensor factors are orthogonal. On the other hand, since low-rank tensor estimation is NP-hard in general [42], we introduce an incoherence condition in the forthcoming Condition 3 to ensure that the correlation among different factors is not too strong. Such a condition has been used in recent literature on tensor data analysis [43], compressed sensing [44], matrix decomposition [45], and dictionary learning [46].
Based on the observations {(yi, xi) : i ∈ [n]}, we propose to estimate 𝒯 by minimizing the empirical squared loss, whose closed-form gradient provides computational convenience,
𝒯̂ ∈ arg min𝒯 Ln(𝒯), | (III.3) |
where
Ln(𝒯) = (2n)−1 Σi=1n (yi − ⟨xi ∘ xi ∘ xi, 𝒯⟩)2. | (III.4) |
Under the factorization (III.2), (III.3) can equivalently be written as
minη,β1,…,βK (2n)−1 Σi=1n (yi − Σk=1K ηk (xi⊤βk)3)2. | (III.5) |
Clearly, (III.5) is a non-convex optimization problem. To solve it, we propose a two-stage method as described in the next two subsections.
A. Initialization
Due to the non-convexity of (III.5), a straightforward implementation of many local search algorithms, such as gradient descent and alternating minimization, may easily get trapped in local optima and result in sub-optimal statistical performance. Inspired by recent advances in spectral methods (e.g., for the EM algorithm [47], phase retrieval [48], and tensor SVD [39]), we propose to compute an initial estimate via the method of moments and sparse tensor decomposition (a variant of the high-order spectral method) in Steps 1 and 2 below, respectively. The pseudo-code is given in Algorithm 1.
Step 1: Unbiased Empirical Moment Estimator.
Construct the empirical moment-based estimator ,
| (III.6) |
where ej denotes the j-th canonical basis vector.
Based on Lemma 4, the estimator in (III.6) is unbiased for 𝒯. Its construction is motivated by the high-order Stein's identity ([49]; see also Theorem 7 for a complete statement). Intuitively speaking, based on the third-order score function of a Gaussian random vector, we can construct an unbiased estimator of 𝒯 by properly choosing a continuously differentiable function in the high-order Stein's identity. See the proof of Lemma 4 for details.
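Since the displayed formula (III.6) is not reproduced above, the sketch below illustrates the construction with the ingredients we can state confidently: for a standard Gaussian x, the third-order score is S3(x) = x ∘ x ∘ x − Σj (x ∘ ej ∘ ej + ej ∘ x ∘ ej + ej ∘ ej ∘ x), and the third-order Stein identity gives E[yi S3(xi)] = 6𝒯 under model (III.1), so averaging yi S3(xi)/6 yields an unbiased estimate of 𝒯 (the paper's (III.6) may arrange the same correction differently, e.g. through the empirical first-order moment as in Lemma 4).

```python
import numpy as np

def score3(x):
    """Third-order score S3(x) of a standard Gaussian vector x (the order-3 Hermite tensor)."""
    p = x.size
    I = np.eye(p)
    xxx = np.einsum('i,j,k->ijk', x, x, x)
    correction = (np.einsum('i,jk->ijk', x, I)
                  + np.einsum('j,ik->ijk', x, I)
                  + np.einsum('k,ij->ijk', x, I))
    return xxx - correction

rng = np.random.default_rng(3)
n, p = 20000, 6
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)
T = np.einsum('i,j,k->ijk', beta, beta, beta)        # rank-1 symmetric target

X = rng.standard_normal((n, p))
y = (X @ beta) ** 3 + 0.1 * rng.standard_normal(n)   # model (III.1) with K = 1

# Method-of-moments estimate: E[y * S3(x)] = 6 * T, so divide the average by 6.
T_tilde = sum(y[i] * score3(X[i]) for i in range(n)) / (6 * n)
print(np.linalg.norm(T_tilde - T))   # the error shrinks as n grows (unbiased estimator)
```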
Step 2: Sparse Tensor Decomposition.
Based on the moment estimator obtained in Step 1, we further obtain a good initialization for the factors via truncation and alternating rank-1 power iterations [27, 50].
Note that the tensor power iterations recover one rank-1 component at a time. To identify all rank-1 components, we generate a large number of different initialization vectors, implement a clustering step, and take the cluster centroids as the initialization-stage estimates. This scheme originally appeared in the tensor decomposition literature [43, 50], although our problem setting and proof techniques are very different. The procedure also differs from the matrix setting: the rank-1 components in the singular value decomposition are mutually orthogonal, whereas we do not enforce exact orthogonality on the βk here.
More specifically, we first choose a large integer M ≫ K and generate M starting vectors through sparse SVD as described in Algorithm 3. Then, for each starting vector, we apply the following truncated power updates for l = 0, 1, …,
where ×2, ×3 are the tensor multiplication operators defined in Section II and Td(x) is a truncation operator that sets all but the d largest entries (in absolute value) of a vector x to zero. It is noteworthy that the symmetry of the moment estimator implies
so the multiplications along different modes coincide. We run the power iterations until convergence and denote the outcome by bm. Finally, we apply K-means to partition {b1, …, bM} into K clusters, take the centroids of the output clusters as the initial factor estimates, and compute the corresponding initial weights.
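For concreteness, here is a sketch of the truncated rank-1 power update with hypothetical helper names (the M random restarts from Algorithm 3 and the K-means clustering step are omitted): Td keeps the d largest-magnitude entries, and each iteration multiplies the tensor along two modes by the current iterate, truncates, and renormalizes.

```python
import numpy as np

def truncate(x, d):
    """T_d(x): keep the d largest entries of x in absolute value, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-d:]
    out[idx] = x[idx]
    return out

def truncated_power_iteration(T, b0, d, n_iter=50):
    """Rank-1 truncated power updates b <- T_d(T x2 b x3 b) / ||.||_2 on a symmetric tensor T."""
    b = b0 / np.linalg.norm(b0)
    for _ in range(n_iter):
        v = np.einsum('ijk,j,k->i', T, b, b)   # multiply along modes 2 and 3
        v = truncate(v, d)
        b = v / np.linalg.norm(v)
    return b

# Toy check on a noiseless sparse rank-1 symmetric tensor.
rng = np.random.default_rng(4)
p, d = 30, 9
beta = np.zeros(p)
beta[:d] = rng.standard_normal(d)
beta /= np.linalg.norm(beta)
T = np.einsum('i,j,k->ijk', beta, beta, beta)
b_hat = truncated_power_iteration(T, rng.standard_normal(p), d)
print(abs(b_hat @ beta))   # close to 1 (recovery up to sign)
```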
B. Thresholded Gradient Descent
After obtaining a warm start in the first stage, we apply thresholded gradient descent to iteratively refine the solution of the non-convex optimization problem (III.5). Specifically, absorb each weight ηk into its factor by rescaling βk by the cube root of ηk, collect the rescaled factors as the columns of B ∈ ℝp×K, and write X = (x1, …, xn) and y = (y1, …, yn)⊤. We let ∇Ln(B) denote the gradient of the empirical loss with respect to B. Based on the detailed calculation in Lemma A.1, ∇Ln(B) can be written as
| (III.7) |
where {(B⊤X)⊤}3 and {(B⊤X)⊤}2 are entry-wise cubic and squared matrices of (B⊤X)⊤. Define φh(x) as the thresholding function with a level h that satisfies the following minimal assumptions:
φh(x) = 0 for all |x| ≤ h, and |φh(x) − x| ≤ h for all x ∈ ℝ. | (III.8) |
Many widely used thresholding schemes, such as hard thresholding Hh(x) = x·I(|x| > h) and soft thresholding Sh(x) = sign(x)·max(|x| − h, 0), satisfy (III.8). With a slight abuse of notation, we define the vector thresholding function φh(x) by applying φh to each entry of x.
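For reference, the two thresholding rules mentioned above can be written as follows (a straightforward sketch; in Algorithm 2 the scalar rule is applied entry-wise to the matrix iterate).

```python
import numpy as np

def hard_threshold(x, h):
    """H_h(x) = x * 1{|x| > h}."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > h, x, 0.0)

def soft_threshold(x, h):
    """S_h(x) = sign(x) * max(|x| - h, 0)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - h, 0.0)

x = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
print(hard_threshold(x, 0.5))   # -> [-2, 0, 0, 0.8, 3]
print(soft_threshold(x, 0.5))   # -> [-1.5, 0, 0, 0.3, 2.5]
```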
The initial estimates η(0) and B(0) are updated by thresholded gradient descent in two steps, summarized in Algorithm 2. Note that only B is updated in Step 3, while η is updated in Step 4 after the updates of B have finished.
Step 3: Updating B via Thresholded Gradient descent.
We update B(t) via thresholded gradient descent,
| (III.9) |
Here,
μ is the step size and serves as an approximation for (see Lemma 15);
- is the thresholding level defined as
Step 4: Updating η via Normalization.
We normalize each column of B(T) and estimate the weight parameter as
| (III.10) |
The final estimator of 𝒯 is 𝒯̂ = Σk=1K η̂k β̂k ∘ β̂k ∘ β̂k.
Remark 1 (Stochastic Thresholded Gradient Descent). Evaluating the gradient (III.7) requires a pass over all n measurements at each iteration and can be computationally intensive for large n or p. To economize the computational cost, a stochastic version of the thresholded gradient descent algorithm can easily be carried out by sampling a subset of the summand functions in (III.7) at each iteration. This accelerates the procedure, especially in large-scale settings. See Section P2 for details.
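The sketch below shows one (mini-batch) thresholded gradient step. It uses the gradient of the squared loss (2n)−1 Σi (yi − Σk (bk⊤xi)3)2 with respect to a factor matrix B whose k-th column absorbs the weight ηk into βk, derived directly from that loss (our normalization; it matches the structure described after (III.7) up to scaling), with soft thresholding as the map φh. Sampling a mini-batch of summands implements the stochastic variant of Remark 1; taking the batch equal to the full sample recovers the update of Step 3.

```python
import numpy as np

def soft_threshold(M, h):
    return np.sign(M) * np.maximum(np.abs(M) - h, 0.0)

def loss_grad(B, X, y):
    """Gradient of (2n)^-1 * sum_i (y_i - sum_k (b_k' x_i)^3)^2 with respect to B (p x K)."""
    n = X.shape[0]
    Z = X @ B                                   # Z[i, k] = b_k' x_i
    r = y - (Z ** 3).sum(axis=1)                # residuals
    # Column k of the gradient: -(3/n) * sum_i r_i * (b_k' x_i)^2 * x_i.
    return -(3.0 / n) * X.T @ (r[:, None] * Z ** 2)

def thresholded_sgd_step(B, X, y, mu, h, batch, rng):
    """One stochastic thresholded gradient update on a random mini-batch (Remark 1)."""
    idx = rng.choice(X.shape[0], size=batch, replace=False)
    G = loss_grad(B, X[idx], y[idx])
    return soft_threshold(B - mu * G, h)
```

Here X stacks the sketching vectors as rows; the step size mu and threshold h stand in for the schedule of Algorithm 2, which is not reproduced here.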
IV. Theoretical Analysis
In this section, we establish the geometric convergence rate in optimization error and minimax optimal rate in statistical error of the proposed symmetric tensor estimator.
A. Assumptions
We first introduce the assumptions for theoretical analysis. Conditions 1–3 are on the true tensor parameter and Conditions 4–5 are on the measurement scheme. Specifically, the first condition ensures the model identifiability for CP-decomposition.
Condition 1 (Uniqueness of CP-decomposition). The CP-decomposition in (III.2) is unique in the sense that if there exists another CP-decomposition of 𝒯, it must have K = K′ components and coincide with (III.2) up to a permutation of {1,…,K}.
For technical purposes, we introduce the following conditions to regularize the CP-decomposition of . Similar assumptions were imposed in recent tensor literature, e.g., [3, 27] and Assumption 1.1 (A4) [51].
Condition 2 (Parameter space). The CP-decomposition satisfies
| (IV.1) |
for some absolute constants C, C′, where ηmax ≔ maxk∈[K] ηk and ηmin ≔ mink∈[K] ηk. Recall that s is the sparsity of the factors βk.
Remark 2. In Condition 2, R plays a role similar to a "condition number." This assumption means that the tensor 𝒯 is "well-conditioned," i.e., each rank-1 component is of roughly the same size.
As shown in the seminal work of [42], the estimation of low-rank tensors can be NP-hard in general. Hence, we impose the following incoherence condition.
Condition 3 (Parameter incoherence). The true tensor components are incoherent such that
where R is the singular value ratio defined in (IV.1) and C″ is some small constant.
Remark 3. The preceding incoherence condition has been widely used in various scenarios in recent high-dimensional research, such as tensor decomposition [27, 50], compressed sensing [44], matrix decomposition [45], and dictionary learning [46]. It can also be viewed as a relaxation of orthogonality: if the βk are mutually orthogonal, Γ equals zero. We show both theoretically (Lemma 28 in the supplementary materials) and by simulation (Section VII) that the low-rank tensor induced by (III.2) satisfies the incoherence condition with high probability if the component vectors are randomly generated, say from a Gaussian distribution.
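Since the displayed formula of Condition 3 is not reproduced above, the sketch below takes Γ to be the largest absolute inner product between distinct factors, which is our reading of the condition and is consistent with Remark 3 (Γ = 0 for orthogonal factors); it then checks empirically that randomly generated sparse unit factors are nearly incoherent, as in the left panel of Figure 5.

```python
import numpy as np

def incoherence(B):
    """Gamma = max_{j != k} |<beta_j, beta_k>| for unit-norm columns of B."""
    G = np.abs(B.T @ B)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(5)
p, K, s = 200, 3, 60
B = np.zeros((p, K))
for k in range(K):
    supp = rng.choice(p, size=s, replace=False)
    B[supp, k] = rng.standard_normal(s)
B /= np.linalg.norm(B, axis=0)

print(incoherence(B))   # small for random sparse factors; see the simulation in Section VII
```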
We also introduce the following conditions on noise distribution.
Condition 4 (Sub-exponential noise). The noise terms ϵ1, …, ϵn are i.i.d. with mean 0 and variance σ2; each ϵi is sub-exponentially distributed, i.e., there exists a constant Cϵ > 0 such that ‖ϵi‖ψ1 ≤ Cϵ, and ϵi is independent of xi.
The sample complexity condition is crucial for our algorithm especially in the initialization stage. Ignoring any polylog factors, Condition 5 is even weaker than the sparse matrix estimation case (n ≳ s2) in [48].
Condition 5 (Sample complexity). The sample size satisfies n ≥ C(log n)3(s log p)3/2.
B. Main Theoretical Results
Our main Theorem 1 shows that based on a proper initializer, the output of the proposed procedure can achieve optimal estimation error rate after a sufficient number of iterations. Here, we define the contraction parameter
and also denote and for some C0 > 0.
Theorem 1 (Statistical and Optimization Errors). Suppose Conditions 3–5 hold, , and the initial estimator satisfy
| (IV.2) |
with probability at least . Assume the step size μ ≤ μ0, where μ0 is defined in (A.14). Then, the output of the thresholded gradient descent update in (III.9) satisfies:
- For any t = 0, 1, 2, …, the factor-wise estimator satisfies
| (IV.3) |
with probability at least .
- When the total number of iterations is no smaller than
| (IV.4) |
there exists a constant C1 (independent of K, s, p, n, σ2) such that the final estimator satisfies
| (IV.5) |
with probability at least .
Remark 4. The error bound (IV.3) can be decomposed into an optimization error (which decays at a geometric rate over iterations) and a statistical error (which does not decay over iterations). In the special case σ = 0, the proposed procedure exactly recovers 𝒯 with high probability.
The next theorem shows that Steps 1 and 2 of Algorithm 1 provide a good initializer as required in Theorem 1.
Theorem 2 (Initialization Error). Recall . Suppose the number of initializations , where γ is a constant defined in (A.11). Given that Conditions 1–4 hold, the initial estimator obtained from Steps 1–2 with a truncation level s ≤ d ≤ Cs satisfies
| (IV.6) |
and
with probability at least 1 − 5/n, where
| (IV.7) |
Moreover, if the sample complexity condition 5 holds, then the above bound satisfies (IV.2).
Remark 5 (Interpretation of initialization error). The upper bound in (IV.6) consists of two terms, which correspond to the approximation error of the moment estimator and the incoherence among the βk's, respectively. In particular, the former converges to zero as n grows, while the latter does not.
The proofs of Theorems 1 and 2 are postponed to Sections C–D in the supplementary materials. Combining Theorems 1 and 2 immediately yields the following upper bound for the final estimator, which is one of the main results of this paper.
Theorem 3 (Upper Bound). Suppose Conditions 1 – 5 hold, s ≤ d ≤ Cs. After T* iterations, there exists a constant C1 not depending on K,s,p,n,σ2, such that the proposed procedure yields
| (IV.8) |
with probability at least , where T* is defined in (IV.4).
The above upper bound turns out to match the minimax lower bound for a large class of sparse and low-rank tensors.
Theorem 4 (Lower Bound). Consider the following class of sparse and low-rank tensors,
| (IV.9) |
Suppose that the sketching vectors are i.i.d. standard normal cubic sketchings with i.i.d. N(0, σ2) noise in (III.1), p ≥ 20s, and s ≥ 4. We have the following lower bound result,
The proof of Theorem 4 is deferred to Section E in the supplementary materials. Combining Theorems 3 and 4, we immediately obtain the following minimax-optimal rate for sparse and low-rank tensor estimation with cubic sketchings when logp ≍ log(p/s):
| (IV.10) |
The rate in (IV.10) sheds light upon the effect of dimension p, noise level σ2, sparsity s, sample size n and rank K to the estimation performance.
Remark 6. Recently, Li, Haupt, and Woodruff [29] studied optimal sketching for low-rank tensor regression and gave a near-optimal sketching complexity with a sharp (1 + ε) worst-case error bound. Different from the framework of [29], which focuses on a deterministic setting, we study a probabilistic model with random observation noise, propose a new algorithm, and establish the minimax optimal rate for the estimation error. In addition, [5, 16, 17] considered different types of convex/non-convex algorithms for low-rank tensor regression under statistical assumptions. To the best of our knowledge, we are the first to achieve an optimal estimation error rate with a polynomial-time algorithm for the tensor regression problem.
Remark 7 (Non-sparse low-rank tensor estimation via cubic-sketchings). When the low-rank tensor is not necessarily sparse, i.e.,
we can apply the proposed procedure with all the truncation/thresholding steps removed. If , we can use similar arguments of Theorems 1–3 to show that the estimator satisfies
| (IV.11) |
for any with high probability. Furthermore, similar arguments of Theorem 4 imply that the rate in (IV.11) is minimax optimal.
Remark 8 (Comparison with existing matrix results). Our cubic sketching tensor results are far more than extensions of existing matrix ones. For example, [32, 33] studied low-rank matrix recovery via rank-one projections and proposed convex nuclear norm minimization methods. The theoretical properties of their estimators are analyzed under a Restricted Isometry Property (RIP) or Restricted Uniform Boundedness (RUB) condition. However, the tensor nuclear norm is computationally infeasible, and one can check that our cubic sketching framework does not satisfy the RIP or RUB conditions in general, following the arguments in [48, 52]. Thus, these previous results cannot be directly applied.
In addition, the analysis of the gradient updates in the tensor case is significantly more complicated than in the matrix case. First, it requires high-order concentration inequalities, since the cubic-sketching tensor leads to high-order products of sub-Gaussian random variables (see Section IV-C for details). The necessity of high-order expansions in the analysis of the gradient updates also significantly increases the hardness of the problem. To ensure geometric convergence, we need a much more subtle analysis compared with the matrix case [52].
C. Key Lemmas: High-order Concentration Inequalities
As mentioned earlier, one major challenge in the theoretical analysis of cubic sketching is to handle the heavy tails of high-order Gaussian moments: directly applying Hoeffding's or Bernstein's concentration inequalities can only handle up to second moments of sub-Gaussian random variables. Therefore, we develop the following high-order concentration inequalities as technical tools: Lemma 1 characterizes tail bounds for sums of sub-Gaussian products, and Lemma 2 provides concentration inequalities for Gaussian cubic sketchings. The proofs of Lemmas 1 and 2 are given in Section B.
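A quick Monte Carlo comparison (illustrative only) shows why the products arising from cubic sketchings fall outside the sub-Gaussian/sub-exponential regime: the tails of a product of three independent standard Gaussians are much heavier than those of a single Gaussian, which is exactly the ψ2/3-type behavior that Lemma 1 controls after summation.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
g1, g2, g3 = rng.standard_normal((3, n))

single = np.abs(g1)
product = np.abs(g1 * g2 * g3)

for t in (3.0, 5.0, 8.0):
    print(t, (single > t).mean(), (product > t).mean())
# The product exceeds large thresholds far more often than a single Gaussian,
# i.e., its tails are heavier than sub-Gaussian (bounded psi_{2/3}-norm rather than psi_2).
```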
Lemma 1 (Concentration inequality for sums of sub-Gaussian products). Suppose X1, …, Xn are n i.i.d. random matrices and that xij, the j-th row of Xi, is an isotropic sub-Gaussian vector, i.e., 𝔼xij = 0 and Cov(xij) = I. Then for any fixed vectors and any 0 < δ < 1, we have
with probability at least 1 – δ for some constant C.
Note that in Lemma 1, each Xi need not have independent entries, even though X1, …, Xn are independent matrices. Building on Lemma 1, Lemma 2 provides a generic spectral-type concentration inequality that can be used to quantify the approximation error of the moment estimator introduced in Step 1 of the proposed procedure.
Lemma 2 (Concentration inequality for Gaussian cubic sketchings). Suppose , , , , , are fixed vectors.
- Define . Then and
with probability at least 1 − 10/n3 − 1/p. - Define . Then and
with probability at least 1 − 10/n3 − 1/p.
Here, C is an absolute constant and ‖ · ‖s is the sparse tensor spectral norm defined in (II.3).
V. Application To High-Order Interaction Effect Models
In this section, we study the high-order interaction effect model in the cubic sketching framework. Specifically, we consider the following three-way interaction model
yl = μ + Σj ξj zlj + Σj≤k γjk zlj zlk + Σj≤k≤m ηjkm zlj zlk zlm + ϵl | (V.1) |
for l = 1, …, n. Here ξ, γ, and η are the coefficients of the main effects, pairwise interactions, and triple-wise interactions, respectively. More importantly, (V.1) can be reformulated in the following tensor form (see also the left panel of Figure 1),
| (V.2) |
where xl = (1, zl⊤)⊤ and 𝒯 is a tensor parameter corresponding to the coefficients in the following way:
| (V.3) |
We provide the following justification for assuming that the tensorized coefficient 𝒯 is low-rank and sparse. First, in modern applications such as biomedical research [53], the response is often driven by a small portion of the coefficients and a small number of factors, leading to a highly entry-wise sparse and low-rank 𝒯. Second, [54] suggested that it is suitable to model entry-wise sparse tensors of low enough rank as arising from sparse loadings. Therefore, we assume 𝒯 is of CP rank K with s-sparse factors βk ∈ ℝp+1,
where K, s ≪ p. Then the number of parameters in (V.4), K(p + 1), is significantly smaller than (p + 1)3, the total number of parameters in the original three-way interaction effect model (V.1), which makes consistent estimation of 𝒯 possible in the high-dimensional case. In this case, (V.2) can be written as
yl = Σk=1K ηk (βk⊤xl)3 + ϵl, | (V.4) |
where l ∈ [n], ‖βk‖2 = 1, ‖βk‖0 ≤ s,k ∈ [K].
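The reformulation can be sanity-checked numerically: with xl = (1, zl⊤)⊤, the cubic form ηk(βk⊤xl)3 expands into an intercept, main effects, pairwise interactions, and triple-wise interactions of zl. The sketch below verifies this expansion for a single rank-1 component (the explicit coefficient mapping (V.3) is not reproduced here, so only the polynomial expansion is checked).

```python
import numpy as np

rng = np.random.default_rng(7)
p = 10
beta = rng.standard_normal(p + 1)
beta /= np.linalg.norm(beta)            # unit-norm factor; the first entry multiplies the intercept
eta = 2.0

z = rng.standard_normal(p)
x = np.concatenate(([1.0], z))          # x_l = (1, z_l')'

# Cubic-sketching form of the response (one rank-1 component, no noise).
y_tensor = eta * (x @ beta) ** 3

# The same response written out by expanding the cube: intercept, main, pairwise, triple-wise.
b0, b = beta[0], beta[1:]
u = z @ b
y_interaction = eta * (b0 ** 3 + 3 * b0 ** 2 * u + 3 * b0 * u ** 2 + u ** 3)

assert np.isclose(y_tensor, y_interaction)
```

The terms 3·b0²·u, 3·b0·u², and u³ collect the main-effect, pairwise, and triple-wise contributions of z, respectively.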
By assuming that z1, …, zn are i.i.d. standard Gaussian vectors, the high-order interaction effect model (V.2) reduces to the symmetric tensor estimation model (III.1), except for one slight difference: the first coordinate of xl, i.e., the intercept, is always 1. To accommodate this difference, we only need to adjust the initial unbiased estimate in the two-step procedure above. Let
| (V.5) |
where . Then we construct the empirical moment-based initial tensor Ts′ as
For i,j,k ≠ 0, , , , and .
For i ≠ 0, .
.
Lemma 5 shows that 𝒯s′ is an unbiased estimator of 𝒯.
The theoretical results in Section IV imply the following upper and lower bounds for the three-way interaction effect estimation.
Corollary 1. Suppose z1, …, zn are i.i.d. standard Gaussian random vectors and 𝒯 satisfies Conditions 1, 2, and 3. The output 𝒯̂ of the proposed Algorithms 1 and 2, applied with the initializer 𝒯s′, satisfies
| (V.6) |
with high probability. On the other hand, consider the following class of tensorized interaction coefficients,
Then the following lower bound holds,
VI. Non-Symmetric Tensor Estimation Model
In this section, we extend the previous results to the non-symmetric tensor case. Specifically, suppose 𝒯 ∈ ℝp1×p2×p3 and
yi = ⟨ui ∘ vi ∘ wi, 𝒯⟩ + ϵi, i = 1, …, n, | (VI.1) |
where ui ∈ ℝp1, vi ∈ ℝp2, wi ∈ ℝp3 are random vectors with i.i.d. standard normal entries. Again, we assume 𝒯 is sparse and low-rank, in the sense that
𝒯 = Σk=1K ηk β1k ∘ β2k ∘ β3k, ‖βjk‖2 = 1, ‖βjk‖0 ≤ s, j = 1, 2, 3, k ∈ [K]. | (VI.2) |
Denote
B1 = (β11, ⋯, β1K), B2 = (β21, ⋯, β2K), B3 = (β31, ⋯, β3K),
U = (u1,…,un), V = (v1,…,vn), W = (w1,…,wn), η = (η1,…,ηK)⊤, y = (y1,…,yn)⊤.
Then, the empirical risk function can be written compactly as
| (VI.3) |
Since (VI.3) is non-convex but fortunately tri-convex in terms of B1, B2, and B3, we develop a block-wise thresholded gradient descent algorithm as detailed below. The complete algorithm is deferred to Section O1 in the supplementary materials.
Step 1: (Method of Tensor Moments)
Construct the empirical moment-based estimator
| (VI.4) |
to which sparse tensor decomposition is applied for initialization.
Step 2: (Block-wise Gradient Descent)
Lemma 17 shows that the gradient function for (VI.3) with respect to B1 can be written as
| (VI.5) |
where and . For t = 1, …, T, we fix , and update via block-wise thresholded gradient descent,
where , μ is the step size, and . The updates of B2,B3 are similar.
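Since the display (VI.5) is not reproduced above, the sketch below writes the mode-1 gradient for the squared loss (2n)−1 Σi (yi − Σk (ui⊤b1k)(vi⊤b2k)(wi⊤b3k))2 derived directly from that loss (our normalization; the exact scaling in Lemma 17 may differ), followed by one thresholded block update on B1 with B2 and B3 held fixed. The updates of the other two blocks are symmetric in the roles of (B1, B2, B3).

```python
import numpy as np

def grad_B1(B1, B2, B3, U, V, W, y):
    """Gradient of (2n)^-1 sum_i (y_i - sum_k (u_i'b1k)(v_i'b2k)(w_i'b3k))^2 w.r.t. B1."""
    n = U.shape[0]
    A, C, D = U @ B1, V @ B2, W @ B3            # n x K factor projections
    r = y - (A * C * D).sum(axis=1)             # residuals
    # Column k of the gradient: -(1/n) * sum_i r_i * (v_i'b2k)(w_i'b3k) * u_i.
    return -(1.0 / n) * U.T @ (r[:, None] * C * D)

def block_update_B1(B1, B2, B3, U, V, W, y, mu, h):
    """One thresholded gradient step on B1 with B2, B3 held fixed."""
    Bnew = B1 - mu * grad_B1(B1, B2, B3, U, V, W, y)
    return np.sign(Bnew) * np.maximum(np.abs(Bnew) - h, 0.0)   # soft threshold
```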
The theoretical analysis for the non-symmetric case differs from the symmetric one in two respects. First, the non-symmetric cubic sketching tensor is formed by three Gaussian vectors rather than one, which leads to many differences in the calculation of high-order moments. Second, the CP-decomposition of the non-symmetric tensor (VI.2) yields a tri-convex optimization problem, so the standard convex analysis for vanilla gradient descent [55] can be applied given a proper initialization.
With the regularity conditions detailed in Section O1, we present the theoretical results for non-symmetric tensor estimation as follows.
Theorem 5 (Upper Bound). Suppose Conditions 6 – 9 hold and n ≳ (slog(p0/s))3/2, where p0 = max{p1,p2,p3}. For any t = 0,1,2,…, the output of Algorithm O1 satisfies
for some 0 < κ < 1. When the total number of iterations is no smaller than , the final estimator satisfies
Theorem 6 (Lower Bound). Consider the class of incoherent sparse and low-rank tensors. If the cubic sketchings are i.i.d. standard normal with i.i.d. N(0, σ2) noise, min{p1, p2, p3} ≥ 20s, and s ≥ 4, we have
| (VI.6) |
Theorems 5 and 6 imply that the proposed algorithm achieves a minimax-optimal rate of estimation error in the class of as long as log(p0) ≍ log(p0/s).
VII. Numerical Results
In this section, we investigate the effect of noise level, CP-rank, sample size, dimension, and sparsity on the estimation performance by simulation studies. We also investigate the numerical performance of the proposed algorithm when the incoherence assumption required in the theoretical analysis fails to hold.
In each setting, we generate 𝒯 = Σk=1K ηk βk ∘ βk ∘ βk, where the support of each βk is uniformly selected from {1, …, p} and the nonzero entries of βk are drawn from the standard normal distribution; each βk is then normalized to a unit vector. The cubic sketchings are generated from xi with i.i.d. standard normal entries, and the noise is Gaussian with variance σ2 (σ = 0 in the noiseless case). Additionally, we adopt the following stopping rules: (1) the initialization iteration (Step 2 in Algorithm 1) is stopped once successive iterates change by less than a fixed tolerance; (2) the gradient update iteration (Step 3 in Algorithm 2) is stopped if ‖B(T+1) − B(T)‖F ≤ 10−6. The numerical results are based on 200 repetitions unless otherwise specified. The code was written in R and implemented on an Intel Xeon-E5 processor with 64 GB of RAM.
First, we consider the percentage of successful recoveries in the noiseless case. Let K = 3, s/p = 0.3, and p = 30 or 50, so that the total number of unknown parameters in 𝒯 is 2.7 × 104 or 1.25 × 105. The sample size n ranges from 500 to 6000. A recovery is called "successful" if the relative error ‖𝒯̂ − 𝒯‖F/‖𝒯‖F falls below a small threshold. We report the average successful recovery rate in Figure 2, from which we can see that the empirical relation among successful recovery, dimension, and sample size is consistent with the theoretical results in Section IV.
We then move to the noisy case, with K = 3 and p ∈ {30, 50}. We consider two scenarios: (1) sample size n = 6000, 8000, or 10000, s/p = 0.3, with the noise level σ varying from 0 to 200; (2) noise level σ = 200, sample size n varying from 4000 to 10000, p = 30, and s/p = 0.1, 0.3, 0.5. The estimation errors in these two scenarios are plotted in Figures 3 and 4, respectively. These results show that the proposed procedure performs well: Algorithms 1 and 2 yield more accurate estimates for smaller noise variance σ2 and/or larger sample size n.
Fig. 3. Estimation error under different noise levels. Left panel: p = 30; right panel: p = 50.
Fig. 4. Estimation error under different dimension/sample ratios (n/p3). Left panel: initial estimation error; right panel: final estimation error.
Next, we demonstrate that a low-rank tensor parameter with randomly generated factors satisfies the incoherence Condition 3 with high probability. Set the CP-rank K = 3 and the sparsity level s/p = 0.3, with the dimension p ranging from 10 to 2000. We compute the incoherence parameter Γ defined in Condition 3. The left panel of Figure 5 shows that Γ decays at a polynomial rate as s grows, which matches the bound in Condition 3. A theoretical justification of this point is also provided in Lemma 28.
Fig. 5. Left panel: incoherence parameter Γ with varying sparsity (the red line corresponds to the rate required in the theoretical analysis). Right panel: average relative estimation error for tensors with varying incoherence.
We further examine the performance of the proposed algorithm when the incoherence condition required in the theoretical analysis fails to hold. Specifically, we set the CP-rank K = 3, p = 30, and the sparsity level s/p = 0.3. We construct a large number of copies of the tensor parameter with i.i.d. standard normal factor vectors. For each copy, we calculate the incoherence Γj defined in Condition 3 and then manually pick 40 tensors such that
In this way, we obtain a set of tensor parameters with incoherence varying uniformly from 0 to 0.4. The right panel of Figure 5 plots the relative error for estimating each such tensor from observations of its cubic sketchings, based on 1000 repetitions. We can see that the proposed algorithm achieves small relative errors even when the true factors are highly coherent.
Moreover, we consider a setting with Laplace noise, i.e., the noise follows a Laplace distribution with noise level σ. With n = 3000, p = 30, and varying values of σ, the average estimation error and its comparison with the Gaussian noise setting are provided in Figure 6. We note that the estimation errors under Laplace noise are slightly higher than those under Gaussian noise.
Fig. 6. Comparison of estimation errors under Laplace and Gaussian noise.
We also compare the estimation errors of the initial and final estimators for different ranks and sample sizes. Set K = 3, p = 30, s/p = 0.3 and consider the noiseless setting. It is clear from Figure 7 that the initialization error decays substantially but does not converge to zero as the sample size n grows. This matches our theoretical findings in Theorem 2: as discussed in Remark 5, the initialization stage may yield an inconsistent estimator due to the incoherence among the βk's. From the right panel of Figure 7, we can see that the final estimator is more stable and accurate than the initial one, which illustrates the merit of the thresholded gradient descent step of the proposed procedure.
Fig. 7. Log relative estimation error of the initial estimator (left panel) and of the initial/final estimators (right panel).
Finally, we compare the performance of the proposed method with the alternating least squares (ALS)-based tensor regression method [3]. We consider two schemes for the initialization of ALS: (a) i.i.d. standard Gaussian initial factors (cold start), and (b) initial factors generated from the proposed Algorithm 1 (warm start). Setting K = 2, s/p = 0.2, and p = 30, we apply both the proposed procedure and the ALS-based algorithm and record the average estimation errors with standard deviations for both the initial and final estimators. From the results in Table I, one can see that the proposed algorithm significantly outperforms ALS under both the cold and warm start schemes. The main reason is pointed out in Remark 8: the cubic sketching setting possesses distinct aspects compared with the i.i.d. random Gaussian sketching setting, so the method proposed by [3] does not exactly fit here.
VIII. Discussions
This paper focuses on the third order tensor estimation via cubic sketchings. Moreover, all results can be extended to the higher-order case via high-order sketchings. To be specific, suppose
where is an order-m, sparse, and low-rank tensor. In order to estimate based on , one can first construct the order-m moment-based estimator using a generalized version of Theorem 7 and the fact that the score functions for the density function p(x) satisfy a nice recursive equation:
Then, one can similarly perform high-order sparse tensor decomposition and thresholded gradient descent to estimate . On the theoretical side, we can show if mild conditions hold and n ≥ C(logn)m(slogp)m/2, the proposed procedure achieves
with high probability. The minimax optimality can be shown similarly.
Fig. 2. Successful recovery rate with varying sample size.
TABLE I.
Estimation Error and Standard Deviation (in Parentheses) of the Proposed Method and ALS-Based Method
| Sample size | ours | warm start | cold start | initial |
|---|---|---|---|---|
| n = 4000 | 4.02 (0.13) | 32.82 (1.79) | 37.78 (1.23) | 38.03 (1.74) |
| n = 5000 | 4.02 (0.13) | 32.34 (2.34) | 36.96 (2.10) | 33.71 (1.78) |
| n = 6000 | 1.77 (0.09) | 22.22 (1.21) | 59.97 (3.40) | 25.57 (1.48) |
Acknowledgment
Guang Cheng would like to acknowledge support by NSF DMS-1712907, DMS-1811812, DMS-1821183, and Office of Naval Research (ONR N00014- 18-2759). While completing this work, Guang Cheng was a member of Institute for Advanced Study, Princeton and visiting Fellow of SAMSI for the Deep Learning Program in the Fall of 2019; he would like to thank both Institutes for their hospitality. Anru Zhang would like to acknowledge support by NSF CAREER-1944904, NSF DMS-1811868, and NIH R01 GM131399.
Biography
Botao Hao received B.S. degree from the School of Mathematics, Nankai University, China, in 2014, and Ph.D. from the Department of Statistics, Purdue University, USA, 2019. He is currently a postdoctoral researcher in Department of Electrical Engineering, Princeton University, USA.
Anru Zhang received the Ph.D. degree from University of Pennsylvania, Philadelphia, PA, in 2015, and B.S. degree from Peking University, Beijing, China, in 2010. He is currently an assistant professor in Statistics at the University of Wisconsin-Madison, Madison, WI. His current research interests include high-dimensional statistical inference, tensor data analysis, statistical learning theory, dimension reduction, and convex/non-convex optimization.
Guang Cheng received BA degree in Economics from Tsinghua University, China, in 2002, and PhD degree from University of Wisconsin–Madison in 2006. He then joined Dept of Statistics at Duke University as Visiting Assistant Professor and Postdoc Fellow in SAMSI. He is currently Professor in Statistics at Purdue University, directing Big Data Theory research group, whose main goal is to develop computationally efficient inferential tools for big data with statistical guarantees.
Appendix
This appendix contains five parts: (1) Sections A–B provide detailed proofs for the empirical moment estimator and concentration results; (2) Sections C–N provide additional proofs for the main theoretical results of this paper; (3) Section O covers the pseudo-code, conditions, and main proofs of non-symmetric tensor estimation; (4) Section P discusses the matrix form of the gradient function and stochastic gradient descent; (5) Section Q provides several technical lemmas and their proofs.
A. Moment Calculation
We first introduce three lemmas showing that the empirical moment-based tensors (III.6), (V.5), and (VI.4) are all unbiased estimators of the target low-rank tensor in the corresponding scenarios. Detailed proofs of the three lemmas are postponed to Sections G1, G2, and G3 in the supplementary materials.
Lemma 3 (Unbiasedness of moment estimator under non-symmetric sketchings). For non-symmetric tensor estimation model (VI.1) & (VI.2), define the empirical moment-based tensor by
Then is an unbiased estimator for , i.e.,
The extension to the symmetric case is non-trivial due to the dependency among the three identical sketching vectors. We borrow the idea of the high-order Stein's identity, which was originally proposed in [49]. To fix ideas, we present only the third-order result for simplicity; the extension to higher orders is straightforward.
Theorem 7 (Third-order Stein’s Identity, [49]). Let be a random vector with joint density function p(x). Define the third order score function as . Then for continuously differentiable function , we have
| (A.1) |
In general, the order-m high-order score function is defined as
Interestingly, the high-order score function has a recursive differential representation
| (A.2) |
with S1(x) = −∇x log p(x). This recursive form is helpful for constructing an unbiased tensor estimator under symmetric cubic sketchings. Note that the first-order score function is the same as the score function in Lemma 26 (Stein's lemma [56]). The proof of Theorem 7 relies on iteratively applying the recursive representation of the score function (A.2) and the first-order Stein's lemma (Lemma 26). We provide the detailed proof in Section F for the sake of completeness.
In particular, if x is a standard Gaussian vector, the score function of each order can be calculated from (A.2) as follows,
S1(x) = x, S2(x) = x ∘ x − Ip, S3(x) = x ∘ x ∘ x − Σj=1p (x ∘ ej ∘ ej + ej ∘ x ∘ ej + ej ∘ ej ∘ x). | (A.3) |
Interestingly, if we let , then
| (A.4) |
which is exactly . Connecting this fact with (A.1), we are able to construct the unbiased estimator in the following lemma through high-order Stein’s identity.
Lemma 4 (Unbiasedness of moment estimator under symmetric sketchings). Consider the symmetric tensor estimation model (III.1) & (IV.9). Define the empirical first-order moment . If we further define an empirical third-order-moment-based tensor by
then
Proof. Note that yi = G(xi) + ϵi. Then we have
where is defined in (A.3). By using the conclusion in Theorem 7 and the fact (A.4), we obtain
since ϵi is independent of xi. This ends the proof. ■
Although the interaction effect model (V.1) is still based on symmetric sketchings, we need much more careful construction for the moment-based estimator, since the first coordinate of the sketching vector is always constant 1. We give such an estimator in the following lemma.
Lemma 5 (Unbiasedness of moment estimator in interaction model). For the interaction effect model (V.1), construct the empirical moment-based tensor as follows:
For i, j, k ≠ 0, . And , , .
For i ≠ 0, .
Then 𝒯s′ is an unbiased estimator of 𝒯, i.e.,
B. Proofs of Lemmas 1 and 2: Concentration Inequalities
We aim to prove Lemmas 1 and 2 in this subsection. These two lemmas provide the key concentration inequalities for the theoretical analysis of the main results. Before going into technical details, we introduce a quasi-norm called the ψα-norm.
Definition 1 (ψα-norm [34]). The ψα-norm of any random variable X and α > 0 is defined as
In particular, a random variable with bounded ψ2-norm or bounded ψ1-norm is called sub-Gaussian or sub-exponential, respectively. The next lemma provides an upper bound for the p-th moment of a sum of random variables with bounded ψα-norm.
Lemma 6. Suppose X1,…,Xn are n independent random variables satisfying with α > 0, then for all and p ≥ 2,
| (A.5) |
where 1/α* + 1/α = 1, C1(α),C2(α) are some absolute constants only depending on α.
If 0 < α < 1, (A.5) is a combination of Theorem 6.2 in [57] and the fact that the p-th moment of a Weibull variable with parameter α is of order p1/α. If α ≥ 1, (A.5) follows from a combination of Corollaries 2.9 and 2.10 in [58]. Continuing with standard symmetrization arguments, we reach the conclusion for general random variables. When α = 1 or 2, (A.5) coincides with standard moment bounds for a sum of sub-Gaussian and sub-exponential random variables in [59]. The detailed proof of Lemma 6 is postponed to Section H.
When 0 < α < 1, by Chebyshev’s inequality, one can obtain the following exponential tail bound for the sum of random variables with bounded ψα-norm. This lemma generalizes the Hoeffding-type concentration inequality for sub-Gaussian random variables (see, e.g. Proposition 5.10 in [59]), and Bernstein-type concentration inequality for sub-exponential random variables (see, e.g. Proposition 5.16 in [59]).
Lemma 7. Suppose 0 < α < 1, X1,…,Xn are independent random variables satisfying . Then there exists absolute constant C(α) only depending on α such that for any and 0 < δ < 1/e2,
with probability at least 1 − δ.
Proof. For any t > 0, by Markov’s inequality,
where the last inequality is from Lemma 6. We set t such that . Then for p ≥ 2,
holds with probability at least 1 − exp(−p). Letting δ = exp(−p), we have that for any 0 < δ < 1/e2,
holds with probability at least 1 − δ. This ends the proof. ■
The next lemma provides an upper bound for the product of random variables in ψα-norm.
Lemma 8 (ψα for product of random variables). Suppose X1,…,Xm are m random variables (not necessarily independent) with ψα-norm bounded by . Then the ψα/m-norm of is bounded as
Proof. For any and α > 0, by using the inequality of arithmetic and geometric means we have
Since exponential function is a monotone increasing function, it shows that
| (A.6) |
From the definition of ψα-norm, for j = 1,2,…,m, each individual Xj has
| (A.7) |
Putting (A.6) and (A.7) together, we obtain
Therefore, we conclude that the ψα/m-norm of is bounded by .
Proof of Lemma 1. Note that for any j = 1,2,…,m, the ψ2-norm of is bounded by ‖βj‖2 [59]. According to Lemma 8, the ψ2/m-norm of is bounded by . Directly applying Lemma 7, we reach the conclusion.
Proof of Lemma 2. We first focus on the non-symmetric version and the proof follows three steps:
Truncate the first coordinate of x1i, x2i, x3i by a carefully chosen truncation level;
Utilize the high-order concentration inequality in Lemma 20 at order three;
Show that the bias caused by truncation is negligible.
With a slight abuse of notation, we denote by a, x, y, etc. the first coordinates of a, x, y, etc. Without loss of generality, we assume p ≔ max{p1, p2, p3}. By unitary invariance, we may assume β1 = β2 = β3 = e1, where e1 = (1, 0, …, 0)⊤.
Then, it is equivalent to prove
Suppose and are n independent samples of {x1, x2, x3}. And define a bounded event for the first coordinate and its corresponding population version,
where M is a large constant to be specified later. The quantity of interest can be upper bounded by M1 + M2, where
and
We will show that M2 is negligible relative to the convergence rate of M1.
Bounding M1. For simplicity, we define , , and are n independent samples of . According to the law of total probability, we have
where
According to Lemma 22, the entry of are sub-Gaussian random variable with ψ2-norm M2. Applying Lemma 20, we obtain
where δn,s = ((slog(p/s))3/n2)1/2 + (slog(p/s)/n)1/2.
On the other hand,
Putting the above bounds together, we obtain
By setting , the bound of M1 reduces to
| (A.8) |
Bounding M2. From the definitions of M2 and sparse spectral norm,
where
Since x1j is independent of x1k for any j ≠ k, . Similar results hold for x2,x3. Then we have
By the basic property of Gaussian random variable, we can show
Plugging them into M2, we have
where the last inequality holds for a large M > 0. By the choice of , we have for some constant C2. When n is large, this rate is negligible comparing with (A.8)
Bounding M: We put the upper bounds of M1 and M2 together. After some adjustments for absolute constant, it suffices to obtain
with probability at least 1 − 10/n3 − 1/p. This concludes the proof of non-symmetric part. The proof of symmetric part remains similar and thus is omitted here. ■
C. Proof of Theorem 2: Initialization Effect
Theorem 2 gives an upper bound on the approximation error of the sparse-tensor-decomposition-based initial estimator. In Step 1 of Section III-A, the original problem can be reformulated as a version of tensor denoising:
| (A.9) |
The key difference between our model (A.9) and the recent works [50, 27] is that E arises from the empirical moment approximation, rather than from the random observation noise considered in [50] and [27]. The next lemma gives an upper bound for this approximation error. The proof of Lemma 9 is deferred to Section I.
Lemma 9 (Approximation error of ). Recall that , where is defined in (III.6). Suppose Condition 4 is satisfied and s ≤ d ≤ Cs. Then
| (A.10) |
with probability at least 1 − 5/n for some uniform constant C1.
Next we denote the following quantity for simplicity,
| (A.11) |
where R is the singular value ratio, K is the CP-rank, s is the sparsity parameter, Γ is the incoherence parameter and C2 is uniform constant.
Next lemma provides theoretical guarantees for sparse tensor decomposition method.
Lemma 10. Suppose that the symmetric tensor denoising model (A.9) satisfies Conditions 1, 2 and 3 (i.e., the identifiability, parameter space and incoherence). Assume the number of initializations and the number of iterations for constants C3,C4, the truncation parameter s ≤ d ≤ Cs. Then the sparse-tensor-decomposition-based initialization satisfies
| (A.12) |
for any k ∈ [K]
The proof of Lemma 10 essentially follows Theorem 3.9 in [27]; we thus omit the details here. The upper bound in (A.12) contains two terms, which are due to the empirical moment approximation and the incoherence among the different βk, respectively.
Although the sparse tensor decomposition is not statistically rate-optimal, it does offer a reasonable initial estimate provided there are enough samples. Equipped with (A.10) and Condition 2, the right-hand side of (A.12) reduces to
with probability at least 1−5/n. Denote C0 = 4·2160·C1C4. Using Conditions 3 and 5, we reach the conclusion that
with probability at least 1 − 5/n.
D. Proof of Theorem 1: Gradient Update
We first introduce the following lemma to illustrate the improvement of one thresholded gradient update under suitable conditions. The error bound includes two parts: the optimization error, which describes the effect of one gradient update, and the statistical error, which reflects the effect of the random noise. The proof of Lemma 11 is given in Section J. For notational simplicity, we drop the superscript in the following proof.
Lemma 11. Let t ≥ 0 be an integer. Suppose Conditions 1–5 hold and satisfies the following upper bound
| (A.13) |
with probability at least , where . As long as the step size μ satisfies
| (A.14) |
then can be upper bounded as
with probability at least .
In order to apply Lemma 11, we prove by induction that the required condition (A.13) holds at every iteration step t. When t = 0, by (IV.2) and Condition 2,
holds with probability at least . Since the initial estimator output by the first stage is normalized, i.e., , by the triangle inequality we have
Note that
This implies
with probability at least . Taking the summation over k ∈ [K], we have
with probability at least , which means (A.13) holds for t = 0.
Suppose (A.13) holds at the iteration step t − 1, which implies
Since Condition 5 automatically implies
for a sufficiently large C0, we can obtain
By induction, (A.13) holds at each iteration step.
Now we are able to use Lemma 11 recursively to complete the proof. Repeatedly using Lemma 11, we have for t = 1, 2, …,
with probability at least . This concludes the first part of Theorem 1.
When the total number of iterations is no smaller than
the statistical error will dominate the whole error bound in the sense that
| (A.15) |
with probability at least .
The next lemma shows that the Frobenius norm distance between two tensors can be bounded by the distances between each factors in their CP decomposition. The proof of this lemma is provided in Section K.
Lemma 12. Suppose and have CP-decomposition and . If , then
Denote . Combining (A.15) and Lemma 12, we have
with probability at least . By setting C1 = 9C2/4, we complete the proof of Theorem 1.
E. Proofs of Theorems 4 and 6: Minimax Lower Bounds
We first consider the proof of Theorem 6 on non-symmetric tensor estimation. Without loss of generality, assume p = max{p1, p2, p3}. We uniformly randomly generate MK support sets as subsets of {1, …, p} with cardinality s. Here M > 0 is a large integer to be specified later. Then we construct the candidate components as
λ > 0 will also be specified shortly. Clearly, for any 1 ≤ k ≤ K and 1 ≤ m1, m2 ≤ M, the overlap of the corresponding supports follows a hyper-geometric distribution: .
Let
| (A.16) |
then for any s/2 ≤ t ≤ s,
Thus, if η > 0, the moment generating function of satisfies
Here, (*) is due to η > 0 and ⌊s/2⌋ + 1 ≥ s/2. By setting η = log((p − s + 1)/(8s)), we have
| (A.17) |
Since p ≥ 20s and s ≥ 4, we have
Combining the two inequalities above, we have.
for c0 = 1/20.
Next we choose M = ⌊exp(c0/2 · sK log(p/s))⌋. Note that
then we further have
which means that, with positive probability, there exist candidates satisfying
| (A.18) |
For the rest of the proof, we fix to be the set of vectors satisfying (A.18).
Next, recall the canonical basis . Define
For each tensor and n i.i.d. Gaussian sketches , we denote the response
where . Clearly, y (y(m), u, v, w) follows a joint distribution, which may vary based on different values of m.
In this step, we analyze the Kullback-Leibler divergence between different distribution pairs:
Note that conditioning on fixed values of u, v, w,
By the KL-divergence formula for Gaussian distribution,
Therefore, for any m1 ≠ m2,
Meanwhile, for any 1 ≤ m1 < m2 ≤ M,
By generalized Fano’s Lemma (see, e.g., [60]),
Finally we set for some small constant c > 0, then
which finishes the proof of Theorem 6.
For the proof of Theorem 4, without loss of generality we assume K is a multiple of 3. We first partition {1, …, p} into two subintervals, I1 = {1, …, p − K/3} and I2 = {p − K/3 + 1, …, p}, randomly generate (MK/3) support subsets of {1, …, p − K/3}, and construct the candidate factors on these supports.
With M = exp(csK log(p/s)) and similar techniques as in the previous proof, one can show that, with positive probability,
We then construct the following candidate symmetric tensors by blockwise design,
Then we can see for any ,
The rest of the proof essentially follows from the proof of Theorem 6. ■
F. Proof of Theorem 7: High-order Stein’s Lemma
The proof of this theorem follows that of Theorem 6 in [49]. For the sake of completeness, we restate the details here. Applying the recursive representation of the score function (A.2), we have
Then, we apply the first-order Stein’s lemma (see Lemma 26) on function and obtain
Repeating the above argument two more times, we reach the conclusion. ■
G. Proofs of Lemmas 3, 4, and 5: Moment Calculation
In this subsection, we present the detail proofs of moment calculation, including non-symmetric case, symmetric case, and interaction model.
1). Proof of Lemma 3:
By the definition of {yi} in (VI.1) & (VI.2), we have
| (A.19) |
First, we observe due to the independence between ϵi and {ui,vi,wi}. Then, we consider a single component from a single observation
For notation simplicity, we drop the subscript i for i-th observation and k for k-th component such that
| (A.20) |
Each entry of M can be calculated as follows
which implies M = β1 ∘ β2 ∘ β3. Combining with n observations and K components, we can obtain
This finishes our proof. ■
2). Proof of Lemma 4:
In this subsection, we provide an alternative and more direct proof of Lemma 4. We consider a single component similar to (A.20) but with a symmetric structure, namely, . Based on the symmetry of both the underlying tensor and the sketchings, we verify the following three cases:
- When i = j = k, then
The last equation is due to ‖β*‖2 = 1.
- When i, j, k are mutually distinct, then
- When i = j ≠ k, then
Therefore, it is sufficient to calculate Ms by
The first term is a bias term due to correlations among the symmetric sketchings. Denote and note that . Therefore, the empirical first-order moment M1 can be used to remove the bias term as follows:
This finishes our proof. ■
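For the reader's convenience, the three cases above can be packaged into a single formula via Isserlis' (Wick's) theorem: for x ~ N(0, I_p) and a unit vector β,

\[
\mathbb{E}\big[(\beta^\top x)^3\, x_i x_j x_k\big]
= 6\,\beta_i\beta_j\beta_k + 3\big(\beta_i\delta_{jk}+\beta_j\delta_{ik}+\beta_k\delta_{ij}\big),
\]

so the rank-one signal appears with a constant factor, while the δ-terms constitute the bias that the empirical first-order moment M1 removes.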
3). Proof of Lemma 5:
As before, consider a single component first. For notational simplicity, we drop the subscript l for the l-th observation and k for the k-th component. Since each component is normalized, the entry-wise expectation of (β⊤x)³ x ∘ x ∘ x can be calculated as
Due to the symmetric structure and the non-randomness of the first coordinate, a bias appears in each entry. For i, j, k ≠ 0, we can use to remove the bias, as shown in the proof of Lemma 4 above. For subscripts involving 0, the following two calculations remove the bias:
This ends the proof. ■
H. Proof of Lemma 6
Recall that ‖X‖ψα is defined in Definition 1. Without loss of generality, we assume ‖X‖ψα = 1 and throughout this proof. Let β = (log 2)^{1/α} and Zi = (|Xi| − β)+, where (x)+ = x if x ≥ 0 and (x)+ = 0 otherwise. For notational simplicity, we define for a random variable X. The next step is to estimate the moments of linear combinations of the variables .
According to the symmetrization inequality (e.g., Proposition 6.3 of [61]), we have
| (A.21) |
where are independent Rademacher random variables, and we note that εiXi and εi|Xi| are identically distributed. Moreover, if |Xi| ≥ β, the definition of Zi implies that |Xi| = Zi + β, and if |Xi| < β, we have Zi = 0. Thus |Xi| ≤ Zi + β always holds, which leads to
| (A.22) |
By the triangle inequality,
| (A.23) |
Next, we bound the second term on the RHS of (A.23). In particular, we utilize the Khinchin–Kahane inequality, whose formal statement is included in Lemma 27 for the sake of completeness. From Lemma 27 we have
| (A.24) |
Since are independent Rademacher random variables, a simple calculation implies
| (A.25) |
| (A.26) |
Combining inequalities (A.22)–(A.25),
| (A.27) |
Let be independent symmetric random variables satisfying for all t ≥ 0. Then we have
which implies
| (A.28) |
since εiYi and Yi have the same distribution due to symmetry. Combining (A.27) and (A.28) together, we reach
| (A.29) |
For 0 < α < 1, it follows from Lemma 25 that
| (A.30) |
where C1(α) is some absolute constant only depending on α.
For α ≥ 1, we combine Lemma 24 with integration by parts to pass from a tail bound to a moment bound. Recall that for every non-negative random variable X, integration by parts yields the identity
Applying this to and making the change of variable t ↦ t^p, we have
| (A.31) |
where the inequality is from Lemma 24 for all p ≥ 2 and 1/α + 1/α* = 1. In the following, we bound the integral in three steps:
- If , (A.31) reduces to
Letting , we have
where the second equality follows from the density of a Gamma random variable. Thus,
| (A.32) |
Since 0 < β < 1, the conclusion follows by combining (A.29), (A.30), and (A.34). ■
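For completeness, the integration-by-parts identity invoked in the α ≥ 1 case above is the standard tail-to-moment formula: for any non-negative random variable Y and any p ≥ 1,

\[
\mathbb{E}[Y]=\int_0^\infty \mathbb{P}(Y>t)\,dt,
\qquad
\mathbb{E}[Y^p]=p\int_0^\infty t^{p-1}\,\mathbb{P}(Y>t)\,dt,
\]

where the second identity follows from the first via the change of variable t ↦ t^p.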
I. Proof of Lemma 9
Firstly, let us consider the non-symmetric perturbation error analysis. According to Lemma 3, the exact form of is given by
We decompose it into a concentration term and a noise term as follows:
| (A.35) |
where
Bounding : For the k-th component of , we denote
Define
By using Lemma 2 and s ≤ d ≤ Cs, it suffices to have for some absolute constant C11,
with probability at least 1 − 10/n3, where ‖ · ‖s+d is the sparse tensor spectral norm defined in (II.3). Equipped with the triangle inequality, the sparse tensor spectral norm for can be bounded by
| (A.36) |
with probability at least 1 − 10K/n3.
Bounding : Note that the random noise is independent of the sketching vectors {ui,vi,wi}. For fixed , applying Lemma 20, we have, for some absolute constant C12,
with probability at least 1−1/p. According to Lemma 23, we have
| (A.37) |
Bounding : Putting (A.36) and (A.37) together, we obtain
with probability at least 1 − 5/n. Under Condition 9, we have
with probability at least 1 − 5/n.
The perturbation error analysis for the symmetric tensor estimation model and the interaction effect model is similar since the empirical first-order moment converges much faster than the empirical third-order moment. So we omit the detailed proof here. ■
J. Proof of Lemma 11
Lemma 11 quantifies one step of the thresholded gradient update. The proof consists of two parts.
First, we evaluate an oracle estimator with known support information, which is defined as
| (A.38) |
Here,
is the k-th component of h(B(t)) defined in (III-B)
, where .
For a vector and a subset A ⊂ {1,…,p}, we denote by the vector obtained by keeping the coordinates of x with indices in A unchanged and setting all other coordinates to zero (see the short sketch below).
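As a minimal illustration of this restriction operator (a hypothetical helper, not from the paper):

```python
import numpy as np

def restrict(x, A):
    """Keep the coordinates of x indexed by A and set all others to zero."""
    out = np.zeros_like(x)
    idx = list(A)
    out[idx] = x[idx]
    return out
```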
We will show that converges at a geometric rate in the optimization error and at the optimal rate in the statistical error. See Lemma 13 for details.
Second, we aim to prove that and are almost equivalent with high probability. See Lemma 14 for details. For simplicity, we drop the superscript of , F(t) in the following proof, and denote , and F(t+1) by , and F+ respectively.
Lemma 13. Suppose Conditions 1–5 hold. Assume (A.13) is satisfied and |F| ≲ Ks. As long as the step size μ ≤ 32R−20/3/(3K[220 + 270K]2), we obtain the upper bound for ,
| (A.39) |
with probability at least 1 − (21K2 + 11K + 4Ks)/n.
The proof of Lemma 13 is postponed to Section L. The next lemma guarantees that, with high probability, is equivalent to the oracle update .
Lemma 14. Recall that the truncation level h(βk) is defined as
| (A.40) |
If |F| ≲ Ks, we have for any k ∈ [K] with probability at least 1 − (n2p)−1 and F+ ⊂ F.
The proof of Lemma 14 is postponed to Section M. By using Lemma 14 and induction, we have
This implies that |F(t)| ≲ Ks for every t. Combining Lemmas 13 and 14, we obtain, with probability at least 1 − (21K2 + 11K + 4Ks)/n,
| (A.41) |
This ends the proof. ■
K. Proof of Lemma 12
Based on the CP low-rank structure of the true tensor parameter , we can explicitly write down the distance between and under the tensor Frobenius norm as follows:
For notational simplicity, denote . Then
Since (a + b + c)2 ≤ 3(a2 + b2 + c2), we have
Equipped with the Cauchy–Schwarz inequality, the RHS can be further bounded by
At the same time, using for k ∈ [K],
For the non-symmetric tensor estimation model, we have
Following the same strategy above, we obtain
This ends the proof. ■
L. Proof of Lemma 13
First of all, we state a lemma to illustrate the effect of the weight ϕ. The proof of Lemma 15 is deferred to Section N.
Lemma 15. Suppose come from either the non-symmetric tensor estimation model (VI.1) or the symmetric tensor estimation model (III.1), and that Conditions 3–5 hold. Then is upper and lower bounded by
with probability at least 1 − (K2 + K + 3)/n, where Γ is the incoherence parameter defined in Definition 3.
According to Lemma 15, approximates up to constants with high probability. Moreover, from (A.13) we know that for some small ε0. Based on these two facts, we replace ηk by and ϕ by for simplicity; this change only affects constant factors in the final results. A similar simplification was used in the matrix recovery scenario [62]. Therefore, we define the weighted estimator and the weighted true parameter as , . Now, . Recall that · is the loss function defined in (III.4). Correspondingly, with a slight abuse of notation, define the gradient function on F as
and its noiseless version as
| (A.42) |
According to the definition of thresholding function (III.8), can be written as
where satisfies and is defined as
| (A.43) |
Moreover, we denote . With a slight abuse of notation, we also drop the subscript F in this section for simplicity.
We expand and decompose the sum of squared errors into three parts as follows:
| (A.44) |
In the following, we bound the three parts sequentially.
1). Bounding gradient update effect:
In order to separate the optimization error and statistical error, we use the noiseless gradient as a bridge such that A can be decomposed as
| (A.45) |
where A1 and A2 quantify the optimization error, A3 quantifies the statistical error, and A4 is a cross term that is negligible compared with the rate of the statistical error. The lower bound for A1 and the upper bound for A2 together correspond to the verification of regularity conditions in the matrix recovery case [52].
Step One: Lower bound for A1.
Plugging in , we have
| (A.46) |
According to the definition of the noiseless gradient and zk, A1 can be expanded and decomposed into nine terms,
| (A.47) |
| (A.48) |
where A11 is the main term according to the order of , while A12 to A19 are remainder terms. The proofs of the lower bounds for A11 to A19 follow two steps:
- Calculate and lower bound the expectation of each term via Lemma A.2 (high-order Gaussian moments);
- Argue that the empirical version concentrates around its expectation with high probability via Lemma 1 (high-order concentration inequality).
Bounding A11. Note that A11 involves products of dependent Gaussian vectors. This brings difficulties in both the calculation of expectations and the use of concentration inequalities. According to the high-order Gaussian moment results in Lemma A.2, the expectation of A11 can be calculated explicitly as
| (A.49) |
Note that I1 to I4 each involve a summation of K2 terms. To use the incoherence Condition 3, we isolate the K terms with k = k′. Then I1 to I4 can be lower bounded as
where Γ is the incoherence parameter. Putting the above four bounds together yields
| (A.50) |
On the other hand, repeatedly using Lemma 1, we obtain that with probability at least 1 − 1/n,
Summing over k, k′ ∈ [K], this further implies that, for some absolute constant C,
| (A.51) |
with probability at least 1 − K2/n. Combining (A.50) and (A.51), we obtain, with probability at least 1 − K2/n,
| (A.52) |
where . Here, we use the facts that Γ ≤ 1 and .
Bounding A12 to A19: For remainder terms, we follow the same proof strategy. According to Lemma A.2, the expectation of A12 can be calculated as
Let us analyze I1 first. Under (A.13), , it suffices to show that
This immediately implies a lower bound for once I2, I3, and I4 are bounded similarly,
| (A.53) |
By Lemma 1, we obtain for some absolute constant C,
| (A.54) |
with probability at least 1−K2/n. The detailed derivation is the same as in (A.52), so we omit it here.
Similarly, the lower bounds of A13 to A19 can be derived as follows
| (A.55) |
Putting (A.52), (A.54) and (A.55) together, we have with probability at least 1 − 9K2/n,
For the above bound,
- When the sample size satisfies
we have
- When ε0 ≤ K−1R−2/2160, we have
- When the incoherence parameter satisfies Γ ≤ K−1/2/216, we have
Note that the above conditions are fulfilled by Conditions 3, 5, and (A.13). Thus, we can simplify A1 as
| (A.56) |
with probability at least 1 − 9K2/n.
Step Two: Upper bound for A2.
We observe the fact that
| (A.57) |
where is the unit sphere. Equivalently, it suffices to show that is upper bounded for any . According to the definition of the noiseless gradient (A.42), is written explicitly as
Following (A.46) and (A.48), a similar decomposition can be made for as follows, where the only difference is that we replace one by .
Let us bound first. Using the same technique as in the calculation of (A.49), we derive an upper bound for .
Equipped with Lemma 2 and the definition of the sparse tensor spectral norm (II.3), it suffices to bound by
with probability at least 1−10K2/n3, where δn,p,s is defined in (IV.7).
The upper bounds for to take similar forms. Combining them, we can derive an upper bound for as follows:
with probability at least 1 − 90K2/n3, where the second inequality utilizes Condition 5. Therefore, the upper bound of A2 is given as follows
| (A.58) |
with probability at least 1 − 90K2/n3.
Step Three: Upper bound for A3.
By the definitions of the noisy and noiseless gradients, A3 is written explicitly as
where the second inequality comes from (A.46). For fixed , applying Lemma 1, we have
with probability at least 1 − 1/n. Together with Lemma 23, we obtain for any j ∈ [Ks],
with probability at least 1 − 4/n, where σ is the noise level. According to (A.13),
which further implies . Equipped with union bound over j ∈ [Ks],
with probability at least 1 − 4Ks/n. Letting ,
| (A.59) |
with probability at least 1 − 4Ks/n.
Step Four: Upper bound for A4.
This cross term can be written as
To bound this term, we take the same approach as in Step Three, fixing the noise term first. Similarly, we obtain, with probability at least 1 − 4K/n,
| (A.60) |
This term is of negligible order compared with (A.59).
Summary. Putting the bounds (A.56), (A.58), (A.59), and (A.60) together, we obtain an upper bound for the gradient update effect as follows:
| (A.61) |
with probability at least 1 − (18K2 + 4K + 4Ks)/n. ■
2). Bounding thresholding effect:
The thresholding effect term in (A.44) can also be decomposed into optimization error and statistical error. Recall that B can be explicitly written as
where supp(γk) ⊂ Fk and ‖γk‖∞ ≤ 1. By using (a + b)2 ≤ 2(a2 + b2), we have
where
Bounding B1. This optimization error term shares a similar structure with (A.57) but is of higher order. Therefore, we follow the same idea as in bounding (A.57). By (A.46) and some basic expansions and inequalities,
The main term is , according to the order of . We bound it first. Note that there exists some large positive constant C such that
Together with Lemma 1 and (A.13), we have
with probability at least 1 − 3K2/n. Overall, the upper bound of B1 takes the form
| (A.62) |
with probability at least 1 − 3K2/n.
Bounding B2. We rewrite B2 by
For fixed , according to Lemma 1, we have
Note that . This reduces to
From Lemma 23, with probability at least 1 − 3/n,
Combining the above two inequalities, we obtain
| (A.63) |
with probability at least. Plugging in the definition of ϕ and (A.13), B2 is upper bounded by
| (A.64) |
with probability at least 1 − 7K/n.
Summary. Putting the bounds (A.62) and (A.64) together, we obtain a similar upper bound for the thresholding effect:
| (A.65) |
with probability at least 1 − (3K2 + 7K)/n. ■
3). Ensemble:
From the definition of γk, it is not hard to see that the cross term C is in fact equal to zero. Combining the upper bounds for the gradient update effect (A.61) and the thresholding effect (A.65), we obtain
As long as the step size μ satisfies
we reach the conclusion
| (A.66) |
with probability at least 1 − 4Ks/n.
M. Proof of Lemma 14
Let us consider the k-th component first. Without loss of generality, suppose F ⊂ {1,2,…,Ks}. For j = Ks + 1,…,p,
| (A.67) |
and it is not hard to see that and xij are independent. Applying the standard Hoeffding inequality, we have, with probability at least ,
Equipped with union bound, with probability at least ,
Therefore, according to the definition of thresholding function φ(x), we obtain the following equivalence,
| (A.68) |
holds for all k ∈ [K] with probability at least . Moreover, (A.68) gives for every k ∈ [K], which further implies F+ ⊂ F. This ends the proof. ■
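For reference, the proof invokes the standard Hoeffding inequality; in its sub-Gaussian form it reads as follows (a reminder only; c denotes an unspecified absolute constant): if Z1, …, Zn are independent, mean-zero, sub-Gaussian random variables with ‖Zi‖ψ2 ≤ K, then for every t ≥ 0,

\[
\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n Z_i\Big|\ge t\Big)\ \le\ 2\exp\Big(-\frac{c\,n\,t^2}{K^2}\Big).
\]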
N. Proof of Lemma 15
First, we consider the symmetric case. According to the definition of from the symmetric tensor estimation model (III.1), we separate the random noise ϵi via the following expansion:
| (A.69) |
Bounding I1. We expand the i-th component of I1 as follows.
| (A.70) |
As shown in Corollary A.2, the expectations of the above two parts take the forms
Recall that for any k ∈ [K], and that Condition 3 implies for any ki ≠ kj, where Γ is the incoherence parameter. Thus, is upper bounded by
| (A.71) |
By using the concentration result in Lemma 1, we have, with probability at least 1 − 1/n,
| (A.72) |
Putting (A.70), (A.71), and (A.72) together, we obtain an upper bound for I1, namely
| (A.73) |
with probability at least 1 − K2/n.
Bounding I2. Since the random noise has mean zero and is independent of {xi}, we have
By using the independence and Corollary 1, we have
This further implies that
| (A.74) |
with probability at least 1 − 4K/n.
Bounding I3. As shown in Lemma 23, the random noise ϵi with sub-exponential tail satisfies
| (A.75) |
with probability at least 1 − 3/n.
Overall, putting (A.73), (A.74) and (A.75) together, we have with probability at least 1 − (K2 + 4K + 3)/n,
Under Conditions 4 & 5, the above bound reduces to
with probability at least 1 − (K2 + 4K + 3)/n. The proof of the lower bound is similar and hence omitted.
Similar results also hold for the non-symmetric tensor estimation model. Throughout the proof, the only difference is that
O. Non-symmetric Tensor Estimation
1). Conditions and Algorithm:
In this subsection, we provide several essential conditions for Theorem 5 and the detailed algorithm for non-symmetric tensor estimation.
Condition 6 (Uniqueness of CP-decomposition). The CP-decomposition form (VI.2) is unique in the sense that if there exists another CP-decomposition , it must have K = K′ and be invariant up to a permutation of {1,…,K}.
Condition 7 (Parameter space). The CP-decomposition of satisfies
for some absolute constants C1,C2.
Condition 8 (Parameter incoherence). The true tensor components are incoherent such that
Condition 9 (Random noise). We assume the random noise follows a sub-exponential tail with parameter σ satisfying .
2). Proof of Theorem 5:
The main distinguishing part of the proof for the non-symmetric update is Lemma 16 (one-step oracle estimator), which parallels Lemma 11. For brevity, we restrict our attention to the rank-one case and only provide the theoretical development for the one-step oracle estimator in this subsection. The generalization to the general-rank case follows exactly the same idea as in the proof of the symmetric update, by incorporating the incoherence Condition 8.
For rank-one non-symmetric tensor estimation, the model (VI.1) reduces to
Suppose , , and denote s = max{s1,s2,s3}. Define , and the oracle estimator as
where has the form of
| (A.76) |
The definitions of and are similar.
Lemma 16. Let t ≥ 0 be an integer. Suppose Conditions 6–9 hold and satisfies the following upper bound
| (A.77) |
with probability at least 1 − CO(1/n). Assume the step size μ satisfies 0 < μ < μ0 for some small absolute constant μ0 and s ≤ d ≤ Cs. Then can be upper bounded as
with probability at least 1 − 12s/n.
Proof. We focus on j = 1 first. To simplify the notation, we drop the superscript for the iteration index t and denote the iteration index t + 1 by +. Moreover, denote , for j = 1, 2, 3. Then the gradient function is rewritten as
According to the definition of the thresholding function, can be written explicitly as
where , supp(γ) ⊂ F, and ‖γ‖∞ ≤ 1. Then the oracle estimation error can be decomposed into the gradient update effect and the thresholding effect,
| (A.78) |
By using the tri-convex structure of , we borrow the analysis tools for vanilla gradient descent [55] given a sufficiently good initialization. Following this proof strategy, we decompose the gradient update effect in (A.78) into three parts,
where is the noiseless gradient defined in (A.42). We will bound I1, I2, I3, and I4 successively in the following four subsections. For simplicity, in the following proof we drop the index subscript F, as we did in Section L. Moreover, approximates η*2 up to a constant by Lemma 15.
3). Bounding I1:
In this section, let us denote
| (A.79) |
where . When β2 and β3 are fixed, the update can be treated as a vanilla gradient descent update. The proof proceeds in three steps: the first two show that is Lipschitz differentiable and strongly convex on the constraint set F, and the last utilizes the classical convex gradient analysis.
Step One:
Verify that is L-Lipschitz differentiable. For any and whose supports belong to F,
Then, there exist such that
Applying Lemma 2 and multiplying by , we obtain
with probability at least 1 − 10/n3, where δn,p,s is defined in (IV.7). Under Condition (5) with some constant adjustments, we obtain
| (A.80) |
with probability at least 1 − 10/n3. Therefore, is Lipschitz differentiable with Lipschitz constant .
Step Two:
Verify that is α-strongly convex. It is equivalent to proving that . Based on inequality (3.3.19) in [63], we have
| (A.81) |
The lower bound of breaks into two parts: a lower bound for and an upper bound for . The Hessian matrix of is given by
Since ui, vi, wi are mutually independent, we have , which implies . On the other hand,
where . Equipped with Lemma 2, we obtain, with probability at least 1 − 10/n3,
Together with the lower bound of , we have
Under Condition 5, the minimum eigenvalue of the Hessian matrix is lower bounded by with probability at least 1 − 10/n3. This guarantees that is strongly convex with .
Step Three:
Combining the Lipschitz condition, strong convexity, and Lemma 3.11 in [55], we have
Since the gradient vanishes at the optimal point, multiplying the above inequality by 2μ simplifies it to
| (A.82) |
Now it suffices to bound as follows:
where L and α are the Lipschitz constant and the strong convexity parameter, respectively. If , the last term can be neglected and we obtain the desired upper bound,
| (A.83) |
with probability at least 1 − 20/n3. This completes the bound for I1. ■
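For reference, the classical ingredient from smooth, strongly convex optimization invoked in Step Three (recorded here in its standard textbook form; see [55]) is the co-coercivity bound: if f is α-strongly convex and L-smooth, then for all x, y,

\[
\langle \nabla f(x)-\nabla f(y),\,x-y\rangle\ \ge\ \frac{\alpha L}{\alpha+L}\,\|x-y\|_2^2+\frac{1}{\alpha+L}\,\|\nabla f(x)-\nabla f(y)\|_2^2 .
\]

Applied with y at the minimizer and combined with a step size μ ≤ 2/(α + L), this yields the geometric contraction of the gradient step used in (A.82).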
4). Bounding I2:
For simplicity, we write , , . By the definition of the noiseless gradient, we decompose I2 as
Repeatedly using Lemma 2, we obtain
for sufficiently small ε0, with probability at least 1 − 60/n3. Under Condition 5, we obtain
| (A.84) |
with probability at least 1 − 6/n.
5). Bounding I3:
I3 quantifies the statistical error. By the definitions of the noiseless and noisy gradients, we have
The proof of this part essentially coincides with the proof for symmetric tensor estimation. Combining Lemmas 1 and 23, we have
with probability at least 1 − 4/n. Applying a union bound over the 3s coordinates, we obtain
Therefore, we reach
with probability at least 1 − 12s/n.
6). Bounding I4:
According to the definition of the thresholding level h(β1) in (A.76), we can bound the squared term as follows:
Based on the basic inequality (a + b)2 ≤ 2(a2 + b2), we have
Denote by I1 and I2 the terms corresponding to the optimization error and the statistical error, respectively,
Next, I1 is decomposed into high-order polynomial terms as follows
| (A.85) |
Each term contains products of Gaussian random vectors of order up to ten. For the first term, by Lemma 1,
with probability at least 1 − 1/n. Similar bounds hold for the other terms. As long as n ≥ C(log n)^10, we have, with probability at least 1 − 7/n,
| (A.86) |
Now we turn to bounding I2. For fixed {ϵi}, we have
with probability at least 1 − n−1. Combining with Lemma 23,
| (A.87) |
Putting (A.86) and (A.87) together, the thresholding effect can be bounded by
| (A.88) |
with probability at least 1 − 8/n, provided n ≳ (log n)^10. ■
7). Summary:
Putting the upper bounds (A.83), (A.84), and (A.88) together, we obtain that if the step size μ satisfies 0 < μ < μ0 for some small μ0,
with probability at least 1−12s/n. This finishes our proof. ■
P. Matrix-Form Gradient and Stochastic Gradient Descent
1). Matrix Formulation of Gradient:
In this section, we provide detailed derivations of (III.7) and (VI.5).
Lemma A.1. Let and . The gradient of symmetric tensor estimation empirical risk function (III.5) can be written in a matrix form as follows
Proof. First, consider the gradient for the k-th component,
for k = 1,…,K. Correspondingly, each part can be written in matrix form,
This implies that . Note that . The conclusion can be easily derived. ■
Lemma 17. Let , . The gradient of non-symmetric tensor estimation empirical risk function (VI.3) can be written in a matrix form as follows
where em and .
Proof. Recall that * and ⊙ denote the Hadamard product and the Khatri–Rao product, respectively. The dimensions of D, C1, and C1 ⊙ U can then be calculated as follows:
Therefore,
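To make the matrix-form computation concrete, the following minimal numpy sketch (illustrative only; the dimensions are arbitrary and the helper is not the paper's implementation) spells out the Khatri–Rao product ⊙ as a column-wise Kronecker product, while the Hadamard product * is plain entry-wise multiplication:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (I x K) and (J x K) -> (I*J x K)."""
    I, K = A.shape
    J, K2 = B.shape
    assert K == K2, "both factors need the same number of columns"
    # column k of the result is kron(A[:, k], B[:, k])
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, K)

rng = np.random.default_rng(0)
p2, p3, K = 5, 4, 3
V = rng.normal(size=(p2, K))
W = rng.normal(size=(p3, K))
print(khatri_rao(V, W).shape)   # (p2 * p3, K)
```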
2). Stochastic Gradient Descent:
Stochastic thresholded gradient descent is a stochastic approximation of the gradient descent method. Note that the empirical risk function (III.5) can be written as a sum of differentiable functions. Following (III.7), the gradient of (III.5) evaluated at the i-th sketching {yi, xi} can be written as
Thus, the overall gradient defined in (III.7) can be expressed as a sum of ,
The thresholding step remains the same as Step 3 in Algorithm 1. The symmetric update of stochastic thresholded gradient descent within one iteration is then summarized as
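The following sketch illustrates one such stochastic update for the symmetric model (a minimal sketch only: it assumes the plain least-squares risk with the component weights absorbed into the columns of B, and uses keep-top-s hard thresholding as a stand-in for the paper's thresholding step):

```python
import numpy as np

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-s:]
    out[idx] = v[idx]
    return out

def stochastic_thresholded_step(B, x_i, y_i, mu, s):
    """One stochastic thresholded gradient step, symmetric cubic sketching.

    B   : (p, K) matrix whose columns are the current components.
    x_i : (p,) sketching vector;  y_i : scalar response.
    mu  : step size;  s : sparsity level kept after thresholding.
    """
    proj = B.T @ x_i                                   # beta_k^T x_i for all k
    residual = y_i - np.sum(proj ** 3)                 # y_i - sum_k (beta_k^T x_i)^3
    grad = -3.0 * residual * np.outer(x_i, proj ** 2)  # gradient of 0.5 * residual^2 w.r.t. B
    B_new = B - mu * grad
    return np.column_stack([hard_threshold(B_new[:, k], s) for k in range(B.shape[1])])
```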
Q. Technical Lemmas
Lemma 18. Suppose is a standard Gaussian random vector. For any non-random vector , we have the following tensor expectation calculation,
| (A.89) |
where em is a canonical vector in .
Proof. Recall that for a standard Gaussian random variable x, its odd moments are zero and its even moments are , . Expanding the LHS of (A.89) and comparing the two sides, we reach the conclusion. Details are omitted here. ■
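Explicitly, the standard Gaussian moments used in this expansion are

\[
\mathbb{E}[x^{2m-1}]=0,\qquad \mathbb{E}[x^{2m}]=(2m-1)!!=\frac{(2m)!}{2^m\,m!},\qquad m=1,2,\dots
\]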
Lemma 19. Suppose , , are independent standard Gaussian random vectors. For any non-random vector , , , we have the following tensor expectation calculation
| (A.90) |
Proof. Due to the independence among u,v,w, the conclusion is easy to obtain by using the moment of standard Gaussian random variable. ■
Note that the left-hand side of (A.89) involves the expectation of a rank-one tensor. Multiplying both sides by any non-random rank-one tensor of the same dimensions, i.e., a1 ∘ b1 ∘ c1, facilitates the calculation of expectations of products of Gaussian vectors; see the next lemma for details.
Lemma A.2. Suppose is a standard Gaussian random vector. For any non-random vector , we have the following expectation calculation
Proof. Note that . Then we can apply the general result in Lemma 18. Comparing both sides, we obtain the conclusion. The other parts follow a similar strategy. ■
The next lemma provides a probabilistic concentration bound for non-symmetric rank-one tensors under the tensor spectral norm.
Lemma 20. Suppose , , are three n × p random matrices. The ψ2-norm of each entry is bounded: ‖Xij‖ψ2 = Kx, ‖Yij‖ψ2 = Ky, ‖Zij‖ψ2 = Kz. We assume the rows of X, Y, Z are independent. Then there exists an absolute constant C such that
Here, ‖ · ‖s is the sparse tensor spectral norm defined in (II.3) and .
Proof. Bounding a spectral norm typically relies on the construction of an ϵ-net. Since we bound a sparse tensor spectral norm, our strategy is to discretize the sparse set and construct an ϵ-net on each piece. Let us define a sparse set , and let be the s-dimensional set defined by . Note that corresponds to the set of s-sparse unit vectors, which can be expressed as a union of subsets of dimension s by padding zeros, namely . There are at most such subsets.
Recalling the definition of sparse tensor spectral norm in (II.3), we have
Instead of constructing an ϵ-net on , we construct an ϵ-net for each of the subsets . Define as the 1/2-net of . From Lemma 3.18 in [64], the cardinality of is bounded by 5^s. By Lemma 21, we obtain
| (A.91) |
By the rotation invariance of sub-Gaussian random variables, , , are still sub-Gaussian random variables with ψ2-norms bounded by Kx, Ky, Kz, respectively. Applying Lemma 1 and a union bound over , the right-hand side of (A.91) can be bounded by
with probability smaller than (5^s)^3 δ for any 0 < δ < 1.
Lastly, taking the union bound over all possible subsets yields that
Letting , we obtain with probability at least 1 − 1/p,
with some adjustments of the constant C. The proof for the symmetric case is similar to the non-symmetric case, so we omit it here. ■
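To illustrate the quantity being controlled, the following brute-force sketch (purely illustrative, feasible only for very small p and s; it is not the paper's algorithm) approximates the s-sparse tensor spectral norm of (II.3) by enumerating support triples and running a few higher-order power iterations on each subtensor. The enumeration over support triples mirrors the union bound over subsets taken in the proof above.

```python
import itertools
import numpy as np

def rank1_power(T, iters=50, seed=0):
    """Approximate max over unit u, v, w of |<T, u o v o w>| by power iteration."""
    rng = np.random.default_rng(seed)
    p1, p2, p3 = T.shape
    v = rng.normal(size=p2); v /= np.linalg.norm(v)
    w = rng.normal(size=p3); w /= np.linalg.norm(w)
    for _ in range(iters):
        u = np.einsum('ijk,j,k->i', T, v, w); u /= np.linalg.norm(u)
        v = np.einsum('ijk,i,k->j', T, u, w); v /= np.linalg.norm(v)
        w = np.einsum('ijk,i,j->k', T, u, v); w /= np.linalg.norm(w)
    return abs(np.einsum('ijk,i,j,k->', T, u, v, w))

def sparse_spectral_norm(T, s):
    """Brute-force approximation of the s-sparse tensor spectral norm."""
    p1, p2, p3 = T.shape
    best = 0.0
    for S1 in itertools.combinations(range(p1), s):
        for S2 in itertools.combinations(range(p2), s):
            for S3 in itertools.combinations(range(p3), s):
                best = max(best, rank1_power(T[np.ix_(S1, S2, S3)]))
    return best
```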
Lemma 21 (Tensor Covering Number(Lemma 4 in [65])). Let be an ϵ-net for a set B associated with a norm ‖ · ‖. Then, the spectral norm of a d-mode tensor is bounded by
This immediately implies that the spectral norm of a d-mode tensor is bounded by
where is the ϵ-net for the unit sphere in .
Lemma 22 (Sub-Gaussianity of the Product of Random Variables). Suppose X1 is a bounded random variable with |X1| ≤ K1 almost surely for some K1, and X2 is a sub-Gaussian random variable with Orlicz norm ‖X2‖ψ2 ≤ K2. Then X1X2 is still a sub-Gaussian random variable with Orlicz norm ‖X1X2‖ψ2 ≤ K1K2.
Proof: Following the definition of sub-Gaussian random variable, we have
holds for all t ≥ 0. This ends the proof. ■
Lemma 23 (Tail Probability for the Sum of Sub-exponential Random Variables (Lemma A.7 in [48])). Suppose ϵ1,…, ϵn are independent centered sub-exponential random variables with
Then with probability at least 1 − 3/n, we have
for some constant C0
Lemma 24 (Tail Probability for the Sum of Weibull Distributions (Lemma 3.6 in [34])). Let α ∈ [1,2] and Y1,…,Yn be independent symmetric random variables satisfying. Then for every vector and every t ≥ 0,
Proof. It is a combination of Corollaries 2.9 and 2.10 in [58].
Lemma 25 (Moments for the Sum of Weibull Distributions (Corollary 1.2 in [66])). Let X1,X2,…,Xn be a sequence of independent symmetric random variables satisfying , where 0 < α < 1. Then, for p ≥ 2 and some constant C(α) which depends only on α,
Lemma 26 (Stein’s Lemma [56]). Let be a random vector with joint density function p(x). Suppose the score function ∇x logp(x) exists. Consider any continuously differentiable function . Then, we have
Lemma 27 (Khinchin–Kahane Inequality (Theorem 1.3.1 in [67])). Let be a finite non-random sequence, let be a sequence of independent Rademacher variables, and let 1 < p < q < ∞. Then
Lemma 28. Suppose each non-zero element of is drawn from standard Gaussian distribution and ‖xk‖0 ≤ s for k ∈ [K]. Then we have for any 0 < δ ≤ 1,
where C is some constant.
Proof. Let us denote as an index set such that for any , we have and . From the definition of , we know that and . We apply standard Hoeffding’s concentration inequality,
Letting ct2/s = log(1/δ), we reach the conclusion.
Contributor Information
Botao Hao, Department of Electrical Engineering, Princeton University, Princeton, NJ 08540.
Anru Zhang, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706.
Guang Cheng, Department of Statistics, Purdue University, West Lafayette, IN 47906.
References
- [1].Kroonenberg PM, Applied Multiway Data Analysis. Wiley Series in Probability and Statistics, 2008.
- [2].Kolda T and Bader B, “Tensor decompositions and applications,” SIAM Review, vol. 51, pp. 455–500, 2009.
- [3].Zhou H, Li L, and Zhu H, “Tensor regression with applications in neuroimaging data analysis,” Journal of the American Statistical Association, vol. 108, pp. 540–552, 2013.
- [4].Li X, Xu D, Zhou H, and Li L, “Tucker tensor regression and neuroimaging analysis,” Statistics in Biosciences, vol. 10, no. 3, pp. 520–545, 2018.
- [5].Sun WW and Li L, “Store: sparse tensor response regression and neuroimaging analysis,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 4908–4944, 2017.
- [6].Caiafa CF and Cichocki A, “Multidimensional compressed sensing and their applications,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 3, no. 6, pp. 355–380, 2013.
- [7].Friedland S, Li Q, and Schonfeld D, “Compressive sensing of sparse tensors,” IEEE Transactions on Image Processing, vol. 23, no. 10, pp. 4438–4447, 2014.
- [8].Liu J, Musialski P, Wonka P, and Ye J, “Tensor completion for estimating missing values in visual data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 208–220, 2013.
- [9].Yuan M and Zhang C-H, “On tensor completion via nuclear norm minimization,” Foundations of Computational Mathematics, vol. 16, no. 4, pp. 1031–1068, 2016.
- [10].Yuan M and Zhang C-H, “Incoherent tensor norms and their applications in higher order tensor completion,” IEEE Transactions on Information Theory, vol. 63, no. 10, pp. 6753–6766, 2017.
- [11].Zhang A, “Cross: Efficient low-rank tensor completion,” The Annals of Statistics, vol. 47, no. 2, pp. 936–964, 2019.
- [12].Montanari A and Sun N, “Spectral algorithms for tensor completion,” Communications on Pure and Applied Mathematics, vol. 71, no. 11, pp. 2381–2425, 2018.
- [13].Ghadermarzy N, Plan Y, and Yilmaz Ö, “Near-optimal sample complexity for convex tensor completion,” Information and Inference: A Journal of the IMA, vol. 8, no. 3, pp. 577–619, 2018.
- [14].Zhang Z and Aeron S, “Exact tensor completion using t-svd,” IEEE Transactions on Signal Processing, vol. 65, no. 6, pp. 1511–1526, 2016.
- [15].Bengua JA, Phien HN, Tuan HD, and Do MN, “Efficient tensor completion for color image and video recovery: Low-rank tensor train,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2466–2479, 2017.
- [16].Raskutti G, Yuan M, Chen H, et al., “Convex regularization for high-dimensional multiresponse tensor regression,” The Annals of Statistics, vol. 47, no. 3, pp. 1554–1584, 2019.
- [17].Chen H, Raskutti G, and Yuan M, “Non-convex projected gradient descent for generalized low-rank tensor regression,” The Journal of Machine Learning Research, vol. 20, no. 1, pp. 172–208, 2019.
- [18].Li L and Zhang X, “Parsimonious tensor response regression,” Journal of the American Statistical Association, pp. 1–16, 2017.
- [19].Zhang A, Luo Y, Raskutti G, and Yuan M, “Islet: Fast and optimal low-rank tensor regression via importance sketching,” arXiv preprint arXiv:1911.03804, 2019.
- [20].Romera-Paredes B, Aung MH, Bianchi-Berthouze N, and Pontil M, “Multilinear multitask learning,” in Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, pp. III–1444–III–1452, JMLR.org, 2013.
- [21].Bien J, Taylor J, Tibshirani R, et al., “A lasso for hierarchical interactions,” The Annals of Statistics, vol. 41, no. 3, pp. 1111–1141, 2013.
- [22].Hao N and Zhang HH, “Interaction screening for ultrahigh-dimensional data,” Journal of the American Statistical Association, vol. 109, no. 507, pp. 1285–1301, 2014.
- [23].Fan Y, Kong Y, Li D, and Lv J, “Interaction pursuit with feature screening and selection,” arXiv preprint arXiv:1605.08933, 2016.
- [24].Basu S, Kumbier K, Brown JB, and Yu B, “Iterative random forests to discover predictive and stable high-order interactions,” Proceedings of the National Academy of Sciences, p. 201711236, 2018.
- [25].Li N and Li B, “Tensor completion for on-board compression of hyperspectral images,” in 2010 IEEE International Conference on Image Processing, pp. 517–520, IEEE, 2010.
- [26].Vasilescu MAO and Terzopoulos D, “Multilinear subspace analysis of image ensembles,” in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, vol. 2, pp. II–93, IEEE, 2003.
- [27].Sun WW, Lu J, Liu H, and Cheng G, “Provable sparse tensor decomposition,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 79, no. 3, pp. 899–916, 2017.
- [28].Rauhut H, Schneider R, and Stojanac Ž, “Low rank tensor recovery via iterative hard thresholding,” Linear Algebra and its Applications, vol. 523, pp. 220–262, 2017.
- [29].Li X, Haupt J, and Woodruff D, “Near optimal sketching of low-rank tensor regression,” in Advances in Neural Information Processing Systems, pp. 3466–3476, 2017.
- [30].Wang Z, Liu H, and Zhang T, “Optimal computational and statistical rates of convergence for sparse nonconvex learning problems,” The Annals of Statistics, vol. 42, no. 6, p. 2164, 2014.
- [31].Loh P-L and Wainwright MJ, “Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima,” Journal of Machine Learning Research, vol. 16, pp. 559–616, 2015.
- [32].Cai TT and Zhang A, “Rop: Matrix recovery via rank-one projections,” The Annals of Statistics, vol. 43, no. 1, pp. 102–138, 2015.
- [33].Chen Y, Chi Y, and Goldsmith AJ, “Exact and stable covariance estimation from quadratic sampling via convex programming,” IEEE Transactions on Information Theory, vol. 61, no. 7, pp. 4034–4059, 2015.
- [34].Adamczak R, Litvak AE, Pajor A, and Tomczak-Jaegermann N, “Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling,” Constructive Approximation, vol. 34, no. 1, pp. 61–88, 2011.
- [35].Candès EJ and Recht B, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, p. 717, 2009.
- [36].Keshavan RH, Montanari A, and Oh S, “Matrix completion from a few entries,” IEEE Transactions on Information Theory, vol. 56, no. 6, pp. 2980–2998, 2010.
- [37].Koltchinskii V, Lounici K, and Tsybakov AB, “Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion,” The Annals of Statistics, pp. 2302–2329, 2011.
- [38].Richard E and Montanari A, “A statistical model for tensor pca,” in Advances in Neural Information Processing Systems 27 (Ghahramani Z, Welling M, Cortes C, Lawrence ND, and Weinberger KQ, eds.), pp. 2897–2905, Curran Associates, Inc., 2014.
- [39].Zhang A and Xia D, “Tensor SVD: Statistical and computational limits,” IEEE Transactions on Information Theory, vol. 64, pp. 7311–7338, November 2018.
- [40].Mu C, Huang B, Wright J, and Goldfarb D, “Square deal: Lower bounds and improved relaxations for tensor recovery,” in Proceedings of the 31st International Conference on Machine Learning (Xing EP and Jebara T, eds.), vol. 32 of Proceedings of Machine Learning Research, (Bejing, China), pp. 73–81, PMLR, 22–24 June 2014.
- [41].Friedland S and Lim L-H, “Nuclear norm of higher-order tensors,” Mathematics of Computation, vol. 87, no. 311, pp. 1255–1281, 2018.
- [42].Hillar CJ and Lim L-H, “Most tensor problems are np-hard,” Journal of the ACM (JACM), vol. 60, no. 6, p. 45, 2013.
- [43].Anandkumar A, Ge R, Hsu D, Kakade SM, and Telgarsky M, “Tensor decompositions for learning latent variable models,” Journal of Machine Learning Research, vol. 15, pp. 2773–2832, 2014.
- [44].Donoho DL, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
- [45].Chandrasekaran V, Sanghavi S, Parrilo PA, and Willsky AS, “Rank-sparsity incoherence for matrix decomposition,” SIAM Journal on Optimization, vol. 21, no. 2, pp. 572–596, 2011.
- [46].Arora S, Ge R, and Moitra A, “New algorithms for learning incoherent and overcomplete dictionaries,” in Proceedings of The 27th Conference on Learning Theory (Balcan MF, Feldman V, and Szepesvári C, eds.), vol. 35 of Proceedings of Machine Learning Research, (Barcelona, Spain), pp. 779–806, PMLR, 13–15 June 2014.
- [47].Zhang Y, Chen X, Zhou D, and Jordan MI, “Spectral methods meet em: A provably optimal algorithm for crowd-sourcing,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 3537–3580, 2016.
- [48].Cai TT, Li X, and Ma Z, “Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow,” The Annals of Statistics, vol. 44, no. 5, pp. 2221–2251, 2016.
- [49].Janzamin M, Sedghi H, and Anandkumar A, “Score function features for discriminative learning: matrix and tensor framework,” arXiv preprint arXiv:1412.2863, 2014.
- [50].Anandkumar A, Ge R, and Janzamin M, “Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates,” arXiv preprint arXiv:1402.5180, 2014.
- [51].Cai C, Li G, Poor HV, and Chen Y, “Nonconvex low-rank symmetric tensor completion from noisy data,” arXiv preprint arXiv:1911.04436, 2019.
- [52].Candès EJ, Li X, and Soltanolkotabi M, “Phase retrieval via wirtinger flow: Theory and algorithms,” IEEE Transactions on Information Theory, vol. 61, pp. 1985–2007, April 2015.
- [53].Hung H, Lin Y-T, Chen P, Wang C-C, Huang S-Y, and Tzeng J-Y, “Detection of gene–gene interactions using multistage sparse and low-rank regression,” Biometrics, vol. 72, no. 1, pp. 85–94, 2016.
- [54].Sidiropoulos ND and Kyrillidis A, “Multi-way compressed sensing for sparse low-rank tensors,” IEEE Signal Processing Letters, vol. 19, no. 11, pp. 757–760, 2012.
- [55].Bubeck S, Foundations and Trends in Machine Learning, ch. Convex Optimization: Algorithms and Complexity, pp. 231–357. 2015.
- [56].Stein C, Diaconis P, Holmes S, Reinert G, et al., “Use of exchangeable pairs in the analysis of simulations,” in Stein’s Method, pp. 1–25, Institute of Mathematical Statistics, 2004.
- [57].Hitczenko P, Montgomery-Smith S, and Oleszkiewicz K, “Moment inequalities for sums of certain independent symmetric random variables,” Studia Math, vol. 123, no. 1, pp. 15–42, 1997.
- [58].Talagrand M, “The supremum of some canonical processes,” American Journal of Mathematics, vol. 116, no. 2, pp. 283–325, 1994.
- [59].Vershynin R, Compressed sensing, ch. Introduction to the non-asymptotic analysis of random matrices, pp. 210–268. Cambridge Univ. Press, 2012.
- [60].Yu B, “Assouad, fano, and le cam,” Festschrift for Lucien Le Cam, vol. 423, p. 435, 1997.
- [61].Ledoux M and Talagrand M, Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
- [62].Tu S, Boczar R, Simchowitz M, Soltanolkotabi M, and Recht B, “Low-rank solutions of linear matrix equations via procrustes flow,” in Proceedings of The 33rd International Conference on Machine Learning (Balcan MF and Weinberger KQ, eds.), vol. 48 of Proceedings of Machine Learning Research, (New York, New York, USA), pp. 964–973, PMLR, 20–22 June 2016.
- [63].Horn RA and Johnson CR, Matrix Analysis. New York: Cambridge Univ. Press, 1988.
- [64].Ledoux M, The concentration of measure phenomenon. No. 89, American Mathematical Soc., 2005.
- [65].Nguyen NH, Drineas P, and Tran TD, “Tensor sparsification via a bound on the spectral norm of random tensors,” Information and Inference: A Journal of the IMA, vol. 4, no. 3, pp. 195–229, 2015.
- [66].Bogucki R, “Suprema of canonical weibull processes,” Statistics & Probability Letters, vol. 107, pp. 253–263, 2015.
- [67].De la Peña V and Giné E, Decoupling: from dependence to independence. Springer Science & Business Media, 2012.
