Abstract
Statistical relational learning is primarily concerned with learning and inferring relationships between entities in large-scale knowledge graphs. Nickel et al. (2011) proposed a RESCAL tensor factorization model for statistical relational learning, which achieves better or at least comparable results on common benchmark data sets when compared to other state-of-the-art methods. Given a positive integer s, RESCAL computes an s-dimensional latent vector for each entity. The latent factors can be further used for solving relational learning tasks, such as collective classification, collective entity resolution and link-based clustering.
The focus of this paper is to determine the number of latent factors in the RESCAL model. Due to the structure of the RESCAL model, its log-likelihood function is not concave. As a result, the corresponding maximum likelihood estimators (MLEs) may not be consistent. Nonetheless, we design a specific pseudometric, prove the consistency of the MLEs under this pseudometric and establish their rate of convergence. Based on these results, we propose a general class of information criteria and prove their model selection consistency when the number of relations is either bounded or diverges at a proper rate with the number of entities. Simulations and real data examples show that our proposed information criteria have good finite sample properties.
Keywords: Information criteria, Knowledge graph, Model selection consistency, RESCAL model, Statistical relational learning, Tensor factorization
1. Introduction
Relational data is becoming ubiquitous in artificial intelligence and social network analysis. These data sets are in the form of graphs, with nodes and edges representing entities and relationships, respectively. Recently, a number of companies have developed and released their knowledge graphs, including the Google Knowledge Graph, Microsoft Bing’s Satori Knowledge Base, Yandex’s Object Answer, the LinkedIn Knowledge Graph, etc. These knowledge graphs are graph-structured knowledge bases that store factual information as relationships between entities. They are created via the automatic extraction of semantic relationships from semi-structured or unstructured text (see Section II.C in Nickel et al., 2016). The data may be incomplete, noisy and contain false information. It is therefore of great importance to infer the existence of a particular relationship in order to improve the quality of the extracted information.
Statistical relational learning is primarily concerned with learning from relational data sets, and solving tasks such as predicting whether two entities are related (link prediction), identifying equivalent entities (entity resolution), and grouping similar entities based on their relationships (link-based clustering). Statistical relational models can be roughly divided into three categories: relational graphical models, latent class models and tensor factorization models. Relational graphical models include probabilistic relational models (Getoor and Mihalkova, 2011) and Markov logic networks (MLN, Richardson and Domingos, 2006). These models are constructed via Bayesian or Markov networks. In latent class models, each entity is assigned to one of the latent classes and the probability of a relationship between entities depends on their corresponding classes. Two important examples include the stochastic block model (SBM, Nowicki and Snijders, 2001) and the infinite relational model (IRM, Kemp et al., 2006). IRM can be viewed as a nonparametric extension of SBM where the total number of clusters is not prespecified. Both models have received considerable attention in the statistics and machine learning literature for community detection in networks.
Tensors are multidimensional arrays. Tensor factorization methods such as CANDECOMP/PARAFAC (CP, Harshman and Lundy, 1994), Tucker (Tucker, 1966) and their extensions have found applications in a variety of fields. Kolda and Bader (2009) presented a thorough overview of tensor decompositions and their applications. Recently, tensor factorizations have been actively studied in the statistics literature and have become an emerging area of statistics. To name a few, Chi and Kolda (2012) developed a Poisson tensor factorization model for sparse count data. Yang and Dunson (2016) proposed a conditional tensor factorization model for high-dimensional classification with categorical predictors. Sun et al. (2017) proposed a sparse tensor decomposition method by incorporating a truncation step into the tensor power iteration step.
Relational data sets are typically expressed as (subject, predicate, object) triples and can be grouped as a third-order tensor. As a result, tensor factorization methods can be naturally applied to these data sets. Nickel (2013) proposed a RESCAL factorization model for statistical relational learning. Compared to other tensor factorization approaches such as CP and Tucker methods, RESCAL is more capable of detecting the correlations produced between multiple interconnected nodes. For relational data consisting of n entities, K types of relations, and a positive integer s, RESCAL computes an n × s factor matrix and an s × s × K core tensor. The factor matrix and the core tensor can be further used for link prediction, entity resolution and link-based clustering. Nickel et al. (2011) showed that a linear RESCAL model achieved better or comparable results on common benchmark data sets when compared to other existing methods such as MLN, DEDICOM (Harshman, 1978), IRM, CP, MRC (Kok and Domingos, 2007), etc. It was shown in Nickel and Tresp (2013) that a logistic RESCAL model could further improve the link prediction results.
Central to the empirical validity of RESCAL is the correct specification of the number of latent factors. Nickel et al. (2011) proposed to select this parameter via cross-validation. As is commonly known for cross-validation, there is no theoretical guarantee against overestimation. Besides, cross-validation can be computationally expensive, especially for large n and K. In the literature, model selection is less studied for tensor factorization methods. Allen (2012) and Sun et al. (2017) proposed to use Bayesian information criteria (BIC, Schwarz, 1978) for sparse CP decomposition. However, no theoretical results were provided for BIC. Indeed, we show in this paper that a BIC-type criterion may fail for the RESCAL model.
The contribution of this paper is twofold. First, we propose a general class of information criteria for the RESCAL model and prove their model selection consistency. Although we focus on the RESCAL model, our information criteria can be extended to select models for general tensor factorization methods with slight modification. The problem is nonstandard and challenging since both the factor matrix and the core tensor are not observed and need to be estimated. Besides, the model parameters are non-identifiable. Moreover, the derivation of model/tuning parameter selection consistency of information criteria usually relies on the (uniform) consistency of estimated parameters. For example, Fan and Tang (2013) derived the uniform consistency of the maximum likelihood estimators (MLEs) to prove the consistency of GIC (see Proposition 2 in that paper). Zhang et al. (2016) established the uniform consistency of the support vector machine solutions to prove the consistency of SVMICh (see Lemma 2 in that paper). The consistency of these estimators is due to the concavity (convexity) of the likelihood (or the empirical loss) functions. In contrast, for most tensor decomposition models including RESCAL, the likelihood (or the empirical loss) function is usually non-concave (non-convex) and may have multiple local solutions. As a result, the corresponding global maximizer (minimizer) may not be consistent even with the identifiability constraints. It remains unknown how to establish the consistency of the information criterion without consistency of the estimator. A key innovation in our analysis is to design a “proper” pseudometric and show that the global optimum is consistent under this specific pseudometric. We further establish the rate of convergence of the global optimum under this pseudometric as a function of n and K. Based on these results, we establish the consistency of our information criteria when K is either bounded or diverges at a proper rate of n. No parametric assumptions are imposed on the latent factors. Second, we introduce a scalable algorithm for estimating the parameters in the logistic RESCAL model. Despite the fact that a linear RESCAL model can be conveniently solved by an alternating least squares algorithm (Nickel et al., 2011), there is a lack of optimization algorithms for solving general RESCAL models. The proposed algorithm is based on the alternating direction method of multipliers (ADMM, Boyd et al., 2011) and can be implemented in a parallelized fashion.
The rest of the paper is organized as follows. We formally introduce the RESCAL model and study the parameter identifiability in Section 2. Our information criteria are presented in Section 3 and their model selection properties are investigated. Numerical examples are presented in Section 4 to examine the finite sample performance of the proposed information criteria. Section 5 concludes with a summary and discussion of future extensions. All the proofs are given in the Appendix.
2. The RESCAL Model
This section is structured as follows. We introduce the RESCAL model in Section 2.1. In Section 2.2, we study the identifiability of parameters in the model.
2.1. Model Setup
In knowledge graphs, facts can be expressed in the form of (subject, predicate, object) triples, where subject and object are entities and predicate is the relation between entities. For example, consider the following sentence from Wikipedia:
Jon Snow is a fictional character in the A Song of Ice and Fire series of fantasy novels by American author George R. R. Martin, and its television adaptation Game of Thrones.
The information contained in this sentence can be summarized into the following set of (subject, predicate, object) triples:
| Subject | Predicate | Object |
|---|---|---|
| Jon Snow | character in | A Song of Ice and Fire |
| Jon Snow | character in | Game of Thrones |
| A Song of Ice and Fire | genre | novel |
| Game of Thrones | genre | television series |
| George R.R. Martin | author of | A Song of Ice and Fire |
| George R.R. Martin | profession | novelist |
In this example, we have a total of 7 entities, 4 types of relations and 6 triples. More generally, let {e1, … , en} denote the set of all entities and {r1, … , rK} denote the set of all relation types. The number of relations K is either bounded or diverges with n. Assuming non-existing triples indicate false relationships, we can construct a third-order binary tensor Y of dimension n × n × K such that Yijk = 1 if the triple (ei, rk, ej) holds and Yijk = 0 otherwise.
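As an illustration, the R sketch below encodes the six example triples as such a binary tensor. It is not part of the paper's software; the entity and relation orderings are arbitrary choices made here for illustration.

```r
# Encode the six example triples as a binary n x n x K tensor Y with
# Y[i, j, k] = 1 iff relation k holds between subject i and object j.
entities  <- c("Jon Snow", "A Song of Ice and Fire", "Game of Thrones",
               "novel", "television series", "George R.R. Martin", "novelist")
relations <- c("character in", "genre", "author of", "profession")
triples <- rbind(
  c("Jon Snow", "character in", "A Song of Ice and Fire"),
  c("Jon Snow", "character in", "Game of Thrones"),
  c("A Song of Ice and Fire", "genre", "novel"),
  c("Game of Thrones", "genre", "television series"),
  c("George R.R. Martin", "author of", "A Song of Ice and Fire"),
  c("George R.R. Martin", "profession", "novelist")
)
n <- length(entities); K <- length(relations)
Y <- array(0L, dim = c(n, n, K),
           dimnames = list(entities, entities, relations))
for (t in seq_len(nrow(triples))) {
  Y[triples[t, 1], triples[t, 3], triples[t, 2]] <- 1L  # subject, object, relation
}
sum(Y)  # 6 observed triples; all non-existing triples are treated as 0
```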
The RESCAL model is defined as follows. For each entity ei, an s0-dimensional latent vector ai,0 is generated. The Yijk’s are assumed to be conditionally independent given all latent factors a1,0, … , an,0. Besides, it is assumed that

Pr(Yijk = 1 | a1,0, … , an,0) = g(ai,0⊤Rk,0aj,0),   (1)

for some strictly monotone link function g and s0 × s0 matrices R1,0, … , RK,0. In the above model, ai,0 corresponds to the latent representation of the ith entity and Rk,0 specifies how these ai,0’s interact for the kth relation. To account for asymmetric relations, we do not restrict the Rk,0’s to symmetric matrices. When the relations are symmetric, i.e., Yijk = Yjik for all i, j and k, one can impose the symmetry constraints Rk,0 = Rk,0⊤ and obtain a similar derivation.
For continuous Yijk, a related tensor factorization model is the TUCKER-2 decomposition, which decomposes the tensor into

Yijk = ai⊤Rkbj + εijk,   (2)

for some s1-dimensional vectors ai, s2-dimensional vectors bj, s1 × s2 matrices Rk and some (random) errors εijk. By Equation 1, RESCAL can be interpreted as a “nonlinear” TUCKER-2 model with the additional constraints that s1 = s2 = s0 and ai = bi for all i.
CP decomposition is another important tensor factorization method that decomposes a tensor into a sum of rank-1 tensors. It assumes that

Yijk = Σs=1,…,s0 ai,sbj,srk,s + εijk,

for some ai = (ai,1, … , ai,s0)⊤, bj = (bj,1, … , bj,s0)⊤ and rk = (rk,1, … , rk,s0)⊤. Define A = (a1, … , an)⊤ and B = (b1, … , bn)⊤. In view of Equation 2, CP is a special TUCKER-2 model with the constraints that s1 = s2 = s0 and Rk = diag(rk,1, … , rk,s0), where diag(rk,1, … , rk,s0) is a diagonal matrix with the sth diagonal element being rk,s.
In this paper, the proposed information criteria are designed in particular for the RESCAL model. However, they can be extended to estimate s0 in a more general tensor factorization framework including CP and TUCKER-2 models. We discuss this further in Section 5.
2.2. Identifiability
The parameterization in Equation 1 is not identifiable. To see this, for any nonsingular s0 × s0 matrix G, we define ãi,0 = Gai,0 and R̃k,0 = (G⊤)⁻¹Rk,0G⁻¹. Observe that

ãi,0⊤R̃k,0ãj,0 = ai,0⊤G⊤(G⊤)⁻¹Rk,0G⁻¹Gaj,0 = ai,0⊤Rk,0aj,0,

and hence we have

g(ãi,0⊤R̃k,0ãj,0) = g(ai,0⊤Rk,0aj,0),   for all i, j and k.

Let A0 = (a1,0, … , an,0)⊤ denote the n × s0 matrix of latent factors. We impose the following condition.
(A0) (i) Assume A0 has full column rank. (ii) Assume the s0 × Ks0 matrix (R1,0, … , RK,0) has full row rank.
(A0)(i) requires the latent factors to be linearly independent. (A0)(ii) holds when at least one of the Rk,0’s has full rank. Under Condition (A0), the following lemma states that the RESCAL model is identifiable up to a nonsingular linear transformation. In Section B.1 of the Appendix, we show (A0) is also necessary to guarantee such identifiability when the Rk,0’s are symmetric.
Lemma 1 (Identifiability).
Assume (A0) holds. Assume there exist some ãi,0’s and R̃k,0’s that also satisfy (A0), and

g(ãi,0⊤R̃k,0ãj,0) = g(ai,0⊤Rk,0aj,0),   for all i, j and k.

Then, there exists some invertible s0 × s0 matrix G such that ãi,0 = Gai,0 and R̃k,0 = (G⊤)⁻¹Rk,0G⁻¹ for all i and k.
To fix the nonsingular transformation indeterminacy, we adopt a specific constrained parameterization and focus on estimating a*i,0 and R*k,0, where

a*i,0 = (A0(1)⊤)⁻¹ai,0   and   R*k,0 = A0(1)Rk,0A0(1)⊤,

where A0(1) denotes the s0 × s0 submatrix formed by the first s0 rows of A0. Observe that

(a*1,0, … , a*s0,0)⊤ = A0(1)A0(1)⁻¹ = Is0,

where Is0 stands for an s0 × s0 identity matrix. Therefore, the first s0 a*i,0’s are fixed as long as A0(1) is nonsingular. By Lemma 1, the parameters a*i,0 and R*k,0 are estimable.
From now on, we only consider the logistic link function for simplicity, i.e., g(x) = 1/{1 + exp(−x)}. Results for other link functions can be similarly discussed.
3. Model Selection
Parameters a*i,0 and R*k,0 can be estimated by maximizing the (conditional) log-likelihood function. Since we use the logistic link function, the log-likelihood of (A, R1, … , RK) is equal to

Σi,j,k log Pr(Yijk | a1, … , an, R1, … , RK) = Σi,j,k [ Yijk ai⊤Rkaj − log{1 + exp(ai⊤Rkaj)} ],

where the first equality is due to the conditional independence assumption.
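For concreteness, the following R sketch evaluates this log-likelihood directly from the expression above. The function name rescal_loglik and the representation of the relational matrices as a list are our own illustrative choices, not notation from the paper.

```r
# Logistic RESCAL log-likelihood: theta_ijk = a_i' R_k a_j and
# log Pr(Y_ijk | .) = Y_ijk * theta_ijk - log(1 + exp(theta_ijk)).
# A is an n x s matrix whose ith row is a_i; R_list is a list of K s x s matrices.
rescal_loglik <- function(Y, A, R_list) {
  ll <- 0
  for (k in seq_along(R_list)) {
    Theta <- A %*% R_list[[k]] %*% t(A)     # n x n matrix of a_i' R_k a_j
    ll <- ll + sum(Y[, , k] * Theta - log1p(exp(Theta)))
  }
  ll
}
```

A numerically safer evaluation of log(1 + exp(θ)) can be substituted for large |θ|, but the simple form above suffices to convey the structure.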
We assume the number of latent factors s0 is fixed. For any 1 ≤ s ≤ smax, where smax is allowed to diverge with n and satisfies smax ≥ s0, we define the following constrained maximum likelihood estimator
| (3) |
| (4) |
for some ωa, ωr > 0, where the vec(·) operator stacks the entries of a matrix into a column vector. To estimate the number of latent factors, we define a class of likelihood-based information criteria IC(s), 1 ≤ s ≤ smax, each obtained by penalizing the maximized constrained log-likelihood of the s-factor model with a model complexity term built from penalty functions κ(·, ·). The estimated number of latent factors is given by

ŝ = argmin1≤s≤smax IC(s).   (5)
In addition to the constraint in Equation 4, there exist many other constraints that would make the estimators identifiable. The choice of the identifiability constraints might affect the value of IC. However, it would not affect the value of ŝ. Detailed discussions can be found in Section A of the Appendix.
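The selection rule in Equation 5 can be sketched as the simple loop below, written in R. Here fit_rescal and penalty are hypothetical placeholders (a fitting routine returning the constrained MLEs and maximized log-likelihood, and a user-supplied model complexity penalty), and the generic penalized-likelihood form is used only as a stand-in for the exact criteria defined in this section.

```r
# Schematic of Equation 5: fit the model for each candidate s and pick the
# minimizer of the information criterion. fit_rescal(Y, s) and penalty(n, K, s)
# are assumed, hypothetical interfaces.
select_s <- function(Y, s_max, fit_rescal, penalty) {
  n <- dim(Y)[1]; K <- dim(Y)[3]
  ic <- sapply(seq_len(s_max), function(s) {
    fit <- fit_rescal(Y, s)              # assumed to return fit$loglik, fit$A, fit$R_list
    -2 * fit$loglik + penalty(n, K, s)   # generic penalized-likelihood form
  })
  which.min(ic)                          # estimated number of latent factors
}
```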
A major technical difficulty in establishing the consistency of IC is due to the nonconcavity of the objective function given in Equation 3. For any , let
be the set of parameters.
For any , we define
With some calculations, we can show that
where . Here, I1 is nonnegative. However, I2 can be negative for some β and ζ. Therefore, the negative Hessian matrix is not positive semidefinite and the likelihood function is not concave. As a result, and may not be consistent to and , even with the identifiability constraints in Equation 4. Here, the presence of I2 is due to the bilinear formulation of the RESCAL model.
Let θijk = ai⊤Rkaj. Notice that the log-likelihood is concave in θijk, ∀i, j, k. This motivates us to consider a pseudometric d(·, ·) defined on the scale of the θijk’s,
for any integers s1, s2 > 0 and any pairs of latent factor matrices and relational matrices of the corresponding dimensions. Apparently, d(·, ·) is nonnegative, symmetric and satisfies the triangle inequality. Below, we establish the convergence rate of the constrained MLEs under this pseudometric.
We first introduce some notation. For any s > s0, we define
and
where 0q denotes a q-dimensional zero vector and Op,q is an p × q zero matrix. With a slight abuse of notation, we write and . Clearly, for any s ≥ s0, we have
and hence
Let
where . When is invertible, As,0’s are invertible for all s > s0. The defined ’s satisfy the identifiability constraints in Equation 4 for all s ≥ s0. We make the following assumption.
(A1) Assume and , and s0 ≤ s ≤ smax. In addition, assume , for some .
Lemma 2.
Assume (A1) holds, . Then there exists some constant C0 > 0 such that the following event occurs with probability tending to 1,
Under the condition , we have that
When ωa and ωr are bounded, it follows that
Hence, the constrained MLEs are consistent under the pseudometric d for all overfitted models. In contrast, for underfitted models, we require the following conditions.
(A2) Assume there exists some constant such that .
(A3) Let . Assume .
Lemma 3.
Assume (A2) and (A3) hold. Then for any 1 ≤ s < s0, we have
where and are defined in (A2) and (A3), respectively.
Assumption (A3) holds if there exists some such that
When for some constant c′ > 0, it follows from Lemma 3 that
Based on these results, we establish the consistency of defined in Equation 5 below. For any sequences {an} and {bn}, we write an ~ bn if there exist some universal constants c1, c2 > 0 such that c1an ≤ bn ≤ c2an.
Theorem 1.
Assume (A1)–(A3) hold. Assume κ(n, K) satisfies
| (6) |
Then, we have Pr(ŝ = s0) → 1, where ŝ is defined in Equation 5.
Let . When are bounded, it follows from Theorem 1 that IC is consistent provided that and . Define
for some α ≥ 0. Note that
| (7) |
Consider the following criteria:
| (8) |
Note that the term τα(n, K) satisfies the conditions required in Theorem 1. It follows from Equation 7 and Theorem 1 that ICα is consistent for all α ≥ 0. When α > 0, the term τα(n, K) adjusts the model complexity penalty upwards. We notice that Bai and Ng (2002) used a similar finite sample correction term in their proposed information criteria for approximate factor models. Our simulation studies show that such an adjustment is essential to achieve selection consistency for large K.
Conditions (A1) and (A2) are directly imposed on the realizations of the latent factors. In Sections B.2 and B.3, we consider an asymptotic framework where the latent factors are i.i.d. according to some distribution function and show (A1) and (A2) hold with probability tending to 1. Therefore, under this framework, the same conclusions continue to hold, and the consistency of our information criterion remains unchanged.
Observe that we have a total of n × n × K = n2K observations. Consider the following BIC-type criterion:
| (9) |
The model complexity penalty in BIC satisfies
Hence, it does not meet Condition (6) in Theorem 1. As a result, BIC may fail to identify the true model. As shown in our simulation studies, BIC will choose overfitted models and is not selection consistent.
4. Numerical Experiments
This section is organized as follows. In Section 4.1, we introduce our algorithm for computing the maximum likelihood estimators of a logistic RESCAL model. Simulation studies are presented in Section 4.2. In Section 4.3, we apply the proposed information criteria to a real dataset.
4.1. Implementation
In this section, we propose an algorithm for computing Âs and the R̂k,s’s. The algorithm is based upon a 3-block alternating direction method of multipliers (ADMM). After introducing auxiliary splitting variables, the estimators are defined by
| (10) |
where
For any , define
Fix , the optimization problem in Equation 10 is equivalent to
We then derive its augmented Lagrangian, which gives us
where ρ > 0 is a penalty parameter.
Applying the dual descent method yields the following steps, where l denotes the iteration number:
| (11) |
| (12) |
| (13) |
Let us examine Equations 11–13 in more detail. In Equation 11, we rewrite the objective function as
Note that Lρ can be represented as a separable sum of functions. As a result, the ai’s can be solved in parallel. More specifically, we have
Hence, each ai can be computed by solving a ridge-type logistic regression with the corresponding responses and covariates.
In Equation 12, each Rk can be independently updated by solving a logistic regression with the corresponding responses and covariates, i.e.,
where ⊗ denotes the Kronecker product.
Similar to Equation 11, each latent vector in Equation 13 can be independently computed by solving a ridge-type regression with the corresponding responses and covariates.
Using arguments similar to those in Theorem 2 of Wang et al. (2017), we can show that the proposed 3-block ADMM algorithm converges for any sufficiently large ρ. In our implementation, we set ρ = nK/2. To reduce the risk of converging to a poor local solution, we randomly generate multiple initial estimators and solve the optimization problem multiple times based on these initial values.
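For readers who only need a rough reference implementation, the sketch below fits the logistic RESCAL model by plain joint gradient ascent on the log-likelihood. It is not the 3-block ADMM algorithm described above (no variable splitting or dual updates), it ignores the identifiability constraint in Equation 4, and the initialization, step size and iteration count are arbitrary illustrative choices.

```r
# Illustrative (non-ADMM) fitting sketch: gradient ascent on the logistic
# RESCAL log-likelihood. With E_k = Y_k - sigmoid(A R_k A'), the gradients
# follow from theta_ijk = a_i' R_k a_j:
#   d/dA   = sum_k ( E_k A R_k' + E_k' A R_k ),   d/dR_k = A' E_k A.
fit_rescal_gd <- function(Y, s, n_iter = 500, step = 1e-3) {
  n <- dim(Y)[1]; K <- dim(Y)[3]
  A <- matrix(rnorm(n * s, sd = 0.1), n, s)
  R_list <- replicate(K, matrix(rnorm(s * s, sd = 0.1), s, s), simplify = FALSE)
  sigmoid <- function(x) 1 / (1 + exp(-x))
  for (it in seq_len(n_iter)) {
    grad_A <- matrix(0, n, s)
    for (k in seq_len(K)) {
      E_k <- Y[, , k] - sigmoid(A %*% R_list[[k]] %*% t(A))   # residual matrix
      grad_A <- grad_A + E_k %*% A %*% t(R_list[[k]]) + t(E_k) %*% A %*% R_list[[k]]
      R_list[[k]] <- R_list[[k]] + step * t(A) %*% E_k %*% A  # update R_k
    }
    A <- A + step * grad_A                                    # update A
  }
  list(A = A, R_list = R_list)
}
```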
4.2. Simulations
We simulate from the following model:
where N(0, 1) stands for a standard normal random variable and diag(v1, … , vq) denotes a q × q diagonal matrix with the jth diagonal element equal to vj.
We consider six simulation settings. In the first three settings, we fix K = 3 and set n = 100, 150 and 200, respectively. In the last three settings, we increase K to 10, 20, 50, and set n = 50. In each setting, we further consider three scenarios, by setting s0 = 2, 4 and 8. Let smax = 12. The ADMM algorithm proposed in Section 4.1 is implemented in R. Some subroutines of the algorithm are written in C with the GNU Scientific Library (GSL, Galassi et al., 2015) to facilitate the computation. We compare the proposed ICα (see Equation 8) with the BIC-type criterion (see Equation 9). In ICα, we set α = 0, 0.5 and 1. Note that when α = 0, the finite sample correction term τα(n, K) leaves the penalty unadjusted.
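As a generic illustration of the simulation mechanism (not the exact design used in the settings above), the following R sketch generates a logistic RESCAL tensor from latent factors with i.i.d. standard normal entries, which is a placeholder choice.

```r
# Generic simulation sketch for a logistic RESCAL tensor; the N(0,1) latent
# factors below are placeholder choices, not the paper's simulation design.
simulate_rescal <- function(n, K, s0, seed = 1) {
  set.seed(seed)
  A0 <- matrix(rnorm(n * s0), n, s0)
  R0 <- replicate(K, matrix(rnorm(s0 * s0), s0, s0), simplify = FALSE)
  Y  <- array(0L, dim = c(n, n, K))
  for (k in seq_len(K)) {
    prob <- 1 / (1 + exp(-A0 %*% R0[[k]] %*% t(A0)))   # Pr(Y_ijk = 1)
    Y[, , k] <- matrix(rbinom(n * n, 1, prob), n, n)
  }
  list(Y = Y, A0 = A0, R0 = R0)
}
dat <- simulate_rescal(n = 100, K = 3, s0 = 2)
```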
Reported in Tables 1 and 2 are the percentage of replications in which the true model is selected (TP) and the average of the selected ŝ for IC0, IC0.5, IC1 and BIC over 100 replications.
Table 1:
Simulation results for Setting I, II and III (standard errors in parenthesis)
| s0 = 2 | s0 = 4 | s0 = 8 | ||||
|---|---|---|---|---|---|---|
| n = 100, K = 3 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.97 (0.02) | 2.03 (0.02) | 0.97 (0.02) | 4.03 (0.02) | 0.90(0.03) | 7.90 (0.03) |
| IC0.5 | 0.97 (0.02) | 2.03 (0.02) | 0.98 (0.01) | 4.02 (0.01) | 0.90(0.03) | 7.90 (0.03) |
| IC1 | 0.97 (0.02) | 2.03 (0.02) | 0.98 (0.01) | 4.02 (0.01) | 0.89(0.03) | 7.89 (0.03) |
| BIC | 0.00 (0.00) | 11.99 (0.01) | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 11.99 (0.01) |
|
| ||||||
| n = 150, K = 3 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.99 (0.01) | 2.01 (0.01) | 0.97 (0.02) | 4.03 (0.02) | 0.96(0.02) | 8.04 (0.02) |
| IC0.5 | 0.99 (0.01) | 2.01 (0.01) | 0.97 (0.02) | 4.03 (0.02) | 0.96(0.02) | 8.04 (0.02) |
| IC1 | 0.99 (0.01) | 2.01 (0.01) | 0.97 (0.02) | 4.03 (0.02) | 0.96(0.02) | 8.04 (0.02) |
| BIC | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 11.98 (0.01) |
|
| ||||||
| n = 200, K = 3 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.99 (0.01) | 2.01 (0.01) | 0.95 (0.02) | 4.05 (0.02) | 0.95(0.02) | 8.05 (0.02) |
| IC0.5 | 0.99 (0.01) | 2.01 (0.01) | 0.95 (0.02) | 4.05 (0.02) | 0.95(0.02) | 8.05 (0.02) |
| IC1 | 0.99 (0.01) | 2.01 (0.01) | 0.95 (0.02) | 4.05 (0.02) | 0.95(0.02) | 8.05 (0.02) |
| BIC | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 11.99 (0.01) | 0.00 (0.00) | 11.98 (0.01) |
Table 2:
Simulation results for Setting IV, V and VI (standard errors in parenthesis)
| s0 = 2 | s0 = 4 | s0 = 8 | ||||
|---|---|---|---|---|---|---|
| n = 50, K = 10 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 1.00 (0.00) | 2.00 (0.00) | 0.97 (0.02) | 4.03 (0.02) | 0.69(0.05) | 7.91 (0.06) |
| IC0.5 | 1.00 (0.00) | 2.00 (0.00) | 0.97 (0.02) | 4.03 (0.02) | 0.66(0.05) | 7.75 (0.06) |
| IC1 | 1.00 (0.00) | 2.00 (0.00) | 0.98 (0.01) | 4.02 (0.01) | 0.60(0.05) | 7.62 (0.06) |
| BIC | 0.00 (0.00) | 11.81 (0.06) | 0.00 (0.00) | 11.60 (0.06) | 0.01 (0.01) | 11.67 (0.07) |
|
| ||||||
| n = 50, K = 20 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.97 (0.02) | 2.03 (0.02) | 0.95 (0.02) | 4.05 (0.02) | 0.73(0.04) | 8.46 (0.10) |
| IC0.5 | 0.97 (0.02) | 2.03 (0.02) | 0.98 (0.01) | 4.02 (0.01) | 0.87(0.03) | 8.09 (0.03) |
| IC1 | 0.98 (0.01) | 2.02 (0.02) | 1.00 (0.00) | 4.00 (0.00) | 0.79(0.04) | 7.99 (0.05) |
| BIC | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 11.92 (0.03) | 0.00 (0.00) | 11.99 (0.01) |
|
| ||||||
| n = 50, K = 50 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.98 (0.01) | 2.02 (0.01) | 0.93 (0.03) | 4.07 (0.03) | 0.17(0.04) | 11.24 (0.15) |
| IC0.5 | 0.99 (0.01) | 2.01 (0.01) | 0.97 (0.02) | 4.03 (0.02) | 0.76(0.04) | 8.24 (0.05) |
| IC1 | 1.00 (0.00) | 2.00 (0.00) | 0.98 (0.01) | 4.02 (0.01) | 0.79(0.04) | 7.99 (0.05) |
| BIC | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 11.99 (0.01) |
It can be seen from Tables 1 and 2 that BIC fails in all settings: it always selects overfitted models. In contrast, the proposed information criteria select the true model in most of the settings. For example, under settings where s0 = 2 or 4, TPs of IC0, IC0.5 and IC1 are larger than or equal to 93%. When s0 = 8, except for the last setting, TPs of the proposed information criteria are no less than 60% for all cases.
IC0, IC0.5 and IC1 perform very similarly for small K. In the first three settings, TPs of these three information criteria are nearly the same for all cases. However, IC0.5 and IC1 are more robust than IC0 for large K. This can be seen in the last scenario of Setting VI, where the TP of IC0 is no more than 20%. Besides, in the last two settings, the TP of IC0 is no larger than those of IC0.5 and IC1 for all cases. These differences are due to the finite sample correction term τα(n, K). As commented before, the correction terms in IC0.5 and IC1 increase the model complexity penalty to avoid overfitting for large K.
In Section D of the Appendix, we examine the performance of our proposed information criteria under an additional simulation scenario. Results are similar to those presented in Tables 1 and 2.
4.3. Real Data Experiments
In this section, we apply the proposed information criteria to the “Social Evolution” dataset (Madan et al., 2012). This dataset comes from MIT’s Human Dynamics Laboratory. It tracks the everyday life of a whole undergraduate MIT dormitory from October 2008 to May 2009. We use the survey data, resulting in n = 84 participants and K = 5 binary relations. The five relations are: close relationship, political discussion, social interaction and two types of social media interaction.
We compute the constrained MLEs for 1 ≤ s ≤ 12 and select the number of latent factors using the proposed information criteria and BIC. It turns out that IC0, IC0.5 and IC1 all suggest the presence of 9 factors. In contrast, BIC selects 12 factors. To further evaluate the number of latent factors selected by the proposed information criteria, we consider the following cross-validation procedure. For any 1 ≤ s ≤ 12, we randomly select 80% of the observations and estimate the latent factors and relational matrices by maximizing the observed likelihood function based on these training samples. Then we compute the predicted probabilities g(âi⊤R̂kâj) for the held-out entries.
Based on these predicted probabilities, we calculate the area under the precision-recall curve (AUC) on the remaining 20% testing samples.
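For completeness, a minimal R sketch of this evaluation step is given below. It computes the area under the precision-recall curve from held-out labels and predicted probabilities using a standard average-precision-style sum, which is one of several equivalent ways to carry out this computation.

```r
# Area under the precision-recall curve for held-out entries: sort by
# predicted probability and accumulate precision over recall increments.
pr_auc <- function(y_true, y_prob) {
  ord <- order(y_prob, decreasing = TRUE)
  y   <- y_true[ord]
  tp  <- cumsum(y)                      # true positives at each threshold
  fp  <- cumsum(1 - y)                  # false positives at each threshold
  precision <- tp / (tp + fp)
  recall    <- tp / sum(y)
  sum(diff(c(0, recall)) * precision)   # average-precision approximation
}
```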
Reported in Table 3 are the AUC scores averaged over 100 replications. For any s, we denote by AUCs the corresponding AUC score. It can be seen from Table 3 that AUCs first increases and then decreases as s increases. The maximum AUC score is achieved at s = 10. Observe that AUC9 is very close to AUC10, and it is larger than the remaining AUC scores. This demonstrates that the proposed information criteria select fewer latent factors while achieving better or similar link prediction results when compared to BIC.
Table 3:
AUC scores
| s | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUC | 0.7201 | 0.8341 | 0.8952 | 0.9095 | 0.9257 | 0.9364 | 0.9444 | 0.9486 | 0.9513 | 0.9518 | 0.9485 | 0.9467 |
5. Discussion
In this paper, we propose information criteria for selecting the number of latent factors in the RESCAL tensor factorization model and prove their model selection consistency. Although we focus on the logistic RESCAL model, the proposed information criteria can be applied to general tensor factorization models. More specifically, consider the following class of models:
| (14) |
with any of (or without) the following constraints:
(C1) Rk is diagonal for each 1 ≤ k ≤ K;
(C2) ai = bi for each 1 ≤ i ≤ n,
for some strictly increasing function g, and some mean zero random errors .
As commented in Section 2.1, such representation includes the RESCAL, CP and TUCKER-2 models. Specifically, it reduces to the TUCKER-2 model by setting g to be the identity function. If further (C1) holds, then the model in Equation 14 reduces to the CP model. When (C2) holds, it corresponds to the RESCAL model. Consider the following information criteria,
where the likelihood function is evaluated at the corresponding (constrained) MLEs. Similar to Theorem 1, we can show that with some properly chosen κ(n, K), IC is consistent under this general setting.
Currently, we assume the tensor Y is completely observed. When some of the Yijk’s are missing, we can calculate the constrained MLEs by maximizing the following observed likelihood function
where Nobs denotes the set of the observed responses. The above optimization problem can also be solved by a 3-block ADMM algorithm. Define the following class of information criteria,
where denotes the percentage of observed responses. Consistency of ICobs can be similarly studied.
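As a minimal illustration of the observed-likelihood idea (assuming missing responses are coded as NA), the sketch below evaluates the log-likelihood only over the cells in Nobs; it is not the paper's ADMM implementation.

```r
# Observed-data log-likelihood: missing entries of Y are coded as NA and
# dropped from the sum, so only the observed cells contribute.
rescal_loglik_obs <- function(Y, A, R_list) {
  ll <- 0
  for (k in seq_along(R_list)) {
    Theta <- A %*% R_list[[k]] %*% t(A)
    obs   <- !is.na(Y[, , k])
    ll <- ll + sum(Y[, , k][obs] * Theta[obs] - log1p(exp(Theta[obs])))
  }
  ll
}
```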
Acknowledgments
The authors wish to thank the Associate Editor and anonymous referees for their constructive comments, which led to significant improvement of this work.
Appendix A. More on the Identifiability Constraint
Let π(·) be a permutation function of {1, … , n}. As an alternative to our estimator defined in Equation 3, one may consider
and the corresponding information criteria
Since for any invertible matrix is also the maximizer of subject to the constraint that is invertible. Similarly, the estimator is the maximizer of subject to the constraint that is invertible. As a result, we have IC(s) = ICπ(s) as long as
| (15) |
However, it remains unknown whether Equation 15 holds or not. Hence, there is no guarantee that IC(s) = ICπ(s). This means the choice of the identifiability constraint might affect the value of our proposed information criterion.
In the following, we prove
| (16) |
This means with probability tending to 1, all the information criteria with different identifiability constraints will select the true model. Therefore, the choice of the identifiability constraint will not affect the performance of our method. For any , let ,
We need the following condition.
(A4) Assume and and any permutation function π(·). In addition, assume , for some .
Corollary 1.
Assume (A2)–(A4) hold and κ(n, K) satisfies Condition (6). Then, (16) is satisfied.
Appendix B. More on the Technical Conditions
B.1. Discussion of Condition (A0)
In this section, we show the necessity of (A0) when the matrices are symmetric. More specifically, when (A0) doesn’t hold, we show there exist some and such that
| (17) |
Let’s first consider the case where rank(A0) = s for some s < s0. Thus, it follows that
for some and . Set to be the ith row of and , Equation 17 is thus satisfied. In addition, the new matrix shall have full column rank.
Let . Consider the case where rank(R0) = s for some s < s0. It follows from the singular value decomposition that
| (18) |
for some diagonal matrix and some matrices , that satisfy . Denote by Uk,0 the submatrix of U0 formed by the corresponding rows and by the columns in {1, 2, … , s}. It follows from Equation 18 that
Since Rk,0 is symmetric, we have . Notice that . It follows that . Therefore, we have
| (19) |
Define and . In view of Equation 19, it is immediate to see that Equation 17 holds. Since , we have . As a result, are also symmetric. Suppose doesn’t have full column rank. Using the same arguments, we can find some such that
| (20) |
We may repeat this procedure until we find some , ’s satisfy Equation 20 and that the matrix has full column rank.
B.2. Discussion of Condition (A1)
In this section, we consider an asymptotic framework where are i.i.d according to some distribution function and show (A1) holds with probability tending to 1. For any q-dimensional vector , let denote its Euclidean norm. For any m × q matrix Q, stands for the spectral norm of Q while denotes its Frobenius norm. For simplicity, we assume and . Assume is bounded with probability 1. In addition, assume there exist some constants such that
| (21) |
When s0 = 1, the above condition is closely related to the margin assumption (Tsybakov, 2004; Audibert and Tsybakov, 2007) in the classification literature. It automatically holds when a1,0 has a bounded probability density function.
In the following, we show with proper choice of ωa and ωr, (A1) holds with probability tending to 1. By the identifiability constraints, and . When , we have for sufficiently large n,
| (22) |
By the definition of As,0, we have
| (23) |
where . It follows that
Under the given conditions, we have with probability 1 for some constant ω0 > 0. Therefore, we have
By (21), it is immediate to see that
as long as . This together with Equation 22 yields
Combining Equation 23 with the definition of yields
Since we have
and
By (21), we have with probability tending to 1, for any ωr such that .
B.3. Discussion of Condition (A2)
Assume the matrix is positive definite. Since s0 is fixed, it follows from the law of large numbers that
Therefore, (A2) holds with probability tending to 1.
Appendix C. Proofs
In the following, we provide proofs of Lemma 1, Lemma 2 and Theorem 1. We define
for any , and any integer s ≥ 1. Define and
C.1. Proof of Lemma 1
Assume there exist some such that
Since g(·) is strictly monotone, we have
or equivalently,
where . Thus, it follows that
By (A0), the matrix is invertible. As a result, we have
Therefore,
Notice that the matrix is invertible under Condition (A0). It follows that
By Lemma 5.1 in Banerjee and Roy (2014), we have rank(C) ≥ rank(A0) = s0. Therefore, C is invertible. It follows that
| (24) |
or equivalently,
By Equation 24, we obtain and hence
Since is invertible, this further implies or equivalently,
The proof is hence completed.
C.2. Proof of Lemma 2
To prove Lemma 2, we need the following lemma.
Lemma 4 (Mendelson et al. (2008), Lemma 2.3).
Given d ≥ 1, and ε > 0, we have
where is the unit ball in , and the covering number with respect to the Euclidean metric (see Definition 2.2.3 in van der Vaart and Wellner (1996) for details).
Under Condition (A1), we have and hence
| (25) |
Besides, we have
| (26) |
and
| (27) |
Therefore,
| (28) |
Similarly, we can show
| (29) |
We define . It follows from a second-order Taylor expansion that
| (30) |
for some lying on the line segment joining and . By (28) and (29), we have for any i, j, k and . This together with Equation 30 gives that
and hence
| (31) |
In the following, we provide an upper bound for
where .
Let and be a minimal εa-net of the vector space . It follows from Lemma 4 that
| (32) |
Let , and be a minimal εr-net of the vector space . For any s × s matrices Q, we have . Similar to (32), we can show that
| (33) |
Hence, for any satisfying there exist some , such that
| (34) |
This further implies
| (35) |
Therefore, we have
| (36) |
By Bernstein’s inequality (van der Vaart and Wellner, 1996, Lemma 2.2.9), we obtain for any t > 0,
| (37) |
where
With some calculations, we can show that
| (38) |
and
| (39) |
Let , we have
| (40) |
where the first inequality is due to Bonferroni’s inequality, the second inequality follows by (32) and (33), the third inequality is due to (37)-(39).
Under the given conditions, we have and hence
It follows that for any ,
By Bonferroni’s inequality, we have
This together with (36) implies that
| (41) |
with probability tending to 1. Combining this with (26) and (27), we obtain with probability tending to 1,
| (42) |
Therefore, it follows from (31) that
| (43) |
where the third inequality is due to that , for all .
Let . As , we have
Since , it follows from (43) that
| (44) |
For any integer m ≥ 1, define
For any , similar to (31), we can show that
| (45) |
The event implies that
It follows from (45) that
| (46) |
For any {li}i and {tk}k satisfying (34), it follows from (35) that
| (47) |
Let . Similar to (36), we can show
Under the event defined in (46), for any m ≥ 9, we have
Define
By (47), it is immediate to see that
| (48) |
Similar to (37) and (40), we can show there exist some constants J0 > 0, K0 > 0 such that for any m ≥ J0 and any s such that s0 ≤ s ≤ smax,
where the second inequality is due to (38), (48) and the fact that This yields
| (49) |
Let Jn,K be the integer such that In view of (44), we have for any m ≥ J0,
This implies there exists some constant C0 > 0 such that the following event occurs with probability tending to 1,
| (50) |
The proof is hence completed.
C.3. Proof of Lemma 3
For any s < s0, we have
| (51) |
Define
The above minimizers are not unique. Notice that rank . Assume has the following singular value decomposition,
for some and some diagonal matrix
such that Then one solution is given by
where is the submatrix of Un formed by its first s columns.
Since we have
and hence
This together with (51) implies that
| (52) |
| (53) |
where (52) is due to that and the equality in (53) is due to that
To summarize, we’ve shown
| (54) |
In the following, we provide a lower bound for By definition, is the s0-th largest eigenvalue of
We first provide a lower bound. Consider the following eigenvalue decomposition:
for some orthogonal matrix UA and some diagonal matrix Under Assumption (A2), the matrix
is positive semidefinite. As a result, the matrix
is positive semidefinite. Therefore, we have
| (55) |
By the eigenvalue decomposition, we have
for some orthogonal matrix and some diagonal matrix . It follows from (55) that all the diagonal elements in are positive. Let be the diagonal matrix such that Apparently, the diagonal elements in are nonzero. Notice that
The s0 largest eigenvalues in corresponds to the smallest eigenvalue in
Similar to (55), we can show that
Combining this together with (55), we obtain that
It follows from (54) that
This completes the proof.
C.4. Proof of Theorem 1
It suffices to show
| (56) |
and
| (57) |
We first show (56). Combining Lemma 3 with (31), we obtain that
for any Combining this with (42), we have that
| (58) |
with probability tending to 1.
Under the given conditions, we have . This together with the condition yields
| (59) |
By (58), we have with probability tending to 1 that
| (60) |
for all 1 ≤ s < s0. By definition, we have
This together with (60) gives that for all 1 ≤ s < s0,
| (61) |
with probability tending to 1.
Under the given conditions, we have Under the event defined in (61), we have that
since s0 is fixed. This proves (56).
Now we show (57). Similar to (37)-(42), we can show the following event occurs with probability tending to 1,
| (62) |
By Lemma 2, we obtain with probability tending to 1,
and hence
| (63) |
Since under the event defined in (63), we have
Notice that
we have with probability tending to 1 that
| (64) |
for all Under the event defined in (64), we have
Under the condition we have that
This proves (57). The proof is hence completed.
C.5. Proof of Corollary 1
Using similar arguments in Lemma 3, we can show for any permutation function π :
| (65) |
In addition, it follows from (41) and the condition that
and
for some constant C0 > 0. Hence, the following event occurs with probability tending to 1,
| (66) |
Combining (65) with (31) yields
| (67) |
Under the given conditions, we have
This together with (66) and (67) gives
with probability tending to 1.
Under Condition (A4), we have Therefore, the following event occurs with probability tending to 1,
Under the given conditions in Corollary 1, we obtain
| (68) |
Similar to (46) and (49), we can show
for some constant K0 > 0. Using similar arguments in the proof of Lemma 2, this yields with probability tending to 1,
This together with (62) gives
with probability tending to 1. Using similar arguments in the proof of Theorem 1, we can show
This together with (68) yields (16). The proof is hence completed.
Appendix D. Additional Simulation Results
We simulate the response from the following model:
We use the same six settings as described in Section 4.2. In each setting, we further consider three scenarios, by setting s0 = 2, 4 and 6. Reported in Tables 4 and 5 are the percentage of selecting the true model (TP) and the average of the selected ŝ for IC0, IC0.5, IC1 and BIC over 100 replications.
Table 4:
Simulation results for Setting I, II and III (standard errors in parenthesis)
| s0 = 2 | s0 = 4 | s0 = 6 | ||||
|---|---|---|---|---|---|---|
| n = 100, K = 3 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 1.00(0.00) | 2.00(0.00) | 0.96(0.02) | 3.98(0.02) | 0.88(0.03) | 5.87(0.04) |
| IC0.5 | 1.00(0.00) | 2.00(0.00) | 0.96(0.02) | 3.98(0.02) | 0.88(0.03) | 5.87(0.04) |
| IC1 | 1.00(0.00) | 2.00(0.00) | 0.96(0.02) | 3.98(0.02) | 0.85(0.04) | 5.81(0.05) |
| BIC | 0.00(0.00) | 11.98(0.01) | 0.00(0.00) | 11.99(0.01) | 0.00(0.00) | 12.00(0.00) |
|
| ||||||
| n = 150, K = 3 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 1.00(0.00) | 2.00(0.00) | 0.97(0.02) | 4.03(0.02) | 0.94(0.02) | 6.04(0.02) |
| IC0.5 | 1.00(0.00) | 2.00(0.00) | 0.97(0.02) | 4.03(0.02) | 0.94(0.02) | 6.04(0.02) |
| IC1 | 1.00(0.00) | 2.00(0.00) | 0.97(0.02) | 4.03(0.02) | 0.94(0.02) | 6.04(0.02) |
| BIC | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 11.99(0.01) |
|
| ||||||
| n = 200, K = 3 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 1.00(0.00) | 2.00(0.00) | 0.97(0.02) | 4.03(0.02) | 0.98(0.01) | 6.02(0.01) |
| IC0.5 | 1.00(0.00) | 2.00(0.00) | 0.97(0.02) | 4.03(0.02) | 0.98(0.01) | 6.02(0.01) |
| IC1 | 1.00(0.00) | 2.00(0.00) | 0.97(0.02) | 4.03(0.02) | 0.98(0.01) | 6.02(0.01) |
| BIC | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 11.99(0.01) |
It can be seen from Tables 4 and 5 that our proposed information criteria select the true model in most settings. In contrast, BIC fails in all settings. In addition, in the last two settings, TPs of IC0.5 and IC1 are larger than those of IC0 for most of the cases. As commented before, these differences are due to the finite sample correction term τα(n, K).
Table 5:
Simulation results for Setting IV, V and VI (standard errors in parenthesis)
| s0 = 2 | s0 = 4 | s0 = 6 | ||||
|---|---|---|---|---|---|---|
| n = 50, K = 10 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 1.00(0.00) | 2.00(0.00) | 0.96(0.02) | 3.98(0.02) | 0.73(0.04) | 5.83(0.06) |
| IC0.5 | 1.00(0.00) | 2.00(0.00) | 0.95(0.02) | 3.97(0.02) | 0.69(0.05) | 5.77(0.06) |
| IC1 | 1.00(0.00) | 2.00(0.00) | 0.93(0.03) | 3.93(0.03) | 0.63(0.05) | 5.57(0.07) |
| BIC | 0.00(0.00) | 11.83(0.05) | 0.00(0.00) | 11.82(0.04) | 0.00(0.00) | 11.86(0.04) |
|
| ||||||
| n = 50, K = 20 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.98(0.01) | 2.02(0.01) | 0.90(0.03) | 4.10(0.03) | 0.76(0.04) | 6.06(0.05) |
| IC0.5 | 0.98(0.01) | 2.02(0.01) | 0.94(0.02) | 3.98(0.02) | 0.81(0.04) | 5.99(0.04) |
| IC1 | 0.98(0.01) | 2.02(0.01) | 0.94(0.02) | 3.94(0.02) | 0.74(0.04) | 5.81(0.05) |
| BIC | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 11.99(0.01) |
|
| ||||||
| n = 50, K = 50 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.96(0.02) | 2.04(0.02) | 0.88(0.03) | 4.12(0.03) | 0.68(0.05) | 6.57(0.13) |
| IC0.5 | 0.98(0.01) | 2.02(0.01) | 0.94(0.02) | 4.04(0.02) | 0.82(0.04) | 6.06(0.04) |
| IC1 | 0.98(0.01) | 2.02(0.01) | 0.94(0.02) | 4.02(0.02) | 0.74(0.04) | 5.75(0.05) |
| BIC | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 12.00(0.00) |
References
- Allen Genevera. Sparse higher-order principal components analysis. In International Conference on Artificial Intelligence and Statistics, pages 27–36, 2012.
- Audibert Jean-Yves and Tsybakov Alexandre B. Fast learning rates for plug-in classifiers. Ann. Statist., 35(2):608–633, 2007. doi: 10.1214/009053606000001217.
- Bai Jushan and Ng Serena. Determining the number of factors in approximate factor models. Econometrica, 70(1):191–221, 2002. doi: 10.1111/1468-0262.00273.
- Banerjee Sudipto and Roy Anindya. Linear Algebra and Matrix Analysis for Statistics. CRC Press, 2014.
- Boyd Stephen, Parikh Neal, Chu Eric, Peleato Borja, and Eckstein Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
- Chi Eric C. and Kolda Tamara G. On tensors, sparsity, and nonnegative factorizations. SIAM J. Matrix Anal. Appl., 33(4):1272–1299, 2012. doi: 10.1137/110859063.
- Fan Yingying and Tang Cheng Yong. Tuning parameter selection in high dimensional penalized likelihood. J. R. Stat. Soc. Ser. B. Stat. Methodol., 75(3):531–552, 2013. doi: 10.1111/rssb.12001.
- Galassi Mark, Davies Jim, Theiler James, Gough Brian, Jungman Gerard, Alken Patrick, Booth Michael, Rossi Fabrice, and Ulerich Rhys. GNU Scientific Library Reference Manual (Version 2.1), 2015. URL http://www.gnu.org/software/gsl/.
- Getoor Lise and Mihalkova Lilyana. Learning statistical models from relational data. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 1195–1198. ACM, 2011.
- Harshman Richard A. Models for analysis of asymmetrical relationships among n objects or stimuli. Paper presented at the First Joint Meeting of the Psychometric Society and the Society for Mathematical Psychology, Hamilton, Ontario, August 1978.
- Harshman Richard A. and Lundy Margaret E. PARAFAC: parallel factor analysis. Computational Statistics & Data Analysis, 18(1):39–72, 1994.
- Kemp Charles, Tenenbaum Joshua B., Griffiths Thomas L., Yamada Takeshi, and Ueda Naonori. Learning systems of concepts with an infinite relational model. In AAAI, volume 3, page 5, 2006.
- Kok Stanley and Domingos Pedro. Statistical predicate invention. In Proceedings of the 24th International Conference on Machine Learning, pages 433–440. ACM, 2007.
- Kolda Tamara G. and Bader Brett W. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, 2009. doi: 10.1137/07070111X.
- Madan Anmol, Cebrian Manuel, Moturu Sai, Farrahi Katayoun, et al. Sensing the "health state" of a community. IEEE Pervasive Computing, 11(4):36–45, 2012.
- Mendelson Shahar, Pajor Alain, and Tomczak-Jaegermann Nicole. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constructive Approximation, 28(3):277–289, 2008.
- Nickel Maximilian. Tensor Factorization for Relational Learning. PhD thesis, Ludwig-Maximilians-University of Munich, 2013.
- Nickel Maximilian and Tresp Volker. Logistic tensor factorization for multi-relational data. arXiv preprint arXiv:1306.2084, 2013.
- Nickel Maximilian, Tresp Volker, and Kriegel Hans-Peter. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 809–816, 2011.
- Nickel Maximilian, Murphy Kevin, Tresp Volker, and Gabrilovich Evgeniy. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
- Nowicki Krzysztof and Snijders Tom A. B. Estimation and prediction for stochastic blockstructures. J. Amer. Statist. Assoc., 96(455):1077–1087, 2001. doi: 10.1198/016214501753208735.
- Richardson Matthew and Domingos Pedro. Markov logic networks. Machine Learning, 62(1):107–136, 2006.
- Schwarz Gideon. Estimating the dimension of a model. Ann. Statist., 6(2):461–464, 1978.
- Sun Will Wei, Lu Junwei, Liu Han, and Cheng Guang. Provable sparse tensor decomposition. J. R. Stat. Soc. Ser. B. Stat. Methodol., 79(3):899–916, 2017. doi: 10.1111/rssb.12190.
- Tsybakov Alexandre B. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32(1):135–166, 2004. doi: 10.1214/aos/1079120131.
- Tucker Ledyard R. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
- van der Vaart Aad W. and Wellner Jon A. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer-Verlag, New York, 1996. doi: 10.1007/978-1-4757-2545-2.
- Wang Yu, Yin Wotao, and Zeng Jinshan. Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324, 2017.
- Yang Yun and Dunson David B. Bayesian conditional tensor factorizations for high-dimensional classification. J. Amer. Statist. Assoc., 111(514):656–669, 2016. doi: 10.1080/01621459.2015.1029129.
- Zhang Xiang, Wu Yichao, Wang Lan, and Li Runze. A consistent information criterion for support vector machines in diverging model spaces. J. Mach. Learn. Res., 17(1):1–26, 2016.