Published in final edited form as: J Mach Learn Res. 2019;20:23.

Determining the Number of Latent Factors in Statistical Multi-Relational Learning

Chengchun Shi 1, Wenbin Lu 1, Rui Song 1

Abstract

Statistical relational learning is primarily concerned with learning and inferring relationships between entities in large-scale knowledge graphs. Nickel et al. (2011) proposed a RESCAL tensor factorization model for statistical relational learning, which achieves better or at least comparable results on common benchmark data sets when compared to other state-of-the-art methods. Given a positive integer s, RESCAL computes an s-dimensional latent vector for each entity. The latent factors can be further used for solving relational learning tasks, such as collective classification, collective entity resolution and link-based clustering.

The focus of this paper is to determine the number of latent factors in the RESCAL model. Due to the structure of the RESCAL model, its log-likelihood function is not concave. As a result, the corresponding maximum likelihood estimators (MLEs) may not be consistent. Nonetheless, we design a specific pseudometric, prove the consistency of the MLEs under this pseudometric and establish their rate of convergence. Based on these results, we propose a general class of information criteria and prove their model selection consistency when the number of relations is either bounded or diverges at a proper rate with the number of entities. Simulations and real data examples show that our proposed information criteria have good finite sample properties.

Keywords: Information criteria, Knowledge graph, Model selection consistency, RESCAL model, Statistical relational learning, Tensor factorization

1. Introduction

Relational data is becoming ubiquitous in artificial intelligence and social network analysis. These data sets are in the form of graphs, with nodes and edges representing entities and relationships, respectively. Recently, a number of companies have developed and released their knowledge graphs, including the Google Knowledge Graph, Microsoft Bing's Satori Knowledge Base, Yandex's Object Answer, the LinkedIn Knowledge Graph, etc. These knowledge graphs are graph-structured knowledge bases that store factual information as relationships between entities. They are created via the automatic extraction of semantic relationships from semi-structured or unstructured text (see Section II.C in Nickel et al., 2016). The data may be incomplete, noisy and contain false information. It is therefore of great importance to infer the existence of a particular relationship in order to improve the quality of the extracted information.

Statistical relational learning is primarily concerned with learning from relational data sets, and solving tasks such as predicting whether two entities are related (link prediction), identifying equivalent entities (entity resolution), and grouping similar entities based on their relationships (link-based clustering). Statistical relational models can be roughly divided into three categories: relational graphical models, latent class models and tensor factorization models. Relational graphical models include probabilistic relational models (Getoor and Mihalkova, 2011) and Markov logic networks (MLN, Richardson and Domingos, 2006). These models are constructed via Bayesian or Markov networks. In latent class models, each entity is assigned to one of the latent classes and the probability of a relationship between entities depends on their corresponding classes. Two important examples are the stochastic block model (SBM, Nowicki and Snijders, 2001) and the infinite relational model (IRM, Kemp et al., 2006). IRM can be viewed as a nonparametric extension of SBM where the total number of clusters is not prespecified. Both models have received considerable attention in the statistics and machine learning literature for community detection in networks.

Tensors are multidimensional arrays. Tensor factorization methods such as CANDECOMP/PARAFAC (CP, Harshman and Lundy, 1994), Tucker (Tucker, 1966) and their extensions have found applications in a variety of fields. Kolda and Bader (2009) presented a thorough overview of tensor decompositions and their applications. Recently, tensor factorizations have been actively studied in the statistics literature and have become an emerging area of statistics. To name a few examples, Chi and Kolda (2012) developed a Poisson tensor factorization model for sparse count data. Yang and Dunson (2016) proposed a conditional tensor factorization model for high-dimensional classification with categorical predictors. Sun et al. (2017) proposed a sparse tensor decomposition method by incorporating a truncation step into the tensor power iteration step.

Relational data sets are typically expressed as (subject, predicate, object) triples and can be grouped as a third-order tensor. As a result, tensor factorization methods can be naturally applied to these data sets. Nickel (2013) proposed a RESCAL factorization model for statistical relational learning. Compared to other tensor factorization approaches such as CP and Tucker methods, RESCAL is more capable of detecting the correlations produced between multiple interconnected nodes. For relational data consisting of n entities, K types of relations, and a positive integer s, RESCAL computes an n × s factor matrix and an s × s × K core tensor. The factor matrix and the core tensor can be further used for link prediction, entity resolution and link-based clustering. Nickel et al. (2011) showed that a linear RESCAL model achieved better or comparable results on common benchmark data sets when compared to other existing methods such as MLN, DEDICOM (Harshman, 1978), IRM, CP, MRC (Kok and Domingos, 2007), etc. It was shown in Nickel and Tresp (2013) that a logistic RESCAL model could further improve the link prediction results.

Central to the empirical validity of RESCAL is the correct specification of the number of latent factors. Nickel et al. (2011) proposed to select this parameter via cross-validation. As commonly known for cross-validation methods, there’s no theoretical guarantee against overestimation. Besides, cross-validation can be computationally expensive, especially for large n and K. In the literature, model selection is less studied for tensor factorization methods. Allen (2012) and Sun et al. (2017) proposed to use Bayesian information criteria (BIC, Schwarz, 1978) for sparse CP decomposition. However, no theoretical results were provided for BIC. Indeed, we show in this paper that a BIC-type criterion may fail for the RESCAL model.

The contribution of this paper is twofold. First, we propose a general class of information criteria for the RESCAL model and prove their model selection consistency. Although we focus on the RESCAL model, our information criteria can be extended to select models for general tensor factorization methods with slight modification. The problem is nonstandard and challenging since both the factor matrix and the core tensor are not observed and need to be estimated. Besides, the model parameters are non-identifiable. Moreover, the derivation of model/tuning parameter selection consistency of information criteria usually relies on the (uniform) consistency of estimated parameters. For example, Fan and Tang (2013) derived the uniform consistency of the maximum likelihood estimators (MLEs) to prove the consistency of GIC (see Proposition 2 in that paper). Zhang et al. (2016) established the uniform consistency of the support vector machine solutions to prove the consistency of SVMICh (see Lemma 2 in that paper). The consistency of these estimators is due to the concavity (convexity) of the likelihood (or the empirical loss) functions. In contrast, for most tensor decomposition models including RESCAL, the likelihood (or the empirical loss) function is usually non-concave (non-convex) and may have multiple local solutions. As a result, the corresponding global maximizer (minimizer) may not be consistent even with the identifiability constraints. It remains unknown how to establish the consistency of the information criterion without consistency of the estimator. A key innovation in our analysis is to design a "proper" pseudometric and show that the global optimum is consistent under this specific pseudometric. We further establish the rate of convergence of the global optimum under this pseudometric as a function of n and K. Based on these results, we establish the consistency of our information criteria when K is either bounded or diverges at a proper rate of n. No parametric assumptions are imposed on the latent factors. Second, we introduce a scalable algorithm for estimating the parameters in the logistic RESCAL model. Although a linear RESCAL model can be conveniently solved by an alternating least squares algorithm (Nickel et al., 2011), there is a lack of optimization algorithms for solving general RESCAL models. The proposed algorithm is based on the alternating direction method of multipliers (ADMM, Boyd et al., 2011) and can be implemented in a parallelized fashion.

The rest of the paper is organized as follows. We formally introduce the RESCAL model and study the parameter identifiability in Section 2. Our information criteria are presented in Section 3 and their model selection properties are investigated. Numerical examples are presented in Section 4 to examine the finite sample performance of the proposed information criteria. Section 5 concludes with a summary and discussion of future extensions. All the proofs are given in the Appendix.

2. The RESCAL Model

This section is structured as follows. We introduce the RESCAL model in Section 2.1. In Section 2.2, we study the identifiability of parameters in the model.

2.1. Model Setup

In knowledge graphs, facts can be expressed in the form of (subject, predicate, object) triples, where subject and object are entities and predicate is the relation between entities. For example, consider the following sentence from Wikipedia:

Jon Snow is a fictional character in the A Song of Ice and Fire series of fantasy novels by American author George R. R. Martin, and its television adaptation Game of Thrones.

The information contained in this message can be summarized into the following set of (subject, predicate, object) triples:

Subject Predicate Object
Jon Snow character in A Song of Ice and Fire
Jon Snow character in Game of Thrones
A Song of Ice and Fire genre novel
Game of Thrones genre television series
George R.R. Martin author of A Song of Ice and Fire
George R.R. Martin profession novelist

In this example, we have a total of 7 entities, 4 types of relations and 6 triples. More generally, let $\mathcal{E}=\{e_1,\dots,e_n\}$ denote the set of all entities and $\mathcal{R}=\{r_1,\dots,r_K\}$ denote the set of all relation types. The number of relations $K$ is either bounded or diverges with $n$. Assuming non-existing triples indicate false relationships, we can construct a third-order binary tensor

$$\mathcal{Y}=\{Y_{ijk}\}_{i,j\in\{1,\dots,n\},\,k\in\{1,\dots,K\}},$$

such that

$$Y_{ijk}=\begin{cases}1, & \text{if a triple } (e_i,r_k,e_j) \text{ exists},\\ 0, & \text{otherwise}.\end{cases}$$
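To make the data structure concrete, the following sketch (NumPy; the entity and relation labels are taken from the example above, everything else is illustrative rather than code from the paper) builds the binary tensor for these six triples.

```python
import numpy as np

# Encode the example triples from Section 2.1 as a binary tensor Y of shape
# (n, n, K), where Y[i, j, k] = 1 iff the triple (e_i, r_k, e_j) exists.
entities = ["Jon Snow", "A Song of Ice and Fire", "Game of Thrones",
            "George R.R. Martin", "novel", "television series", "novelist"]
relations = ["character in", "genre", "author of", "profession"]
triples = [
    ("Jon Snow", "character in", "A Song of Ice and Fire"),
    ("Jon Snow", "character in", "Game of Thrones"),
    ("A Song of Ice and Fire", "genre", "novel"),
    ("Game of Thrones", "genre", "television series"),
    ("George R.R. Martin", "author of", "A Song of Ice and Fire"),
    ("George R.R. Martin", "profession", "novelist"),
]

ent_idx = {e: i for i, e in enumerate(entities)}
rel_idx = {r: k for k, r in enumerate(relations)}

n, K = len(entities), len(relations)
Y = np.zeros((n, n, K), dtype=int)
for subj, pred, obj in triples:
    # missing triples stay 0, i.e. are treated as false relationships
    Y[ent_idx[subj], ent_idx[obj], rel_idx[pred]] = 1
```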

The RESCAL model is defined as follows. For each entity $e_i$, a latent vector $a_{i,0}\in\mathbb{R}^{s_0}$ is generated. The $Y_{ijk}$'s are assumed to be conditionally independent given all latent factors $\{a_{i,0}\}_{i=1}^n$. Besides, it is assumed that

$$\Pr\big(Y_{ijk}=1\mid\{a_{i,0}\}_{i=1}^n\big)=g\big(a_{i,0}^TR_{k,0}\,a_{j,0}\big), \qquad (1)$$

for some strictly monotone link function $g$ and $s_0\times s_0$ matrices $R_{1,0},\dots,R_{K,0}$. In the above model, $a_{i,0}$ corresponds to the latent representation of the $i$th entity and $R_{k,0}$ specifies how these $a_{i,0}$'s interact for the $k$th relation. To account for asymmetric relations, we do not restrict the $R_{k,0}$'s to symmetric matrices. When the relations are symmetric, i.e.,

$$\Pr\big(Y_{ijk}=1\mid\{a_{i,0}\}_{i=1}^n\big)=\Pr\big(Y_{jik}=1\mid\{a_{i,0}\}_{i=1}^n\big),\quad\forall i,j,k,$$

one can impose the symmetry constraints and obtain a similar derivation.

For continuous $Y_{ijk}$, a related tensor factorization model is the TUCKER-2 decomposition, which decomposes the tensor into

$$Y_{ijk}=a_{i,0}^TR_{k,0}\,b_{j,0}+e_{ijk},\quad\forall i,j,k, \qquad (2)$$

for some $a_{1,0},\dots,a_{n,0}\in\mathbb{R}^{s_1}$, $b_{1,0},\dots,b_{n,0}\in\mathbb{R}^{s_2}$, $R_{1,0},\dots,R_{K,0}\in\mathbb{R}^{s_1\times s_2}$ and some (random) errors $\{e_{ijk}\}_{ijk}$. By Equation 1, RESCAL can be interpreted as a "nonlinear" TUCKER-2 model with the additional constraints that $s_1=s_2=s_0$ and $a_{i,0}=b_{i,0}$, $\forall i$.

CP decomposition is another important tensor factorization method that decomposes a tensor into a sum of rank-1 tensors. It assumes that

$$Y_{ijk}=\sum_{s=1}^{s_0}a_{i,s}\,b_{j,s}\,r_{k,s}+e_{ijk},$$

for some $\{a_{i,s}\}_{i,s}$, $\{b_{j,s}\}_{j,s}$, $\{r_{k,s}\}_{k,s}$ and $\{e_{ijk}\}_{ijk}$. Define $a_{i,0}=(a_{i,1},\dots,a_{i,s_0})^T$ and $b_{i,0}=(b_{i,1},\dots,b_{i,s_0})^T$. In view of Equation 2, CP is a special TUCKER-2 model with the constraints that $s_1=s_2=s_0$ and $R_{k,0}=\mathrm{diag}(r_{k,1},\dots,r_{k,s_0})$, where $\mathrm{diag}(r_{k,1},\dots,r_{k,s_0})$ is a diagonal matrix whose $s$th diagonal element is $r_{k,s}$.

In this paper, the proposed information criteria are designed in particular for the RESCAL model. However, they can be extended to estimate s0 in a more general tensor factorization framework including CP and TUCKER-2 models. We discuss this further in Section 5.
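The following sketch (NumPy, with made-up dimensions) contrasts how the bilinear score $\theta_{ijk}$ is formed under the three factorizations just discussed; it is only meant to illustrate the constraints and is not an implementation from the paper.

```python
import numpy as np

# Illustrative shapes only.
n, K, s = 5, 3, 2
rng = np.random.default_rng(0)
A = rng.normal(size=(n, s))            # a_i stored as rows
B = rng.normal(size=(n, s))            # b_j (TUCKER-2 / CP only)
R = rng.normal(size=(K, s, s))         # core slices R_k

# TUCKER-2: theta_ijk = a_i^T R_k b_j
theta_tucker2 = np.einsum("is,kst,jt->ijk", A, R, B)

# CP: R_k restricted to diagonal matrices
R_cp = np.stack([np.diag(np.diag(R[k])) for k in range(K)])
theta_cp = np.einsum("is,kst,jt->ijk", A, R_cp, B)

# RESCAL: b_i = a_i, and a link g applied to the score (logistic here)
theta_rescal = np.einsum("is,kst,jt->ijk", A, R, A)
prob_rescal = 1.0 / (1.0 + np.exp(-theta_rescal))   # g(x) = 1/(1+exp(-x))
```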

2.2. Identifiability

The parameterization in Equation 1 is not identifiable. To see this, for any nonsingular matrix $C\in\mathbb{R}^{s_0\times s_0}$, we define $a_i=C^{-1}a_{i,0}$ and $R_k=C^TR_{k,0}C$, $\forall i,k$. Observe that

$$a_{i,0}^TR_{k,0}\,a_{j,0}=a_i^TR_k\,a_j,\quad\forall i,j,k,$$

and hence we have

$$\Pr(Y_{ijk}=1)=g\big(a_i^TR_k\,a_j\big).$$

Let $A_0=[a_{1,0},\dots,a_{n,0}]^T$. We impose the following condition.

(A0) (i) Assume $A_0$ has full column rank. (ii) Assume the matrix $[R_{1,0}^T,\dots,R_{K,0}^T]$ has full row rank.

(A0)(i) requires the latent factors to be linearly independent. (A0)(ii) holds when at least one of the $R_{k,0}$'s has full rank. Under Condition (A0), the following lemma states that the RESCAL model is identifiable up to a nonsingular linear transformation. In Section B.1 of the Appendix, we show that (A0) is also necessary to guarantee such identifiability when $R_{1,0},\dots,R_{K,0}$ are symmetric.

Lemma 1 (Identifiability).

Assume (A0) holds. Assume there exist some $\{a_i\}_i$, $\{R_k\}_k$ such that $a_i\in\mathbb{R}^{s_0}$, $R_k\in\mathbb{R}^{s_0\times s_0}$ and

$$g\big(a_{i,0}^TR_{k,0}\,a_{j,0}\big)=g\big(a_i^TR_k\,a_j\big),\quad\forall i,j,k.$$

Then, there exists some invertible matrix $C\in\mathbb{R}^{s_0\times s_0}$ such that

$$a_i=C^{-1}a_{i,0}\quad\text{and}\quad R_k=C^TR_{k,0}\,C.$$

To fix the nonsingular transformation indeterminacy, we adopt a specific constrained parameterization and focus on estimating $\{a_i^*\}_i$ and $\{R_k^*\}_k$, where

$$a_i^*=(A_{s_0,0}^{-1})^Ta_{i,0}\quad\text{and}\quad R_k^*=A_{s_0,0}\,R_{k,0}\,A_{s_0,0}^T,$$

where $A_{s_0,0}=[a_{1,0},\dots,a_{s_0,0}]^T$. Observe that

$$[a_1^*,\dots,a_{s_0}^*]=(A_{s_0,0}^{-1})^T[a_{1,0},\dots,a_{s_0,0}]=(A_{s_0,0}^{-1})^TA_{s_0,0}^T=I_{s_0},$$

where $I_{s_0}$ stands for the $s_0\times s_0$ identity matrix. Therefore, the first $s_0$ $a_i^*$'s are fixed as long as $A_{s_0,0}$ is nonsingular. By Lemma 1, the parameters $\{a_i^*\}_i$ and $\{R_k^*\}_k$ are estimable.

From now on, we only consider the logistic link function for simplicity, i.e., $g(x)=1/\{1+\exp(-x)\}$. Results for other link functions can be discussed similarly.

3. Model Selection

Parameters $\{a_i^*\}_{i=1}^n$ and $\{R_k^*\}_{k=1}^K$ can be estimated by maximizing the (conditional) log-likelihood function. Since we use the logistic link function, the log-likelihood is equal to

$$\ell_n(\mathcal{Y};\{a_i\}_i,\{R_k\}_k)=\log\Big(\prod_{ijk}g\big(a_i^TR_ka_j\big)^{Y_{ijk}}\big\{1-g\big(a_i^TR_ka_j\big)\big\}^{1-Y_{ijk}}\Big)$$
$$=\sum_{ijk}\Big(Y_{ijk}\log\big\{g\big(a_i^TR_ka_j\big)\big\}+(1-Y_{ijk})\log\big\{1-g\big(a_i^TR_ka_j\big)\big\}\Big)$$
$$=\sum_{ijk}\Big(Y_{ijk}\,a_i^TR_ka_j-\log\big\{1+\exp\big(a_i^TR_ka_j\big)\big\}\Big),$$

where the first equality is due to the conditional independence assumption.
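As a sanity check on the display above, the log-likelihood can be evaluated directly from a factor matrix and a core tensor; the sketch below (NumPy, hypothetical shapes and names) uses a numerically stable form of $\log\{1+\exp(\cdot)\}$.

```python
import numpy as np

def rescal_loglik(Y, A, R):
    """Logistic RESCAL log-likelihood, following the display above (a sketch).

    Y : (n, n, K) binary array, A : (n, s) latent factors, R : (K, s, s) cores.
    Returns sum_{ijk} [ Y_ijk * theta_ijk - log(1 + exp(theta_ijk)) ].
    """
    theta = np.einsum("is,kst,jt->ijk", A, R, A)      # theta_ijk = a_i^T R_k a_j
    # log(1 + exp(theta)) computed stably via logaddexp
    return float(np.sum(Y * theta - np.logaddexp(0.0, theta)))
```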

We assume the number of latent factors $s_0$ is fixed. For any $s\in\{1,\dots,s_{\max}\}$, where $s_{\max}$ is allowed to diverge with $n$ and satisfies $s_{\max}\ge s_0$, we define the following constrained maximum likelihood estimator

$$(\{\hat a_i^{(s)}\}_i,\{\hat R_k^{(s)}\}_k)=\arg\max_{\substack{a_1^{(s)},\dots,a_n^{(s)}\in\Theta_a^{(s)}\\ \mathrm{vec}(R_1^{(s)}),\dots,\mathrm{vec}(R_K^{(s)})\in\Theta_r^{(s)}}}\ell_n\big(\mathcal{Y};\{a_i^{(s)}\}_i,\{R_k^{(s)}\}_k\big), \qquad (3)$$
$$\text{subject to }[a_1^{(s)},\dots,a_s^{(s)}]=I_s, \qquad (4)$$

for some $\Theta_a^{(s)}\subseteq\mathbb{R}^{s}$ and $\Theta_r^{(s)}\subseteq\mathbb{R}^{s^2}$, where the $\mathrm{vec}(\cdot)$ operator stacks the entries of a matrix into a column vector. To estimate the number of latent factors, we define the following likelihood-based information criteria

$$\mathrm{IC}(s)=2\ell_n\big(\mathcal{Y};\{\hat a_i^{(s)}\}_i,\{\hat R_k^{(s)}\}_k\big)-s\,\kappa(n,K),$$

for some penalty function $\kappa(\cdot,\cdot)$. The estimated number of latent factors is given by

$$\hat s=\arg\max_{s\in\{1,\dots,s_{\max}\}}\mathrm{IC}(s). \qquad (5)$$

In addition to the constraint in Equation 4, there exist many other constraints that would make the estimators identifiable. The choice of the identifiability constraint might affect the value of IC; however, it does not affect the value of $\hat s$. Detailed discussions can be found in Section A of the Appendix.
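A minimal sketch of the selection rule in Equations 3-5 is given below (NumPy). The solver `fit_rescal` and the penalty `kappa` are placeholders for the constrained MLE of Equation 3 (e.g. the ADMM algorithm of Section 4.1) and for a penalty satisfying the conditions developed below; neither is a function from the paper.

```python
import numpy as np

def select_num_factors(Y, fit_rescal, kappa, s_max):
    """Compute IC(s) for s = 1, ..., s_max and return the maximizer s_hat.

    fit_rescal(Y, s) is assumed to return (A_hat, R_hat) with shapes
    (n, s) and (K, s, s); kappa(n, K) is the penalty function.
    """
    n, _, K = Y.shape
    ic = np.full(s_max + 1, -np.inf)
    for s in range(1, s_max + 1):
        A_hat, R_hat = fit_rescal(Y, s)
        theta = np.einsum("is,kst,jt->ijk", A_hat, R_hat, A_hat)
        loglik = np.sum(Y * theta - np.logaddexp(0.0, theta))
        ic[s] = 2.0 * loglik - s * kappa(n, K)
    s_hat = int(np.argmax(ic[1:])) + 1
    return s_hat, ic
```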

A major technical difficulty in establishing the consistency of IC is the nonconcavity of the objective function in Equation 3. For any $\{a_j\}_{j\in\{1,\dots,n\}}$ and $\{R_k\}_{k\in\{1,\dots,K\}}$, let

$$\beta=\big(a_1^T,\dots,a_n^T,\mathrm{vec}(R_1)^T,\dots,\mathrm{vec}(R_K)^T\big)^T,$$

be the set of parameters.

For any $b_1,\dots,b_n\in\mathbb{R}^{s}$ and $T_1,\dots,T_K\in\mathbb{R}^{s\times s}$, we define

$$\zeta=\big(b_1^T,\dots,b_n^T,\mathrm{vec}(T_1)^T,\dots,\mathrm{vec}(T_K)^T\big)^T.$$

With some calculations, we can show that

$$-\zeta^T\frac{\partial^2\ell_n}{\partial\beta\,\partial\beta^T}\zeta=\underbrace{\sum_{ijk}\pi_{ijk}(1-\pi_{ijk})\big(b_i^TR_ka_j+a_i^TR_kb_j+a_i^TT_ka_j\big)^2}_{I_1}+\underbrace{\sum_{ijk}(\pi_{ijk}-Y_{ijk})\big(2\,b_i^TR_kb_j+b_i^TT_ka_j+a_i^TT_kb_j\big)}_{I_2},$$

where $\pi_{ijk}=\exp(a_i^TR_ka_j)/\{1+\exp(a_i^TR_ka_j)\}$. Here, $I_1$ is nonnegative. However, $I_2$ can be negative for some $\beta$ and $\zeta$. Therefore, the negative Hessian matrix is not positive semidefinite and the likelihood function is not concave. As a result, $\hat a_i^{(s_0)}$ and $\hat R_k^{(s_0)}$ may not be consistent for $a_i^*$ and $R_k^*$, even with the identifiability constraints in Equation 4. The presence of $I_2$ is due to the bilinear formulation of the RESCAL model.

Let $\theta_{ijk}=a_i^TR_ka_j$. Notice that $\ell_n$ is concave in $\theta_{ijk}$, $\forall i,j,k$. This motivates us to consider the following pseudometric:

$$d\big(\{a_{i,1}^{(s_1)}\}_i,\{R_{k,1}^{(s_1)}\}_k;\{a_{i,2}^{(s_2)}\}_i,\{R_{k,2}^{(s_2)}\}_k\big)=\bigg\{\frac{1}{n^2K}\sum_{ijk}\Big(\big(a_{i,1}^{(s_1)}\big)^TR_{k,1}^{(s_1)}a_{j,1}^{(s_1)}-\big(a_{i,2}^{(s_2)}\big)^TR_{k,2}^{(s_2)}a_{j,2}^{(s_2)}\Big)^2\bigg\}^{1/2},$$

for any integers $s_1,s_2>0$ and $a_{i,1}^{(s_1)}\in\mathbb{R}^{s_1}$, $R_{k,1}^{(s_1)}\in\mathbb{R}^{s_1\times s_1}$, $a_{i,2}^{(s_2)}\in\mathbb{R}^{s_2}$, $R_{k,2}^{(s_2)}\in\mathbb{R}^{s_2\times s_2}$. Clearly, $d(\cdot,\cdot)$ is nonnegative, symmetric and satisfies the triangle inequality. Below, we establish the convergence rate of

$$d\big(\{\hat a_i^{(s)}\}_i,\{\hat R_k^{(s)}\}_k;\{a_i^*\}_i,\{R_k^*\}_k\big).$$
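For concreteness, this pseudometric can be computed directly from two sets of fitted factors; a short NumPy sketch (illustrative names only, not from the paper) is:

```python
import numpy as np

def pseudometric(A1, R1, A2, R2):
    """Root mean squared difference of the bilinear scores theta_ijk over all
    (i, j, k), as in the definition above. A1, A2 may have different latent
    dimensions s1 and s2; R1, R2 have shapes (K, s1, s1) and (K, s2, s2)."""
    theta1 = np.einsum("is,kst,jt->ijk", A1, R1, A1)
    theta2 = np.einsum("is,kst,jt->ijk", A2, R2, A2)
    n, _, K = theta1.shape
    return float(np.sqrt(np.sum((theta1 - theta2) ** 2) / (n * n * K)))
```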

We first introduce some notation. For any s > s0, we define

$$a_{i,0}^{(s)}=\begin{cases}\big(a_{i,0}^T,\;\mathbf{0}_{s-s_0}^T\big)^T, & i\notin\{s_0+1,\dots,s\},\\[4pt]\big(a_{i,0}^T,\;\underbrace{0,\dots,0}_{i-s_0-1},\,1,\,\underbrace{0,\dots,0}_{s-i}\big)^T, & i\in\{s_0+1,\dots,s\},\end{cases}$$

and

$$R_{k,0}^{(s)}=\begin{pmatrix}R_{k,0} & O_{s_0,\,s-s_0}\\ O_{s-s_0,\,s_0} & O_{s-s_0,\,s-s_0}\end{pmatrix},$$

where $\mathbf{0}_q$ denotes a $q$-dimensional zero vector and $O_{p,q}$ is a $p\times q$ zero matrix. With a slight abuse of notation, we write $a_{i,0}^{(s_0)}=a_{i,0}$ and $R_{k,0}^{(s_0)}=R_{k,0}$. Clearly, for any $s\ge s_0$, we have

$$\big(a_{i,0}^{(s)}\big)^TR_{k,0}^{(s)}\,a_{j,0}^{(s)}=a_{i,0}^TR_{k,0}\,a_{j,0},\quad\forall i,j,k,$$

and hence

$$\big(\{a_{i,0}^{(s)}\}_i,\{R_{k,0}^{(s)}\}_k\big)=\arg\max_{\{a_i^{(s)}\}_i,\{R_k^{(s)}\}_k}\mathbb{E}\,\ell_n\big(\mathcal{Y};\{a_i^{(s)}\}_i,\{R_k^{(s)}\}_k\big).$$

Let

$$a_i^{(s)*}=(A_{s,0}^{-1})^Ta_{i,0}^{(s)}\quad\text{and}\quad R_k^{(s)*}=A_{s,0}\,R_{k,0}^{(s)}\,A_{s,0}^T,$$

where $A_{s,0}=[a_{1,0}^{(s)},\dots,a_{s,0}^{(s)}]^T$. When $A_{s_0,0}$ is invertible, $A_{s,0}$ is invertible for all $s>s_0$. The $a_i^{(s)*}$'s defined in this way satisfy the identifiability constraint in Equation 4 for all $s\ge s_0$. We make the following assumption.

(A1) Assume $a_i^{(s)*}\in\Theta_a^{(s)}$ and $\mathrm{vec}(R_k^{(s)*})\in\Theta_r^{(s)}$ for $i=1,\dots,n$, $k=1,\dots,K$ and $s_0\le s\le s_{\max}$. In addition, assume $\sup_{x\in\Theta_a^{(s)}}\|x\|_2\le\omega_a$ and $\sup_{y\in\Theta_r^{(s)}}\|y\|_2\le\omega_r$ for some $\omega_a,\omega_r>0$.

Lemma 2.

Assume (A1) holds and $s_{\max}=o\{n/\log(nK)\}$. Then there exists some constant $C_0>0$ such that the following event occurs with probability tending to 1:

$$\max_{s\in\{s_0,\dots,s_{\max}\}}\frac{1}{s^2}\,d^2\big(\{\hat a_i^{(s)}\}_i,\{\hat R_k^{(s)}\}_k;\{a_i^*\}_i,\{R_k^*\}_k\big)\le C_0\exp(2\omega_a^2\omega_r)\,\frac{(n+K)(\log n+\log K)}{n^2K}.$$

Under the condition $s_{\max}=o\{n/\log(nK)\}$, we have that

$$\frac{s_{\max}^2(n+K)(\log n+\log K)}{n^2K}\le\frac{s_{\max}^2(2n\log n+2K\log K)}{n^2K}\le\frac{2s_{\max}^2\log n}{n}+\frac{2s_{\max}^2\log K}{n^2}=o(1).$$

When ωa and ωr are bounded, it follows that

$$\max_{s\in\{s_0,\dots,s_{\max}\}}d^2\big(\{\hat a_i^{(s)}\}_i,\{\hat R_k^{(s)}\}_k;\{a_i^*\}_i,\{R_k^*\}_k\big)\le C_0\,s_{\max}^2\exp(2\omega_a^2\omega_r)\,\frac{(n+K)(\log n+\log K)}{n^2K}=o(1).$$

Hence, $\{\hat a_i^{(s)}\}_i$ and $\{\hat R_k^{(s)}\}_k$ are consistent under the pseudometric $d$ for all overfitted models. In contrast, for underfitted models, we require the following conditions.

(A2) Assume there exists some constant $\bar c>0$ such that $\lambda_{\min}(A_0^TA_0)\ge n\bar c$.

(A3) Let $\bar K=\lambda_{\min}\big(\sum_{k=1}^KR_{k,0}^TR_{k,0}\big)$. Assume $\liminf_{n\to\infty}\bar K>0$.

Lemma 3.

Assume (A2) and (A3) hold. Then for any $s\in\{1,2,\dots,s_0-1\}$, we have

$$d^2\big(\{\hat a_i^{(s)}\}_i,\{\hat R_k^{(s)}\}_k;\{a_i^*\}_i,\{R_k^*\}_k\big)\ge\frac{\bar c^2\bar K}{K},$$

where $\bar c$ and $\bar K$ are defined in (A2) and (A3), respectively.

Assumption (A3) holds if there exists some $k_0\in\{1,\dots,K\}$ such that

$$\liminf_{n\to\infty}\lambda_{\min}\big(R_{k_0,0}R_{k_0,0}^T\big)>0.$$

When $\bar K\ge cK$ for some constant $c>0$, it follows from Lemma 3 that

$$\liminf_{n\to\infty}d\big(\{\hat a_i^{(s)}\}_i,\{\hat R_k^{(s)}\}_k;\{a_i^*\}_i,\{R_k^*\}_k\big)>0.$$

Based on these results, we establish below the consistency of $\hat s$ defined in Equation 5. For any sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n\sim b_n$ if there exist some universal constants $c_1,c_2>0$ such that $c_1a_n\le b_n\le c_2a_n$.

Theorem 1.

Assume (A1)-(A3) hold, $K\sim n^{l_0}$ for some $0\le l_0\le1$, $s_{\max}=o\{n/\log(nK)\}$, and $n^{(1-l_0)/2}\bar K\exp(-2\omega_a^2\omega_r)/\log n\to\infty$. Assume $\kappa(n,K)$ satisfies

$$s_{\max}\exp(\omega_a^2\omega_r)(n+K)(\log n+\log K)\ll\kappa(n,K)\ll n^2\bar K\exp(-\omega_a^2\omega_r). \qquad (6)$$

Then, we have $\Pr(\hat s=s_0)\to1$, where $\hat s$ is defined in Equation 5.

Let $c(n,K)=\kappa(n,K)(n+K)^{-1}(\log n+\log K)^{-1}$. When $s_{\max}$, $\omega_a$ and $\omega_r$ are bounded, it follows from Theorem 1 that IC is consistent provided that $c(n,K)\to\infty$ and $c(n,K)=o(n\bar K/\log n)$. Define

$$\tau_\alpha(n,K)=\frac{(n+K)^\alpha}{\{\max(n,K)\}^\alpha},$$

for some α ≥ 0. Note that

$$1\le\tau_\alpha(n,K)\le2^\alpha. \qquad (7)$$

Consider the following criteria:

$$\mathrm{IC}_\alpha(s)=2\ell_n\big(\mathcal{Y};\{\hat a_i^{(s)}\}_i,\{\hat R_k^{(s)}\}_k\big)-s\,\tau_\alpha(n,K)\,(n+K)(\log n+\log K)\log\{\log(nK)\}. \qquad (8)$$

Note that the term $\log\{\log(nK)\}$ satisfies $\log\{\log(nK)\}\to\infty$ and $\log\{\log(nK)\}=o(n/\log n)$. It follows from Equation 7 and Theorem 1 that $\mathrm{IC}_\alpha$ is consistent for all $\alpha\ge0$. When $\alpha>0$, the term $\tau_\alpha(n,K)$ adjusts the model complexity penalty upwards. We note that Bai and Ng (2002) used a similar finite sample correction term in their proposed information criteria for approximate factor models. Our simulation studies show that such an adjustment is essential for achieving selection consistency when $K$ is large.
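A direct transcription of the $\mathrm{IC}_\alpha$ penalty into code (NumPy; a sketch, not the authors' implementation) reads:

```python
import numpy as np

def ic_alpha_penalty(n, K, s, alpha):
    """Per-model penalty of IC_alpha in Equation 8.
    tau_alpha(n, K) = {(n + K) / max(n, K)}^alpha lies in [1, 2^alpha]."""
    tau = ((n + K) / max(n, K)) ** alpha
    return s * tau * (n + K) * (np.log(n) + np.log(K)) * np.log(np.log(n * K))

def ic_alpha(loglik, n, K, s, alpha):
    # loglik is the maximized log-likelihood for the candidate dimension s
    return 2.0 * loglik - ic_alpha_penalty(n, K, s, alpha)
```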

Conditions (A1) and (A2) are imposed directly on the realizations of $a_{1,0},\dots,a_{n,0}$. In Sections B.2 and B.3 of the Appendix, we consider an asymptotic framework where $a_{1,0},\dots,a_{n,0}$ are i.i.d. according to some distribution function and show that (A1) and (A2) hold with probability tending to 1. Therefore, under this framework we still have $\Pr(\hat s=s_0)\to1$, and the consistency of our information criteria remains unchanged.

Observe that we have a total of n × n × K = n2K observations. Consider the following BIC-type criterion:

$$\mathrm{BIC}(s)=2\ell_n\big(\mathcal{Y};\{\hat a_i^{(s)}\}_i,\{\hat R_k^{(s)}\}_k\big)-s\log(n^2K). \qquad (9)$$

The model complexity penalty in BIC satisfies

$$\log(n^2K)=2\log n+\log K\ll(n+K)(\log n+\log K).$$

Hence, it does not meet Condition (6) in Theorem 1. As a result, BIC may fail to identify the true model. As shown in our simulation studies, BIC chooses overfitted models and is not selection consistent.

4. Numerical Experiments

This section is organized as follows. In Section 4.1, we introduce our algorithm for computing the maximum likelihood estimators of a logistic RESCAL model. Simulation studies are presented in Section 4.2. In Section 4.3, we apply the proposed information criteria to a real dataset.

4.1. Implementation

In this section, we propose an algorithm for computing $\{\hat a_i^{(s)}\}_i$ and $\{\hat R_k^{(s)}\}_k$. The algorithm is based upon a 3-block alternating direction method of multipliers (ADMM). Setting $\Theta_a^{(s)}=\mathbb{R}^s$, $\Theta_r^{(s)}=\mathbb{R}^{s^2}$ and $[a_1^{(s)},\dots,a_s^{(s)}]=I_s$, the $\hat a_i^{(s)}$'s and $\hat R_k^{(s)}$'s are defined by

$$(\{\hat a_i^{(s)}\}_{i=s+1}^n,\{\hat R_k^{(s)}\}_k)=\arg\max_{\{a_i^{(s)}\}_{i=s+1}^n,\,\{R_k^{(s)}\}_k}\ell_n\big(\mathcal{Y};\{a_i^{(s)}\}_i,\{R_k^{(s)}\}_k\big), \qquad (10)$$

where

$$\ell_n\big(\mathcal{Y};\{a_i^{(s)}\}_i,\{R_k^{(s)}\}_k\big)=\sum_{ijk}\Big(Y_{ijk}\big(a_i^{(s)}\big)^TR_k^{(s)}a_j^{(s)}-\log\big[1+\exp\big\{\big(a_i^{(s)}\big)^TR_k^{(s)}a_j^{(s)}\big\}\big]\Big).$$

For any $b_1,\dots,b_n\in\mathbb{R}^s$, define

$$\bar\ell_n\big(\mathcal{Y};\{a_i^{(s)}\}_i,\{R_k^{(s)}\}_k,\{b_i^{(s)}\}_i\big)=\sum_{ijk}\Big(Y_{ijk}\big(a_i^{(s)}\big)^TR_k^{(s)}b_j^{(s)}-\log\big[1+\exp\big\{\big(a_i^{(s)}\big)^TR_k^{(s)}b_j^{(s)}\big\}\big]\Big).$$

Fixing $[b_1^{(s)},\dots,b_s^{(s)}]=I_s$, the optimization problem in Equation 10 is equivalent to

$$(\{\hat a_i^{(s)}\}_{i=s+1}^n,\{\hat R_k^{(s)}\}_k,\{\hat b_i^{(s)}\}_{i=s+1}^n)=\arg\max_{\{a_i^{(s)}\}_{i=s+1}^n,\,\{R_k^{(s)}\}_k,\,\{b_i^{(s)}\}_{i=s+1}^n}\bar\ell_n\big(\mathcal{Y};\{a_i^{(s)}\}_i,\{R_k^{(s)}\}_k,\{b_i^{(s)}\}_i\big)$$
$$\text{subject to }a_i^{(s)}=b_i^{(s)},\quad i=s+1,\dots,n.$$

We then derive its augmented Lagrangian, which gives us

$$L_\rho\big(\{a_i^{(s)}\}_{i=s+1}^n,\{R_k^{(s)}\}_k,\{b_i^{(s)}\}_{i=s+1}^n,\{v_i^{(s)}\}_{i=s+1}^n\big)=-\bar\ell_n\big(\mathcal{Y};\{a_i^{(s)}\}_i,\{R_k^{(s)}\}_k,\{b_i^{(s)}\}_i\big)+\sum_{i=s+1}^n\rho\,\big(a_i^{(s)}-b_i^{(s)}\big)^Tv_i^{(s)}+\sum_{i=s+1}^n\frac{\rho}{2}\big\|a_i^{(s)}-b_i^{(s)}\big\|_2^2,$$

where $\rho>0$ is a penalty parameter and $v_{s+1}^{(s)},\dots,v_n^{(s)}\in\mathbb{R}^s$ are the dual variables.

Applying the dual descent method yields the following steps, where $l$ denotes the iteration number:

$$\{a_{i,l+1}^{(s)}\}_{i=s+1}^n=\arg\min_{\{a_i^{(s)}\}_{i=s+1}^n}L_\rho\big(\{a_i^{(s)}\}_{i=s+1}^n,\{R_{k,l}^{(s)}\}_k,\{b_{i,l}^{(s)}\}_{i=s+1}^n,\{v_{i,l}^{(s)}\}_{i=s+1}^n\big), \qquad (11)$$
$$\{R_{k,l+1}^{(s)}\}_{k=1}^K=\arg\min_{\{R_k^{(s)}\}_{k=1}^K}L_\rho\big(\{a_{i,l+1}^{(s)}\}_{i=s+1}^n,\{R_k^{(s)}\}_k,\{b_{i,l}^{(s)}\}_{i=s+1}^n,\{v_{i,l}^{(s)}\}_{i=s+1}^n\big), \qquad (12)$$
$$\{b_{i,l+1}^{(s)}\}_{i=s+1}^n=\arg\min_{\{b_i^{(s)}\}_{i=s+1}^n}L_\rho\big(\{a_{i,l+1}^{(s)}\}_{i=s+1}^n,\{R_{k,l+1}^{(s)}\}_k,\{b_i^{(s)}\}_{i=s+1}^n,\{v_{i,l}^{(s)}\}_{i=s+1}^n\big), \qquad (13)$$
$$v_{i,l+1}^{(s)}=v_{i,l}^{(s)}+a_{i,l+1}^{(s)}-b_{i,l+1}^{(s)},\quad i=s+1,\dots,n.$$

Let us examine Equations 11-13 in more detail. In Equation 11, we rewrite the objective function as

$$L_\rho\big(\{a_i^{(s)}\}_{i=s+1}^n,\{R_{k,l}^{(s)}\}_k,\{b_{i,l}^{(s)}\}_{i=s+1}^n,\{v_{i,l}^{(s)}\}_{i=s+1}^n\big)=\sum_{i=s+1}^n\bigg\{\sum_{j,k}\Big(\log\big[1+\exp\big\{\big(a_i^{(s)}\big)^TR_{k,l}^{(s)}b_{j,l}^{(s)}\big\}\big]-Y_{ijk}\big(a_i^{(s)}\big)^TR_{k,l}^{(s)}b_{j,l}^{(s)}\Big)+\rho\,\big(a_i^{(s)}-b_{i,l}^{(s)}\big)^Tv_{i,l}^{(s)}+\frac{\rho}{2}\big\|a_i^{(s)}-b_{i,l}^{(s)}\big\|_2^2\bigg\}.$$

Note that $L_\rho$ can be represented as a separable sum of functions. As a result, the $a_{i,l+1}^{(s)}$'s can be solved for in parallel. More specifically, we have

$$a_{i,l+1}^{(s)}=\arg\min_{a_i^{(s)}\in\mathbb{R}^s}\bigg\{\sum_{j,k}\Big(\log\big[1+\exp\big\{\big(a_i^{(s)}\big)^TR_{k,l}^{(s)}b_{j,l}^{(s)}\big\}\big]-Y_{ijk}\big(a_i^{(s)}\big)^TR_{k,l}^{(s)}b_{j,l}^{(s)}\Big)+\rho\,\big(a_i^{(s)}-b_{i,l}^{(s)}\big)^Tv_{i,l}^{(s)}+\frac{\rho}{2}\big\|a_i^{(s)}-b_{i,l}^{(s)}\big\|_2^2\bigg\}.$$

Hence, each $a_{i,l+1}^{(s)}$ can be computed by solving a ridge-type logistic regression with responses $\{Y_{ijk}\}_{j,k}$ and covariates $\{R_{k,l}^{(s)}b_{j,l}^{(s)}\}_{j,k}$.

In Equation 12, each $R_{k,l+1}^{(s)}$ can be independently updated by solving a logistic regression with responses $\{Y_{ijk}\}_{i,j}$ and covariates $b_{j,l}^{(s)}\otimes a_{i,l+1}^{(s)}$, i.e.,

$$\mathrm{vec}\big(R_{k,l+1}^{(s)}\big)=\arg\min_{r_k^{(s)}\in\mathbb{R}^{s^2}}\sum_{ij}\Big\{\log\Big(1+\exp\big[\big\{\big(b_{j,l}^{(s)}\big)^T\otimes\big(a_{i,l+1}^{(s)}\big)^T\big\}r_k^{(s)}\big]\Big)-Y_{ijk}\big\{\big(b_{j,l}^{(s)}\big)^T\otimes\big(a_{i,l+1}^{(s)}\big)^T\big\}r_k^{(s)}\Big\},$$

where ⊗ denotes the Kronecker product.

Similar to Equation 11, each $b_{i,l+1}^{(s)}$ in Equation 13 can be independently computed by solving a ridge-type logistic regression with responses $\{Y_{ijk}\}_{j,k}$ and covariates $\{(R_{k,l+1}^{(s)})^Ta_{j,l+1}^{(s)}\}_{j,k}$.

Using arguments similar to those in Theorem 2 of Wang et al. (2017), we can show that the proposed 3-block ADMM algorithm converges for any sufficiently large $\rho$. In our implementation, we set $\rho=nK/2$. To reduce the risk of being trapped in poor local solutions, we randomly generate multiple initial estimators and solve the optimization problem multiple times based on these initial values.
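The authors' implementation is in R with C/GSL subroutines and solves each block exactly as a (ridge-type) logistic regression. The sketch below is a simplified Python/NumPy rendering of the same 3-block scheme in which each subproblem is instead approximated by a few gradient steps; the step size, iteration counts and initialization are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rescal_admm(Y, s, rho=None, n_iter=50, inner_steps=20, lr=1e-3, seed=0):
    """Rough sketch of the 3-block ADMM of Section 4.1 (not the authors' code).

    The first s rows of A and B are pinned to I_s to impose the identifiability
    constraint of Equation 4; rho defaults to nK/2 as in the paper."""
    n, _, K = Y.shape
    if rho is None:
        rho = n * K / 2.0
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.1, size=(n, s))
    B = rng.normal(scale=0.1, size=(n, s))
    A[:s] = np.eye(s)
    B[:s] = np.eye(s)
    R = rng.normal(scale=0.1, size=(K, s, s))
    V = np.zeros((n, s))                       # dual variables (rows 0..s-1 unused)

    def residual(A, R, B):
        theta = np.einsum("is,kst,jt->ijk", A, R, B)
        return sigmoid(theta) - Y              # derivative of -loglik w.r.t. theta

    for _ in range(n_iter):
        # Equation 11: update a_i, i > s (gradient steps on the augmented Lagrangian)
        for _ in range(inner_steps):
            E = residual(A, R, B)
            grad_A = np.einsum("ijk,kst,jt->is", E, R, B) + rho * (A - B + V)
            A[s:] -= lr * grad_A[s:]
        # Equation 12: update R_k (logistic-regression-type gradient steps)
        for _ in range(inner_steps):
            E = residual(A, R, B)
            grad_R = np.einsum("ijk,is,jt->kst", E, A, B)
            R -= lr * grad_R
        # Equation 13: update b_j, j > s
        for _ in range(inner_steps):
            E = residual(A, R, B)
            grad_B = np.einsum("ijk,is,kst->jt", E, A, R) - rho * (A - B + V)
            B[s:] -= lr * grad_B[s:]
        # dual ascent
        V[s:] += A[s:] - B[s:]
    return A, R
```

In practice one would restart this routine from several random initializations, as described above, and keep the solution with the largest log-likelihood.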

4.2. Simulations

We simulate {Yijk}ijk from the following model:

$$\Pr\big(Y_{ijk}=1\mid\{a_i\}_i,\{R_k\}_k\big)=\frac{\exp(a_i^TR_ka_j)}{1+\exp(a_i^TR_ka_j)},$$
$$a_1,a_2,\dots,a_n\overset{\text{i.i.d.}}{\sim}N(0,\mathrm{I}_{s_0}),$$
$$R_1=R_2=\cdots=R_K=\mathrm{diag}(\underbrace{1,-1,1,-1,\dots,1,-1}_{s_0}),$$

where $N(0,\mathrm{I}_{s_0})$ denotes the $s_0$-dimensional standard normal distribution and $\mathrm{diag}(v_1,\dots,v_q)$ denotes a $q\times q$ diagonal matrix whose $j$th diagonal element equals $v_j$.
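A sketch of this data-generating mechanism (NumPy; the alternating-sign diagonal is our reading of the display above, and the seed and helper name are ours):

```python
import numpy as np

def simulate_rescal(n, K, s0, seed=0):
    """Generate a binary tensor from the simulation design above:
    a_i iid N(0, I_{s0}) and R_1 = ... = R_K = diag(1, -1, 1, -1, ...)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, s0))
    Rk = np.diag([(-1.0) ** j for j in range(s0)])      # 1, -1, 1, -1, ...
    R = np.repeat(Rk[None], K, axis=0)
    theta = np.einsum("is,st,jt->ij", A, Rk, A)          # same R_k for every k
    prob = 1.0 / (1.0 + np.exp(-theta))
    Y = (rng.random(size=(n, n, K)) < prob[:, :, None]).astype(int)
    return Y, A, R
```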

We consider six simulation settings. In the first three settings, we fix K = 3 and set n = 100,150 and 200, respectively. In the last three settings, we increase K to 10, 20, 50, and set n = 50. In each setting, we further consider three scenarios, by setting s0 = 2, 4 and 8. Let smax = 12. The ADMM algorithm proposed in Section 4.1 is implemented in R. Some subroutines of the algorithm are written in C with the GNU Scientific Library (GSL, Galassi et al., 2015) to facilitate the computation. We compare the proposed ICα (see Equation 8) with the BIC-type criterion (see Equation 9). In ICα, we set α = 0, 0.5 and 1. Note that when α = 0, we have

$$\tau_0(n,K)=\frac{(n+K)^0}{\{\max(n,K)\}^0}=1.$$

Reported in Tables 1 and 2 are the percentage of selecting the true model (TP) and the average of $\hat s$ selected by $\mathrm{IC}_0$, $\mathrm{IC}_{0.5}$, $\mathrm{IC}_1$ and BIC over 100 replications.

Table 1:

Simulation results for Setting I, II and III (standard errors in parenthesis)

s0 = 2 s0 = 4 s0 = 8
n = 100, K = 3 TP s^ TP s^ TP s^
IC0 0.97 (0.02) 2.03 (0.02) 0.97 (0.02) 4.03 (0.02) 0.90(0.03) 7.90 (0.03)
IC0.5 0.97 (0.02) 2.03 (0.02) 0.98 (0.01) 4.02 (0.01) 0.90(0.03) 7.90 (0.03)
IC1 0.97 (0.02) 2.03 (0.02) 0.98 (0.01) 4.02 (0.01) 0.89(0.03) 7.89 (0.03)
BIC 0.00 (0.00) 11.99 (0.01) 0.00 (0.00) 12.00 (0.00) 0.00 (0.00) 11.99 (0.01)

n = 150, K = 3 TP s^ TP s^ TP s^
IC0 0.99 (0.01) 2.01 (0.01) 0.97 (0.02) 4.03 (0.02) 0.96(0.02) 8.04 (0.02)
IC0.5 0.99 (0.01) 2.01 (0.01) 0.97 (0.02) 4.03 (0.02) 0.96(0.02) 8.04 (0.02)
IC1 0.99 (0.01) 2.01 (0.01) 0.97 (0.02) 4.03 (0.02) 0.96(0.02) 8.04 (0.02)
BIC 0.00 (0.00) 12.00 (0.00) 0.00 (0.00) 12.00 (0.00) 0.00 (0.00) 11.98 (0.01)

n = 200, K = 3 TP s^ TP s^ TP s^
IC0 0.99 (0.01) 2.01 (0.01) 0.95 (0.02) 4.05 (0.02) 0.95(0.02) 8.05 (0.02)
IC0.5 0.99 (0.01) 2.01 (0.01) 0.95 (0.02) 4.05 (0.02) 0.95(0.02) 8.05 (0.02)
IC1 0.99 (0.01) 2.01 (0.01) 0.95 (0.02) 4.05 (0.02) 0.95(0.02) 8.05 (0.02)
BIC 0.00 (0.00) 12.00 (0.00) 0.00 (0.00) 11.99 (0.01) 0.00 (0.00) 11.98 (0.01)

Table 2:

Simulation results for Setting IV, V and VI (standard errors in parenthesis)

s0 = 2 s0 = 4 s0 = 8
n = 50, K = 10 TP s^ TP s^ TP s^
IC0 1.00 (0.00) 2.00 (0.00) 0.97 (0.02) 4.03 (0.02) 0.69(0.05) 7.91 (0.06)
IC0.5 1.00 (0.00) 2.00 (0.00) 0.97 (0.02) 4.03 (0.02) 0.66(0.05) 7.75 (0.06)
IC1 1.00 (0.00) 2.00 (0.00) 0.98 (0.01) 4.02 (0.01) 0.60(0.05) 7.62 (0.06)
BIC 0.00 (0.00) 11.81 (0.06) 0.00 (0.00) 11.60 (0.06) 0.01 (0.01) 11.67 (0.07)

n = 50, K = 20 TP s^ TP s^ TP s^
IC0 0.97 (0.02) 2.03 (0.02) 0.95 (0.02) 4.05 (0.02) 0.73(0.04) 8.46 (0.10)
IC0.5 0.97 (0.02) 2.03 (0.02) 0.98 (0.01) 4.02 (0.01) 0.87(0.03) 8.09 (0.03)
IC1 0.98 (0.01) 2.02 (0.02) 1.00 (0.00) 4.00 (0.00) 0.79(0.04) 7.99 (0.05)
BIC 0.00 (0.00) 12.00 (0.00) 0.00 (0.00) 11.92 (0.03) 0.00 (0.00) 11.99 (0.01)

n = 50, K = 50 TP s^ TP s^ TP s^
IC0 0.98 (0.01) 2.02 (0.01) 0.93 (0.03) 4.07 (0.03) 0.17(0.04) 11.24 (0.15)
IC0.5 0.99 (0.01) 2.01 (0.01) 0.97 (0.02) 4.03 (0.02) 0.76(0.04) 8.24 (0.05)
IC1 1.00 (0.00) 2.00 (0.00) 0.98 (0.01) 4.02 (0.01) 0.79(0.04) 7.99 (0.05)
BIC 0.00 (0.00) 12.00 (0.00) 0.00 (0.00) 12.00 (0.00) 0.00 (0.00) 11.99 (0.01)

It can be seen from Tables 1 and 2 that BIC fails in all settings: it always selects overfitted models. In contrast, the proposed information criteria are consistent in most settings. For example, in the settings where $s_0=2$ or 4, the TPs of $\mathrm{IC}_0$, $\mathrm{IC}_{0.5}$ and $\mathrm{IC}_1$ are larger than or equal to 93%. When $s_0=8$, except for the last setting, the TPs of the proposed information criteria are no less than 60% in all cases.

$\mathrm{IC}_0$, $\mathrm{IC}_{0.5}$ and $\mathrm{IC}_1$ perform very similarly for small $K$. In the first three settings, the TPs of these three information criteria are nearly the same in all cases. However, $\mathrm{IC}_{0.5}$ and $\mathrm{IC}_1$ are more robust than $\mathrm{IC}_0$ for large $K$. This can be seen in the last scenario of Setting VI, where the TP of $\mathrm{IC}_0$ is no more than 20%. Moreover, in the last two settings, the TP of $\mathrm{IC}_0$ is smaller than those of $\mathrm{IC}_{0.5}$ and $\mathrm{IC}_1$ in all cases. These differences are due to the finite sample correction term $\tau_\alpha(n,K)$. As commented before, $\tau_{0.5}(n,K)$ and $\tau_1(n,K)$ increase the model complexity penalty in $\mathrm{IC}_{0.5}$ and $\mathrm{IC}_1$ to avoid overfitting for large $K$.

In Section D of the Appendix, we examine the performance of the proposed information criteria under the scenario where $a_1,a_2,\dots,a_n\overset{\text{i.i.d.}}{\sim}N\big(0,\{0.5^{|i-j|}\}_{i,j=1,\dots,s_0}\big)$. The results are similar to those presented in Tables 1 and 2.

4.3. Real Data Experiments

In this section, we apply the proposed information criteria to the “Social Evolution” dataset (Madan et al., 2012). This dataset comes from MIT’s Human Dynamics Laboratory. It tracks the everyday life of a whole undergraduate MIT dormitory from October 2008 to May 2009. We use the survey data, which yield n = 84 participants and K = 5 binary relations. The five relations are: close relationship, political discussion, social interaction and two types of social media interaction.

We compute $\{\hat a_i^{(s)}\}_i$ and $\{\hat R_k^{(s)}\}_k$ for $s\in\{1,\dots,12\}$ and select the number of latent factors using the proposed information criteria and BIC. It turns out that $\mathrm{IC}_0$, $\mathrm{IC}_{0.5}$ and $\mathrm{IC}_1$ all suggest the presence of 9 latent factors. In contrast, BIC selects 12 factors. To further evaluate the number of latent factors selected by the proposed information criteria, we consider the following cross-validation procedure. For any $s\in\{1,\dots,12\}$, we randomly select 80% of the observations and estimate $\{\hat a_i^{(s)}\}_i$ and $\{\hat R_k^{(s)}\}_k$ by maximizing the observed likelihood function based on these training samples. Then we compute

$$\hat\pi_{ijk}=\frac{\exp\big\{\big(\hat a_i^{(s)}\big)^T\hat R_k^{(s)}\hat a_j^{(s)}\big\}}{1+\exp\big\{\big(\hat a_i^{(s)}\big)^T\hat R_k^{(s)}\hat a_j^{(s)}\big\}}.$$

Based on these predicted probabilities, we calculate the area under the precision-recall curve (AUC) on the remaining 20% of the observations, which serve as the testing samples.
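This cross-validation check can be sketched as follows (NumPy). Here `fit_rescal_obs` is a placeholder for an observed-likelihood solver (cf. Section 5), and the precision-recall AUC is computed as average precision, which is one standard way of evaluating the area under the precision-recall curve; both are our assumptions rather than the authors' exact routines.

```python
import numpy as np

def pr_auc(y_true, scores):
    """Area under the precision-recall curve via average precision."""
    order = np.argsort(-scores)
    y = y_true[order]
    tp = np.cumsum(y)
    precision = tp / np.arange(1, len(y) + 1)
    return float(np.sum(precision * y) / max(tp[-1], 1))

def holdout_auc(Y, fit_rescal_obs, s, train_frac=0.8, seed=0):
    """Hold out 20% of the (i, j, k) cells, fit on the rest and score them."""
    rng = np.random.default_rng(seed)
    mask = rng.random(size=Y.shape) < train_frac       # True = training cell
    A_hat, R_hat = fit_rescal_obs(Y, s, mask)
    theta = np.einsum("is,kst,jt->ijk", A_hat, R_hat, A_hat)
    prob = 1.0 / (1.0 + np.exp(-theta))
    return pr_auc(Y[~mask].ravel(), prob[~mask].ravel())
```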

Reported in Table 3 are the AUC scores averaged over 100 replications. For any $s\in\{1,\dots,12\}$, we denote by $\mathrm{AUC}_s$ the corresponding AUC score. It can be seen from Table 3 that $\mathrm{AUC}_s$ first increases and then decreases as $s$ increases. The maximum AUC score is achieved at $s=10$. Observe that $\mathrm{AUC}_9$ is very close to $\mathrm{AUC}_{10}$ and larger than all the remaining AUC scores. This demonstrates that the proposed information criteria select fewer latent factors while achieving better or similar link prediction results compared to BIC.

Table 3:

AUC scores

s 1 2 3 4 5 6 7 8 9 10 11 12
AUC 0.7201 0.8341 0.8952 0.9095 0.9257 0.9364 0.9444 0.9486 0.9513 0.9518 0.9485 0.9467

5. Discussion

In this paper, we propose information criteria for selecting the number of latent factors in the RESCAL tensor factorization model and prove their model selection consistency. Although we focus on the logistic RESCAL model, the proposed information criteria can be applied to general tensor factorization models. More specifically, consider the following class of models:

$$Y_{ijk}=g\big(a_{i,0}^TR_{k,0}\,b_{j,0}\big)+e_{ijk},\quad\forall\,i,j\in\{1,\dots,n\},\,k\in\{1,\dots,K\}, \qquad (14)$$

with any of (or without) the following constraints:

(C1) Rk,0 is diagonal;

(C2) ai,0=bi,0 for i{1,,n},

for some strictly increasing function $g$, $a_{i,0},b_{i,0}\in\mathbb{R}^{s_0}$, $R_{k,0}\in\mathbb{R}^{s_0\times s_0}$ and some mean-zero random errors $\{e_{ijk}\}_{ijk}$.

As commented in Section 2.1, this representation includes the RESCAL, CP and TUCKER-2 models. Specifically, it reduces to the TUCKER-2 model when $g$ is the identity function. If (C1) also holds, the model in Equation 14 reduces to the CP model. When (C2) holds, it corresponds to the RESCAL model. Consider the following information criteria,

$$\mathrm{IC}(s)=\ell_n\big(\mathcal{Y};\{\hat a_i\}_i,\{\hat R_k\}_k,\{\hat b_i\}_i\big)-s\,\kappa(n,K),$$

where $\ell_n$ stands for the likelihood function and $\hat a_i$, $\hat R_k$, $\hat b_i$ are the corresponding (constrained) MLEs. Similar to Theorem 1, we can show that, with a properly chosen $\kappa(n,K)$, IC is consistent in this general setting.

Currently, we assume the tensor $\mathcal{Y}$ is completely observed. When some of the $Y_{ijk}$'s are missing, we can compute the $\hat a_i^{(s)}$'s and $\hat R_k^{(s)}$'s by maximizing the following observed likelihood function:

$$\arg\max_{\substack{a_1^{(s)},\dots,a_n^{(s)}\in\Theta_a^{(s)}\\ \mathrm{vec}(R_1^{(s)}),\dots,\mathrm{vec}(R_K^{(s)})\in\Theta_r^{(s)}}}\;\sum_{(i,j,k)\in N_{\mathrm{obs}}}\Big(Y_{ijk}\big(a_i^{(s)}\big)^TR_k^{(s)}a_j^{(s)}-\log\big[1+\exp\big\{\big(a_i^{(s)}\big)^TR_k^{(s)}a_j^{(s)}\big\}\big]\Big),$$
$$\text{subject to }[a_1^{(s)},\dots,a_s^{(s)}]=I_s,$$

where $N_{\mathrm{obs}}$ denotes the set of observed responses. The above optimization problem can also be solved by a 3-block ADMM algorithm. Define the following class of information criteria,

$$\mathrm{IC}_{\mathrm{obs}}(s)=\sum_{(i,j,k)\in N_{\mathrm{obs}}}\Big(Y_{ijk}\big(\hat a_i^{(s)}\big)^T\hat R_k^{(s)}\hat a_j^{(s)}-\log\big[1+\exp\big\{\big(\hat a_i^{(s)}\big)^T\hat R_k^{(s)}\hat a_j^{(s)}\big\}\big]\Big)-\hat p\,s\,\kappa(n,K),$$

where $\hat p=|N_{\mathrm{obs}}|/(n^2K)$ denotes the proportion of observed responses. The consistency of $\mathrm{IC}_{\mathrm{obs}}$ can be studied similarly.
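A minimal sketch of $\mathrm{IC}_{\mathrm{obs}}$ for a partially observed tensor (NumPy; `mask` marks the observed cells, and `kappa` is a placeholder penalty, both our notation rather than the paper's):

```python
import numpy as np

def ic_obs(Y, mask, A_hat, R_hat, s, kappa):
    """Observed-cell version of the information criterion defined above."""
    n, _, K = Y.shape
    theta = np.einsum("is,kst,jt->ijk", A_hat, R_hat, A_hat)
    loglik_obs = np.sum((Y * theta - np.logaddexp(0.0, theta))[mask])
    p_hat = mask.sum() / (n * n * K)         # proportion of observed responses
    return float(loglik_obs - p_hat * s * kappa(n, K))
```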

Acknowledgments

The authors wish to thank the Associate Editor and anonymous referees for their constructive comments, which led to significant improvements of this work.

Appendix A. More on the Identifiability Constraint

Let $\pi(\cdot)$ be a permutation of $\{1,\dots,n\}$. As an alternative to our estimator defined in Equation 3, one may consider

$$(\{\hat a_{i,\pi}^{(s)}\}_i,\{\hat R_{k,\pi}^{(s)}\}_k)=\arg\max_{\substack{a_1^{(s)},\dots,a_n^{(s)}\in\Theta_a^{(s)}\\ \mathrm{vec}(R_1^{(s)}),\dots,\mathrm{vec}(R_K^{(s)})\in\Theta_r^{(s)}}}\ell_n\big(\mathcal{Y};\{a_i^{(s)}\}_i,\{R_k^{(s)}\}_k\big),$$
$$\text{subject to }[a_{\pi(1)}^{(s)},\dots,a_{\pi(s)}^{(s)}]=I_s,$$

and the corresponding information criteria

$$\mathrm{IC}_\pi(s)=2\ell_n\big(\mathcal{Y};\{\hat a_{i,\pi}^{(s)}\}_i,\{\hat R_{k,\pi}^{(s)}\}_k\big)-s\,\kappa(n,K).$$

Since $\ell_n(\mathcal{Y};\{a_i^{(s)}\}_i,\{R_k^{(s)}\}_k)=\ell_n(\mathcal{Y};\{C^{-1}a_i^{(s)}\}_i,\{C^TR_k^{(s)}C\}_k)$ for any invertible matrix $C\in\mathbb{R}^{s\times s}$, $(\{\hat a_{i,\pi}^{(s)}\}_i,\{\hat R_{k,\pi}^{(s)}\}_k)$ is also a maximizer of $\ell_n(\mathcal{Y};\{a_i^{(s)}\}_i,\{R_k^{(s)}\}_k)$ subject to the constraint that $[a_{\pi(1)}^{(s)},\dots,a_{\pi(s)}^{(s)}]$ is invertible. Similarly, the estimator $(\{\hat a_i^{(s)}\}_i,\{\hat R_k^{(s)}\}_k)$ is a maximizer of $\ell_n(\mathcal{Y};\{a_i^{(s)}\}_i,\{R_k^{(s)}\}_k)$ subject to the constraint that $[a_1^{(s)},\dots,a_s^{(s)}]$ is invertible. As a result, we have $\mathrm{IC}(s)=\mathrm{IC}_\pi(s)$ as long as

$$[\hat a_{\pi(1)}^{(s)},\dots,\hat a_{\pi(s)}^{(s)}]\text{ is invertible and }[\hat a_{1,\pi}^{(s)},\dots,\hat a_{s,\pi}^{(s)}]\text{ is invertible.} \qquad (15)$$

However, it remains unknown whether Equation 15 holds. Hence, there is no guarantee that $\mathrm{IC}(s)=\mathrm{IC}_\pi(s)$. This means the choice of the identifiability constraint might affect the value of our proposed information criterion.

In the following, we prove

$$\Pr\Bigg(\bigcap_{\substack{\pi:\{1,\dots,n\}\to\{1,\dots,n\}\\|\{\pi(1),\dots,\pi(n)\}|=n}}\Big\{\arg\max_{s=1,\dots,s_{\max}}\mathrm{IC}_\pi(s)=s_0\Big\}\Bigg)\to1. \qquad (16)$$

This means that, with probability tending to 1, the information criteria under all the different identifiability constraints select the true model. Therefore, the choice of the identifiability constraint does not affect the performance of our method. For any $s\in\{s_0,\dots,s_{\max}\}$, let $A_{s,0,\pi}=[a_{\pi(1),0}^{(s)},\dots,a_{\pi(s),0}^{(s)}]^T$,

$$a_{i,\pi}^{(s)*}=(A_{s,0,\pi}^{-1})^Ta_{i,0}^{(s)}\quad\text{and}\quad R_{k,\pi}^{(s)*}=A_{s,0,\pi}\,R_{k,0}^{(s)}\,A_{s,0,\pi}^T.$$

We need the following condition.

(A4) Assume $a_{i,\pi}^{(s)*}\in\Theta_a^{(s)}$ and $\mathrm{vec}(R_{k,\pi}^{(s)*})\in\Theta_r^{(s)}$ for $i=1,\dots,n$, $k=1,\dots,K$, $s_0\le s\le s_{\max}$ and any permutation function $\pi(\cdot)$. In addition, assume $\sup_{x\in\Theta_a^{(s)}}\|x\|_2\le\omega_a$ and $\sup_{y\in\Theta_r^{(s)}}\|y\|_2\le\omega_r$ for some $\omega_a,\omega_r>0$.

Corollary 1.

Assume (A2)-(A4) hold, $K\sim n^{l_0}$ for some $0\le l_0\le1$, $s_{\max}=o\{n/\log(nK)\}$ and $n^{(1-l_0)/2}\bar K\exp(-2\omega_a^2\omega_r)/\log n\to\infty$. Assume $\kappa(n,K)$ satisfies Condition (6). Then, (16) is satisfied.

Appendix B. More on the Technical Conditions

B.1. Discussion of Condition (A0)

In this section, we show the necessity of (A0) when the matrices $R_{1,0},\dots,R_{K,0}$ are symmetric. More specifically, when (A0) does not hold, we show there exist some $0\le s<s_0$, $\bar a_{1,0},\dots,\bar a_{n,0}\in\mathbb{R}^{s}$ and $\bar R_{1,0},\dots,\bar R_{K,0}\in\mathbb{R}^{s\times s}$ such that

$$\bar a_{i,0}^T\bar R_{k,0}\,\bar a_{j,0}=a_{i,0}^TR_{k,0}\,a_{j,0},\quad\forall\,1\le i,j\le n,\;1\le k\le K. \qquad (17)$$

Let us first consider the case where $\mathrm{rank}(A_0)=s$ for some $s<s_0$. Then it follows that

$$A_0=\bar A_0C,$$

for some $\bar A_0\in\mathbb{R}^{n\times s}$ and $C\in\mathbb{R}^{s\times s_0}$. Setting $\bar a_{i,0}$ to be the $i$th row of $\bar A_0$ and $\bar R_{k,0}=CR_{k,0}C^T$, Equation 17 is satisfied. In addition, the new matrix $\bar A_0$ can be chosen to have full column rank.

Let $R_0=(R_{1,0}^T,\dots,R_{K,0}^T)^T$. Consider the case where $\mathrm{rank}(R_0)=s$ for some $s<s_0$. It follows from the singular value decomposition that

$$R_0=U_0\Lambda_0V_0^T, \qquad (18)$$

for some diagonal matrix $\Lambda_0\in\mathbb{R}^{s\times s}$ and some matrices $U_0\in\mathbb{R}^{Ks_0\times s}$, $V_0\in\mathbb{R}^{s_0\times s}$ satisfying $U_0^TU_0=V_0^TV_0=I_s$. Denote by $U_{k,0}$ the submatrix of $U_0$ formed by rows $\{(k-1)s_0+1,(k-1)s_0+2,\dots,ks_0\}$ and columns $\{1,2,\dots,s\}$. It follows from Equation 18 that

$$R_{k,0}=U_{k,0}\Lambda_0V_0^T,\quad 1\le k\le K.$$

Since $R_{k,0}$ is symmetric, we have $U_{k,0}\Lambda_0V_0^T=V_0\Lambda_0U_{k,0}^T=R_{k,0}$. Notice that $V_0^TV_0=I_s$. It follows that $U_{k,0}\Lambda_0=V_0\Lambda_0U_{k,0}^TV_0$. Therefore, we have

$$R_{k,0}=V_0\Lambda_0U_{k,0}^TV_0V_0^T. \qquad (19)$$

Define $\bar a_{i,0}=V_0^Ta_{i,0}$, $1\le i\le n$, and $\bar R_{k,0}=\Lambda_0U_{k,0}^TV_0$. In view of Equation 19, it is immediate that Equation 17 holds. Since $V_0^TV_0=I_s$, we have $\bar R_{k,0}=V_0^TV_0\Lambda_0U_{k,0}^TV_0=V_0^TR_{k,0}V_0$. As a result, $\bar R_{1,0},\dots,\bar R_{K,0}$ are also symmetric. Suppose $\bar R_0=(\bar R_{1,0}^T,\dots,\bar R_{K,0}^T)^T$ does not have full column rank. Using the same arguments, we can find some $s'<s$, $\tilde a_{1,0},\dots,\tilde a_{n,0}\in\mathbb{R}^{s'}$ and $\tilde R_{1,0},\dots,\tilde R_{K,0}\in\mathbb{R}^{s'\times s'}$ such that

$$\tilde a_{i,0}^T\tilde R_{k,0}\,\tilde a_{j,0}=a_{i,0}^TR_{k,0}\,a_{j,0},\quad\forall\,1\le i,j\le n,\;1\le k\le K. \qquad (20)$$

We may repeat this procedure until we find some $\tilde a_{i,0}$'s and $\tilde R_{k,0}$'s that satisfy Equation 20 and such that the matrix $(\tilde R_{1,0}^T,\dots,\tilde R_{K,0}^T)^T$ has full column rank.

B.2. Discussion of Condition (A1)

In this section, we consider an asymptotic framework where $a_{1,0},\dots,a_{n,0}$ are i.i.d. according to some distribution function and show that (A1) holds with probability tending to 1. For any $q$-dimensional vector $v$, let $\|v\|_2$ denote its Euclidean norm. For any $m\times q$ matrix $Q$, $\|Q\|_2$ stands for the spectral norm of $Q$ and $\|Q\|_F$ denotes its Frobenius norm. For simplicity, we assume $\Theta_a=\{x\in\mathbb{R}^{s_0}:\|x\|_2\le\omega_a\}$ and $\Theta_r=\{y\in\mathbb{R}^{s_0^2}:\|y\|_2\le\omega_r\}$. Assume $\max_{k=1,\dots,K}\|R_{k,0}\|_F=O(1)$ and that $\|a_{1,0}\|_2$ is bounded with probability 1. In addition, assume there exist some constants $c_0,t_0>0$ such that

$$\Pr\Big(\big\|[a_{1,0},\dots,a_{s_0,0}]^{-1}\big\|_F>t\Big)\le c_0t^{-1},\qquad\forall\,t\ge t_0. \qquad (21)$$

When s0 = 1, the above condition is closely related to the margin assumption (Tsybakov, 2004; Audibert and Tsybakov, 2007) in the classification literature. It automatically holds when a1,0 has a bounded probability density function.

In the following, we show with proper choice of ωa and ωr, (A1) holds with probability tending to 1. By the identifiability constraints, ai(s)*2=1,i=1,,s and s=s0,,smax. When ωa,ωr, we have for sufficiently large n,

Pr(ai(s)*Ωa)=1,1is,s0ssmax. (22)

By the definition of As,0, we have

$$A_{s,0}^{-1}=\begin{pmatrix}A_{s_0,0}^{-1} & O_{s_0,\,s-s_0}\\ -A_{(s_0+1):s,0}\,A_{s_0,0}^{-1} & I_{s-s_0}\end{pmatrix}, \qquad (23)$$

where $A_{(s_0+1):s,0}=[a_{s_0+1,0},\dots,a_{s,0}]^T$. It follows that

ai(s)*=(As0,01ai,0(0,,0is01,1,0,,0si)TA(s0+1):s,0As0,01ai,0),s<in,s0ssmax.

Under the given conditions, we have ai,02ω0 with probability 1 for some constant ω0 > 0. Therefore, we have

maxs0sssmaxs<inai(s)*2ω0As0,012+1+smaxω02As0,0121+(1+smaxω0)ω0As0,01F.

By (21), it is immediate to see that

Pr(maxs0ssmax1inai(s)*2>ωa)0,

as long as ωasmax. This together with Equation 22 yields

Pr(maxs0ssmax1inai(s)*2>ωa)0.

Combining Equation 23 with the definition of Rk(s)* yields

Rk(s)*=(As0,01Rk,0(As0,01)TAs0,01Rk,0(As0,01)TA(s0+1):s,0TA(s0+1):s,0As0,01Rk,0(As0,0)TA(s0+1):s,0As0,01Rk,0(As0,01)TA(s0+1);s,0T)

Since max1kKRk,0F=O(1),max1inai,02=O(1) we have

max1kKAs0,01Rk,0(As0,01)TFmax1kKAs0,01F2Rk,0F=O(As0,01F2),

and

max1kKs0ssmaxA(s0+1):s,0As0,01Rk,0(As0,01)TFmax1kKs0ssinnxAs0,01F2A(s0+1):s,0FRk,0F=O(smaxAs0,01F2),max1kKs0ssmaxA(s0+1):s,0As0,01Rk,0(As0,01)TA(s0+1):s,0TFmax1kKs0ssmaxAs0,01F2A(s0+1);s,0F2Rk,0F=O(smaxAs0,01F2).

By (21), we have max1kK,s0ssmaxRk(s)*Fωr with probability tending to 1, for any ωr such that ωr/smax.

B.3. Discussion of Condition (A2)

Assume the matrix Ea1,0a1,0T is positive definite. Since s0 is fixed, it follows from the law of large numbers that

$$\frac{1}{n}\sum_{i=1}^na_{i,0}\,a_{i,0}^T\overset{P}{\longrightarrow}\mathbb{E}\,a_{1,0}a_{1,0}^T.$$

Therefore, (A2) holds with probability tending to 1.

Appendix C. Proofs

In the following, we provide the proofs of Lemmas 1-3, Theorem 1 and Corollary 1. We define

$$l_0\big(\{a_i\}_i,\{R_k\}_k\big)=\mathbb{E}\,\ell_n\big(\mathcal{Y};\{a_i\}_i,\{R_k\}_k\big),$$

for any $a_1,\dots,a_n\in\mathbb{R}^s$, $R_1,\dots,R_K\in\mathbb{R}^{s\times s}$ and any integer $s\ge1$. Define $\theta_{ijk}^*=(a_i^*)^TR_k^*a_j^*$ and $\hat\theta_{ijk}^{(s)}=(\hat a_i^{(s)})^T\hat R_k^{(s)}\hat a_j^{(s)}$.

C.1. Proof of Lemma 1

Assume there exist some {ai}i,{Rk}k such that

g(ai,0TRk,0aj,0)=g(aiTRkaj),i,j,k.

Since g(·) is strictly monotone, we have

ai,0TRk,0aj,0=aiTRkaj,i,j,k,

or equivalently,

(AR1ARK)AT=(A0R1,0A0RK,0)A0T.

where A=[a1,a2,,an]T. Thus, it follows that

(A0TAR1A0TARK)AT=(A0TA0R1,0A0TA0RK,0)A0T.

By (A0), the matrix A0TA0 is invertible. As a result, we have

((A0TA0)1A0TAR1(A0TA0)1A0TARK)AT=(R1,0RK,0)A0T.

Therefore,

(k=1KRk,0T(A0TA0)1A0TAR1)AT=(k=1KRk,0TRk,0)A0T.

Notice that the matrix k=1KRk,0TRk,0 is invertible under Condition (A0). It follows that

(k=1KRk,0TRk,0)1(k=1KRk,0T(A0TA0)1A0TARk)CAT=A0T.

By Lemma 5.1 in Banerjee and Roy (2014), we have rank(C) ≥ rank(A0) = s0. Therefore, C is invertible. It follows that

A=A0(CT)1, (24)

or equivalently,

ai=C1ai,0,i=1,,n.

By Equation 24, we obtain A0(CT)1RkC1A0T=A0Rk,0A0T,k and hence

A0TA0(CT)1RkC1A0TA0=A0TA0Rk,0A0TA0,k=1,,K.

Since A0TA0 is invertible, this further implies (CT)1RkC1=Rk,0,k, or equivalently,

Rk=CTRk,0C,k=1,,K.

The proof is hence completed.

C.2. Proof of Lemma 2

To prove Lemma 2, we need the following lemma.

Lemma 4 (Mendelson et al. (2008), Lemma 2.3).

Given d ≥ 1, and ε > 0, we have

$$N\big(\varepsilon,B_2^d,\|\cdot\|_2\big)\le\Big(1+\frac{2}{\varepsilon}\Big)^d,$$

where $B_2^d$ is the unit ball in $\mathbb{R}^d$ and $N(\varepsilon,\cdot,\|\cdot\|_2)$ denotes the covering number with respect to the Euclidean metric (see Definition 2.2.3 in van der Vaart and Wellner (1996) for details).

Under Condition (A1), $(\{a_i^{(s)*}\}_i,\{R_k^{(s)*}\}_k)$ is feasible for the optimization problem in Equation 3, and hence

$$\ell_n\big(\mathcal{Y};\{a_i^{(s)*}\}_i,\{R_k^{(s)*}\}_k\big)\le\ell_n\big(\mathcal{Y};\{\hat a_i^{(s)}\}_i,\{\hat R_k^{(s)}\}_k\big). \qquad (25)$$

Besides, we have

maxia^i(s)2ωa,maxkvec(R^k(s))2ωr,s{s0,,smax}, (26)

and

maxiai*2ωa,maxkvec(Rk*)2ωr. (27)

Therefore,

maxi,j,k|(a^i(s))TR^k(s)a^j(s)|(maxia^i(s)22maxkR^k(s)2)(maxia^i(s)22maxkR^k(s)F)(maxia^i(s)22maxkvec(R^k(s))2)ωa2ωr,s{s0,,smax}. (28)

Similarly, we can show

maxi,j,k(ai*)TRk*aj*2ωa2ωr. (29)

We define θijk*=(ai*)TRk*aj*andθ^ijk(s)=(a^i(s))TR^k(s)a^j(s). It follows from a second-order Taylor expansion that

g(θijk*)(θijk*θ^ijk(s))log{1+exp(θijk*)}+log{1+exp(θ^ijk(s))}=exp(θ˜ijk(s))(θijk*θ^ijk(s))22{1+exp(θ˜ijk(s))}2, (30)

for some θ˜ijk(s) lying on the line segment joining θijk* and θ^ijk(s). By (28) and (29), we have for any i, j, k and s{s0,,smax},|θ˜ijk(s)|ωa2ωr. This together with Equation 30 gives that

g(θijk*)θijk*log{1+exp(θijk*)}g(θijk*)θ^ijk(s)+log{1+exp(θ^ijk(s))}(θijk*θ^ijk(s))2exp(ωa2ωr)2{1+exp(ωa2ωr)}2(θijk*θ^ijk(s))2exp(ωa2ωr)8exp(2ωa2ωr)=(θijk*θ^ijk(s))28exp(ωa2ωr),

and hence

l0({ai*}i,{Rk*}k)l0({a^i(s)}i,{R^k(s)}k)=ijk(g(θijk*)(θijk*θ^ijk(s))log{1+exp(θijk*)}{1+exp(θ^ijk(s))})ijk(θijk*θ^ijk(s))28exp(ωa2ωr). (31)

In the following, we provide an upper bound for

maxs{1,,smax}supaia,Rks×smaxiai2ωamaxkvec(Rk)2ωr|l0({ai}i,{Rk}k)ln(Y;{ai}i,{Rk}k)|maxs{1,,smax}supaia,Rks×smaxiai2ωamaxkvec(Rk)2ωr|ijk(Yijkπijk*)aiTRkaj|,

where πijk*=exp(θijk*)/{1+exp(θijk*)},i,j,k.

Let εa=ωa/(nK)2 and {a¯1(s),,a¯Ns,εa(s)} be a minimal εa-net of the vector space ({as:a2ωa},2). It follows from Lemma 4 that

Ns,εa=N(εa,{as:a2ωa},2)=N(1/(nK)2,B2s,2)(1+2n2K2)s(3nK)2s. (32)

Let εr=ωr/(nK)2, and {R¯1(s),,R¯Ns2,εr(s)} be a minimal εr-net of the vector space ({Rs×s:vec(R)2ω},F). For any s × s matrices Q, we have QF=vec(Q)2. Similar to (32), we can show that

Ns2,εr(3nK)2s2. (33)

Hence, for any ai,ajsandRks×s satisfying ai2,aj2ωa,vec(Rk)2ωr there exist some a¯lj(s),a¯lj(s),R¯tk(s), such that

aia¯li(s)2ωa(nK)2,aja¯lj(s)2ωa(nK)2,RkR¯tk(s)Fωr(nK)2. (34)

This further implies

|(ai)TRkaj(a¯li(s))TR¯tk(s)a¯lj(s)||(ai)TRkaj(a¯li(s))TRkaj|+|(a¯li(s))TRkaj(a¯li(s))TR¯tk(s)aj|+|(a¯li(s))TR¯tk(s)aj(a¯li(s))TR¯tk(s)a¯lj(s)|aia¯li(s)2Rk2aj2+a¯li(s)2RkR¯tk(s)2aj2+a¯i(s)2R¯tk(s)2aja¯lj(s)2aia¯li(s)2RkFaj2+a¯li(s)2RkR¯tk(s)Faj+a¯i(s)2R¯tk(s)Faja¯lj(s)22ωa(nK)2ωaωr+ωr(nK)2ωa2=3ωa2ωr(nK)2. (35)

Therefore, we have

maxs{1,,smax}1ssupais,Rks×smaxiai2ωamaxkvec(Rk)2ωr|ijk(Yijkπijk*)aiTRkaj|maxs{1,,smax}1smaxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}|ijk(Yijkπijk*)(a¯li(s))TR¯tk(s)a¯lj(s)|+ijk|Yijkπijk*|3ωa2ωrn2K2maxs{1,,smax}1smaxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}|ijk(Yijkπijk*)(a¯li(s))TR¯tk(s)a¯lj(s)|+3ωa2ωrK. (36)

By Bernstein’s inequality (van der Vaart and Wellner, 1996, Lemma 2.2.9), we obtain for any t > 0,

maxs{1,,smax}l1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}Pr(|ijk(Yijkπijk*)(a¯li(s))TR¯tk(s)a¯lj(s)|>t)2exp(t22σ02+2M0t/3), (37)

where

M0=maxs{1,,smax}maxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}maxijk|Yijkπijk*(a¯li(s))TR¯tk(s)a¯lj(s)|,
σ02=maxs{1,,smax}maxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}ijkE|Yijkπijk*|2|(a¯li(s))TR¯tk(s)a¯lj(s)|2.

With some calculations, we can show that

M0maxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}a¯li(s)2R¯tk(s)2a¯lj(s)2maxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}a¯li(s)2R¯tk(s)Fa¯lj(s)2maxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}a¯li(s)2vec(R¯tk(s))2a¯lj(s)2ωa2ωr, (38)

and

σ02maxs{1,,smax}maxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}ijkE|Yijkπijk*|2a¯li(s)22vec(R¯tk(s))22a¯lj(s)22ωa4ωr2ijkE|Yijkπijk*|2ωa4ωr2n2K. (39)

Let ts=5ωa2ωrsnKmax(n,K)log(nK), we have

maxs{1,,smax}Pr(maxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}|ijk(Yijkπijk*)(a¯li(s))TR¯tk(s)a¯lj(s)|>ts)
maxs{1,,smax}l1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}Ns,εanNs2,εrKPr(|ijk(Yijkπijk*)(a¯li(s))TR¯tk(s)a¯lj(s)|>ts)
maxs{1,,smax}l1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}(3nK)2sn+2s2KPr(|ijk(Yijkπijk*)(a¯li(s))TR¯tk(s)a¯lj(s)|>ts)
maxs{1,,smax}2exp(25ωa4ωr2s2n2Kmax(n,K)log(nK)2ωa4ωr2{n2K+snKmax(n,K)log(nK)/3}+(2sn+2s2K)log(3nK))=maxs{1,,smax}2exp(25s2max(n,K)log(nK)2+2smax(n,K)log(nK)/(3nK)+(2sn+2s2K)log(3nK)), (40)

where the first inequality is due to Bonferroni’s inequality, the second inequality follows by (32) and (33), the third inequality is due to (37)-(39).

Under the given conditions, we have ssmax=o{n/log(nK)} and hence

smax(n,K)log(nK)nKsnlog(nK)nK+sKlog(nK)nKsmaxlog(nK)n+smaxlog(nK)n1.

It follows that for any s{1,,smax},

Pr(maxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}|ijk(Yijkπijk(s))T(a¯lj(s))TR¯lk(s)a¯lj(s)|>5ωa2ωrsnKmax(n,K)log(nK))2exp(25s2max(n,K)log(nK)2+1+4s2max(n,K)log(3nK))2exp{4s2max(n,K)log(nK)}2exp{4s2nlog(nK)}.

By Bonferroni’s inequality, we have

Pr(s{1,,smax}{maxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}|ijk(Yijkπijk*)(a¯li(s))TR¯tk(s)a¯lj(s)|>ts})s=1smax2exp{4s2nlog(nK)}s=1+2exp{4snlog(nK)}2exp{4nlog(nK)}1exp{4nlog(nK)}0.

This together with (36) implies that

maxs{1,,smax}1ssupais,Rks×smaxiai2ωamaxkvec(Rk)2ωr|ijk(Yijkπijk*)aiTRkaj|6ωa2ωrnKmax(n,K)log(nK), (41)

with probability tending to 1. Combining this with (26) and (27), we obtain with probability tending to 1,

max(maxs|l0({a^i(s)}i,{R^k(s)}k)ln(Y;{a^i(s)}i,{R^k(s)}k)|/s, (42)
|l0({ai*}i,{Rk*}k)ln(Y;{ai*}i,{Rk*}k)|/s0)6ωa2ωrnKmax(n,K)log(nK).

Therefore, it follows from (31) that

maxs{s0,,smax}1sijk(θijk*θ^ijk(s))2 (43)
maxs{s0,,smax}8exp(ωa2ωr)s(l0({ai*}i,{Rk*}k)l0({a^i(s)}i,{R^k(s)}k))
maxs{s0,,smax}8exp(ωa2ωr)s|l0({a^i(s)}i,{R^k(s)}k)ln(Y;{a^i(s)}i,{R^k(s)}k)|
+maxs{s0,,smax}8exp(ωa2ωr)s|l0({ai*}i,{Rk*}k)ln(Y;{a^i*}i,{R^k*}k)|
+maxs{s0,,smax}8exp(ωa2ωr)s(ln(Y;{ai*}i,{Rk*}k)ln(Y;{a^i(s)}i,{R^k(s)}k))
maxs{s0,,smax}8exp(ωa2ωr)s|l0({a^i(s)}i,{R^k(s)}k)ln(Y;{a^i(s)}i,{R^k(s)}k)|
+maxs{s0,,smax}8exp(ωa2ωr)s|l0({ai*}i,{Rk*}k)ln(Y;{ai*}i,{Rk*}k)|
96ωa2ωrexp(ωa2ωr)nKmax(n,K)log(nK),

where the third inequality is due to that ln(Y;{a^i(s)}i,{R^k(s)}k)ln(Y;{ai*}i,{Rk*}k), for all s{s0,,smax}.

Let rn,K=(n+K)1/2(logn+logK)1/2. As n, we have

rn,K2nKmax(n,K)log(nK)rn,K2nK(n+K)log(nK)=nK(n+K)log(nK)nK.

Since ωa2ωrexp(ωa2ωr), it follows from (43) that

Pr(maxs{s0,,smax}rn,K2s2exp(2ωa2ωr)ijk(θijk*θ^ijk(s))2nK)0. (44)

For any integer m ≥ 1, define

Sm(s)={({ai(s)}i,{Rk(s)}k):m1<rn,Ksexp(ωa2ωr)ijk{θijk*(ai(s))TRk(s)aj(s)}2m,ai(s)s,i,Rk(s)s×s,k,maxiai(s)2ωa,maxkvec(Rk(s))2ωr}.

For any ({ai(s)}i,{Rk(s)}k)Sm(s), similar to (31), we can show that

l0({ai*}i,{Rk*}k)l0({ai(s)}i,{Rk(s)}k)ijk{θijk*(ai(s))TRk(s)aj(s)}28exp(ωa2ωr)(m1)2s2rn,K28exp(ωa2ωr). (45)

The event ({a^i(s)}i,{R^k(s)}k)Sm(s) implies that

sup({ai(s)}i,{Rk(s)}k)Sm(s)ln(Y;{ai(s)}i,{Rk(s)}k)ln(Y;{ai*}i,{Rk*}k).

It follows from (45) that

sup({ai(s)}i,{Rk(s)}k)Sm(s)|ijk(Yijkπijk*){θijk*(ai(s))TRk(s)aj(s)}|(m1)2s2exp(ωa2ωr)8rn,K2. (46)

For any {li}i and {tk}k satisfying (34), it follows from (35) that

ijk{θijk*(a¯li(s))TR¯tk(s)a¯lj(s)}2
ijk{θijk*(ai(s))TRk(s)aj(s)+(ai(s))TRk(s)aj(s)(a¯li(s))TR¯tk(s)a¯lj(s)}2
2ijk({θijk*(ai(s))TRk(s)aj(s)}2+{(ai(s))TRk(s)aj(s)(a¯li(s))TR¯tk(s)a¯lj(s)}2)2m2s2exp(2ωa2ωr)rn,K2+6ωa2ωrK3m2s2exp(2ωa2ωr)rn,K2. (47)

Let Λm(s)={({li}i,{tk}k):ijk{θijk*(a¯li(s))TR¯tk(s)a¯lj(s)}23m2s2exp(2ωa2ωr)/rn,K2}. Similar to (36), we can show

maxs{s0,,smax}1s2sup({ai(s)}i,{Rk(s)}k)Sm(s)|ijk(Yijkπijk*){θijk*(ai(s))TRk(s)aj(s)}|maxs{s0,,smax}1s2maxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}({1i}i,{tk}k)Λm(s)|ijk(Yijkπijk*){(a¯li(s))TR¯tk(s)a¯lj(s)θijk*}|+3ωa2ωrK.

Under the event defined in (46), for any m ≥ 9, we have

maxs{s0,,smax}1s2maxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}({1i}i,{tk}k)Λm(s)|ijk(Yijkπijk*){(a¯li(s))TR¯tk(s)a¯lj(s)θijk*}|
(m1)2exp(ωa2ωr)8rn,K23ωa2ωrK(m1)2exp(ωa2ωr)16rn,K2+4exp(ωa2ωr)rn,K23ωa2ωrK(m1)2exp(ωa2ωr)16rn,K2.

Define

(σm(s))2=maxl1,,ln{1,,Ns,εn,K}t1,,tK{1,,Ns2,εn,K}({1i}i,{tk}k)Λm(s)ijkE|Yijkπijk*|2{(a¯li(s))TR¯tk(s)a¯lj(s)θijk*}2.

By (47), it is immediate to see that

(σm(s))23m2s2exp(2ωa2ωr)/rn,K2. (48)

Similar to (37) and (40), we can show there exist some constants J0 > 0, K0 > 0 such that for any mJ0 and any s such that s0ssmax,

Pr(1s2maxl1,,ln{1,,Ns,εa}t1,,tK{1,,Ns2,εr}({1i}i,{tk}k)Λm(s)|ijk(Yijkπijk*){(a¯li(s))TR¯tk(s)a¯lj(s)θijk*}|(m1)2exp(ωa2ωr)16rn,K2)
2exp((m1)4s4exp(2ωa2ωr)/(256rn,K4)2(σm(s))2+(m1)2s2ωa2ωrexp(ωaωr)M0/(24rn,K2)+(2sn+2s2K)log(3nK))
2exp((m1)4s4/(256rn,K4)6m2s2/rn,K2+(m1)2s2/(24rn,K2)+(2sn+2s2K)log(3nK))2exp(K0m2s2rn,K2),

where the second inequality is due to (38), (48) and the fact that ωa2ωrexp(ωa2ωr). This yields

Pr(({a^i(s)}i,{R^k(s)}k)Sm(s))2exp(K0m2s2rn,K2). (49)

Let Jn,K be the integer such that Jn,K1(nK)1/4Jn,K. In view of (44), we have for any mJ0,

Pr(maxs{s0,,smax}rn,Ksexp(ωa2ωr)ijk(θijk*θ^ijk(s))2>m)s=s0smaxl=mJn,K2exp(K0l2s2rn,K2)+o(1)s=1+l=1+2exp(K0lmss0rn,K2)+o(1)2exp(s0mK0rn,K2)12exp(s0mK0rn,K2)+o(1)=o(1).

This implies there exists some constant C0 > 0 such that the following event occurs with probability tending to 1,

maxs{s0,,smax}1s2ijk(θ^ijk(s)θijk*)2C0exp(ωa2ωr)rn,K2, (50)

The proof is hence completed.

C.3. Proof of Lemma 3

For any s < s0, we have

ijk((a^i(s))TR^k(s)a^j(s)(ai*)TRk*aj*)2infa1(s),,an(s)sR1(s),,RK(s)s×sijk((ai(s))TRk(s)aj(s)(ai*)TRk*aj*)2 (51)
=infA(s)n×sR1(s),,RK(s)s×sk=1KA(s)Rk(s)(A(s))TA0Rk,0A0TF2
=infA(s)n×sR1(s),,RK(s)s×s(A(s)R1(s)A(s)RK(s))(A(s))T(A0R1,0A0RK,0)B0A0TF2
infA(s)n×s,B(s)nK×sB(s)(A(s))TB0A0TF2.

Define

(A^(s),B^(s))=argminA(s)n×sB(s)nK×sB(s)(A(s))TB0A0TF2.

The above minimizers are not unique. Notice that rank (B0A0T)rank(A0T)s0.. Assume B0A0T has the following singular value decomposition,

B0A0T=nUnΛnVnT,

for some UnnK×s0,Vnn×s0such thatUnTUn=VnTVn=Is0, and some diagonal matrix

Λn=diag(λn(1),λn(2),,λn(s0))

such that |λn(1)||λn(2)||λn(s0)|. Then one solution is given by

A^(s)=nVnΛnUnTUn(s),B^(s)=Un(s),

where Un(s) is the submatrix of Un formed by its first s columns.

Since UnTUn=Is0, we have

B^(s)(A^(s))T=nUn(s)(Un(s))TUnΛnVnT=nUn(s)(Is,Os,s0s)ΛnVnT=nUn(IsOs,s0sOs0s,sOs0s,s0s)ΛnVnT,

and hence

B^(s)(A^(s))TB0A0T=nUn(IsOs,s0sOs0s,sOs0s,s0s)ΛnVnTnUnIs0ΛnVnT=nUn(OsOs,s0sOs0s,sIs0s,s0s)ΛnVnT.

This together with (51) implies that

ijk((a^i(s))TR^k(s)a^j(s)(ai*)TRk*aj*)2B^(s)(A^(s))TB0A0TF2n2trace(Un(OsOs,s0sOs0s,sIs0s,s0s)ΛnVnTVnΛn(OsOs,s0sOs0s,sIs0s,s0s)UnT)=n2trace(Un(OsOs,s0sOs0s,sIs0s,s0s)Λn2(OsOs,s0sOs0s,sIs0s,s0s)UnT) (52)
=n2trace(Λn(OsOs,s0sOs0s,sIs0s,s0s)UnTUn(OsOs,s0sOs0s,sIs0s,s0s)Λn)=n2trace(Λn(OsOs,s0sOs0s,sIs0s,s0s)Λn)n2(λn(s0))2, (53)

where (52) is due to that VnTVn=Is0 and the equality in (53) is due to that UnTUn=Is0.

To summarize, we’ve shown

ijk((a^i(s))TR^k(s)a^j(s)(ai*)TRk*aj*)2n2(λn(s0))2. (54)

In the following, we provide a lower bound for (λn(s0))2. By definition, (λn(s0))2 is the s0-th largest eigenvalue of

1n2A0B0TB0A0T=1n2k=1KA0Rk,0TA0TA0Rk,0A0T.

We first provide a lower bound for $\lambda_{\min}\big(\sum_{k=1}^KR_{k,0}^TA_0^TA_0R_{k,0}/n\big)$. Let $\Sigma_A=A_0^TA_0/n$. Consider the following eigenvalue decomposition:

ΣA=UAΛAUAT,

for some orthogonal matrix UA and some diagonal matrix ΛA. Under Assumption (A2), the matrix

ΣAc¯Is0

is positive semidefinite. As a result, the matrix

k=1KRk,0T(ΣAc¯Is0)Rk,0

is positive semidefinite. Therefore, we have

λmin(k=1KRk,0TΣARk,0)=infa0s0,a02=1a0T(k=1KRk,0TΣARk,0)a0 (55)
=infa0s0,a02=1{a0T(k=1KRk,0T(ΣAc¯Is0)Rk,0)a0+c¯a0T(k=1KRk,0TRk,0)a0}c¯infa0s0,a02=1a0T(k=1KRk,0TRk,0)a0=c¯λmin(k=1KRk,0TRk,0)=c¯K¯.

By the eigenvalue decomposition, we have

k=1KRk,0TΣARk,0=URATΛRAURA,

for some orthogonal matrix $U_{RA}\in\mathbb{R}^{s_0\times s_0}$ and some diagonal matrix $\Lambda_{RA}\in\mathbb{R}^{s_0\times s_0}$. It follows from (55) that all the diagonal elements of $\Lambda_{RA}$ are positive. Let $\Lambda_{RA}^{1/2}$ be the diagonal matrix such that $\Lambda_{RA}^{1/2}\Lambda_{RA}^{1/2}=\Lambda_{RA}$. Clearly, the diagonal elements of $\Lambda_{RA}^{1/2}$ are nonzero. Notice that

1n2k=1KA0Rk,0TA0TA0Rk,0A0T=1nA0URATΛRA1/2URAURATΛRA1/2URAA0T.

The $s_0$th largest eigenvalue of $\frac{1}{n^2}\sum_{k=1}^KA_0R_{k,0}^TA_0^TA_0R_{k,0}A_0^T$ corresponds to the smallest eigenvalue of

1nURATΛRA1/2URAA0TA0URATΛRA1/2URA.

Similar to (55), we can show that

λmin(1nURATΛRA1/2URAA0TA0URATΛRA1/2URA)c¯λmin(URATΛRA1/2URAURATΛRA1/2URA)=c¯λmin(URATΛRAURA)c¯2λmin(k=1KRk,0TΣARk,0).

Combining this together with (55), we obtain that

(λn,k(s0))2c¯2K¯.

It follows from (54) that

ijk((a^i(s))TR^k(s)a^j(s)(ai*)TRk*aj*)2n2c¯2K¯.

This completes the proof.

C.4. Proof of Theorem 1

It suffices to show

Pr(IC(s0)>max1s<s0IC(s))1, (56)

and

Pr(IC(s0)>maxs0ssmaxIC(s))1. (57)

We first show (56). Combining Lemma 3 with (31), we obtain that

2l0({ai*}i,{Rk*}k)2l0({a^i(s)}i,{R^k(s)}k)ijk(θijk*θ^ijk(s))24exp(ωa2ωr)n2c¯2K¯4exp(ωa2ωr),

for any s{1,,s01}. Combining this with (42), we have that

2ln(Y;{ai*}i,{Rk*}k)2ln(Y;{a^i(s)}i,{R^k(s)}k)2l0({ai*}i,{Rk*}k)2l0({a^i(s)}i,{R^k(s)}k)2|ln(Y;{ai*}i,{Rk*}k)l0({ai*}i,{Rk*}k)|2|ln(Y;{a^i(s)}i,{R^k(s)}k)l0({a^i(s)}i,{R^k(s)}k)|c¯2n2K¯4exp(ωa2ωr)24ωa2ωrnKmax(n,K)log(nK), (58)

with probability tending to 1.

Under the given conditions, we have K~nl0for some0l01. This together with the condition n(1l0)/2K¯exp(2ωa2ωr)logn yields

ωa2ωrnKmax(n,K)log(nK)=O(exp(ωa2ωr)n3/2+l0/2logn)n2K¯exp(ωa2ωr). (59)

By (58), we have with probability tending to 1 that

2ln(Y;{ai*}i,{Rk*}k)2ln(Y;{a^i(s)}i,{R^k(s)}k)c¯2n2K¯8exp(ωa2ωr), (60)

for all 1 ≤ s < s0. By definition, we have

ln(Y;{a^i(s0)}i,{R^k(s0)}k)ln(Y;{ai*}i,{Rk*}k).

This together with (60) gives that, for all $1\le s<s_0$,

2ln(Y;{a^i(s0)}i,{R^k(s0)}k)2ln(Y;{a^i(s)}i,{R^k(s)}k)c¯2n2K¯8exp(ωa2ωr), (61)

with probability tending to 1.

Under the given conditions, we have κ(n,K)n2K¯/exp(ωa2ωr). Under the event defined in (61), we have that

IC(s0)IC(s)=2ln(Y;{a^i(s0)}i,{R^k(s0)}k)2ln(Y;{a^i(s)}i,{R^k(s)}k)(s0s)κ(n,K)c¯2n2K¯8exp(ωa2ωr)s0κ(n,K)0,

since s0 is fixed. This proves (56).

Now we show (57). Similar to (37)-(42), we can show the following event occurs with probability tending to 1,

sups{s0,,smax}a1(s),,an(s)Ωa,R1(s),,RK(s)Ωrd{{ai(s)}i,{Rk(s)}k;{ai*}i,{Rk*}k)sexp(ωa2ωr)=O(rn,K1nK)1s2|ijk(Yijkπijk*){θijk*(ai(s))TRk(s)aj(s)}|=O(exp(ωa2ωr)rn,K2). (62)

By Lemma 2, we obtain with probability tending to 1,

maxs{s0,,smax}1s2|ijk(Yijkπijk*){θijk*(a^i(s))TR^k(s)a^j(s)}|=O(exp(ωa2ωr)rn,K2),

and hence

maxs{s0,,smax}1s|ijk(Yijkπijk*){θijk*(a^i(s))TR^k(s)a^j(s)}|=O(smaxexp(ωa2ωr)rn,K2). (63)

Since l0({ai*}i,{Rk*}k)l0({ai*}i,{Rk*}k), under the event defined in (63), we have

2ln(Y;{a^i(s)}i,{R^k(s)}k)2ln(Y;{ai*}i,{Rk*}k)2l0({a^i(s)}i,{R^k(s)}k)2l0({ai*}i,{Rk*}k)2|ijk(Yijkπijk*){θijk*(a^i(s))TR^k(s)a^j(s)}|=O(ssmaxexp(ωa2ωr)rn,K2).

Notice that

l0({a^i(s)}i,{R^k(s)}k)l0({ai*}i,{Rk*}k),

we have with probability tending to 1 that

2ln(Y;{a^i(s)}i,{R^k(s)}k)2ln(Y;{a^i(s0)}i,{R^k(s0)}k)O(ssmaxexp(ωa2ωr)rn,K2), (64)

for all s0<ssmax. Under the event defined in (64), we have

IC(s0)IC(s)2ln(Y;{a^i(s0)}i,{R^k(s0)}k)2ln(Y;{a^i(s)}i,{R^k(s)}k)+(ss0)κ(n,K)(ss0)κ(n,K)O(ssmaxexp(ωa2ωr)rn,K2).

Under the condition κ(n,K)smaxexp(ωa2ωr)(n+K)(logn+logK), we have that

Pr(IC(s0)>maxs0<ssmaxIC(s))1,

This proves (57). The proof is hence completed.

C.5. Proof of Corollary 1

Using similar arguments in Lemma 3, we can show for any permutation function π : {1,,n}{1,,n},

ijk((a^i,π(s))TR^k,π(s)a^j,π(s)(ai,π*)TRk,π*aj,π*)2n2c¯2K¯. (65)

In addition, it follows from (41) and the condition $K = O(n^{l_0})$ that

$$\Pr\Bigg(\max_{\substack{s \in \{1,\ldots,s_{\max}\} \\ \pi: \{1,\ldots,n\} \to \{1,\ldots,n\}}} \frac{1}{s}\Big|\sum_{i,j,k}(Y_{ijk} - \pi_{ijk}^*)\,(\hat{a}_{i,\pi}^{(s)})^T \hat{R}_{k,\pi}^{(s)} \hat{a}_{j,\pi}^{(s)}\Big| \le C_0\,\omega_a^2\omega_r\, n^{3/2 + l_0/2}\log n\Bigg) \to 1,$$

and

$$\Pr\Big(\frac{1}{s_0}\Big|\sum_{i,j,k}(Y_{ijk} - \pi_{ijk}^*)\,(a_i^*)^T R_k^* a_j^*\Big| \le C_0\,\omega_a^2\omega_r\, n^{3/2 + l_0/2}\log n\Big) \to 1,$$

for some constant $C_0 > 0$. Hence, the following event occurs with probability tending to 1:

$$\max\Big(\max_{s,\pi}\big| l_0(\{\hat{a}_{i,\pi}^{(s)}\}_i, \{\hat{R}_{k,\pi}^{(s)}\}_k) - l_n(Y; \{\hat{a}_{i,\pi}^{(s)}\}_i, \{\hat{R}_{k,\pi}^{(s)}\}_k)\big|/s,\; \big| l_0(\{a_i^*\}_i, \{R_k^*\}_k) - l_n(Y; \{a_i^*\}_i, \{R_k^*\}_k)\big|/s_0\Big) \le C_0\,\omega_a^2\omega_r\, n\sqrt{K\max(n,K)}\log(nK). \tag{66}$$

Combining (65) with (31) yields

$$2 l_0(\{a_i^*\}_i, \{R_k^*\}_k) - 2\max_{\substack{s \in \{1,\ldots,s_0-1\} \\ \pi: \{1,\ldots,n\} \to \{1,\ldots,n\}}} l_0(\{\hat{a}_{i,\pi}^{(s)}\}_i, \{\hat{R}_{k,\pi}^{(s)}\}_k) \ge \frac{n^2 \bar{c}^2 \bar{K}}{4\exp(\omega_a^2\omega_r)}. \tag{67}$$

Under the given conditions, we have

$$\frac{n^2 \bar{c}^2 \bar{K}}{4\exp(\omega_a^2\omega_r)} \gg \omega_a^2\omega_r\, n\sqrt{K\max(n,K)}\log(nK).$$

This together with (66) and (67) gives

$$2 l_n(Y; \{a_i^*\}_i, \{R_k^*\}_k) - 2\max_{\substack{s \in \{1,\ldots,s_0-1\} \\ \pi: \{1,\ldots,n\} \to \{1,\ldots,n\}}} l_n(Y; \{\hat{a}_{i,\pi}^{(s)}\}_i, \{\hat{R}_{k,\pi}^{(s)}\}_k) \ge \frac{n^2 \bar{c}^2 \bar{K}}{8\exp(\omega_a^2\omega_r)},$$

with probability tending to 1.

Under Condition (A4), we have $l_n(Y; \{\hat{a}_{i,\pi}^{(s_0)}\}_i, \{\hat{R}_{k,\pi}^{(s_0)}\}_k) \ge l_n(Y; \{a_{i,\pi}^*\}_i, \{R_{k,\pi}^*\}_k) = l_n(Y; \{a_i^*\}_i, \{R_k^*\}_k)$. Therefore, the following event occurs with probability tending to 1:

$$\min_{\pi: \{1,\ldots,n\} \to \{1,\ldots,n\}} 2\Big( l_n(Y; \{\hat{a}_{i,\pi}^{(s_0)}\}_i, \{\hat{R}_{k,\pi}^{(s_0)}\}_k) - \max_{s \in \{1,\ldots,s_0-1\}} l_n(Y; \{\hat{a}_{i,\pi}^{(s)}\}_i, \{\hat{R}_{k,\pi}^{(s)}\}_k)\Big) \ge \frac{n^2 \bar{c}^2 \bar{K}}{8\exp(\omega_a^2\omega_r)}.$$

Under the given conditions in Corollary 1, we obtain

$$\Pr\Big(\bigcap_{\pi: \{1,\ldots,n\} \to \{1,\ldots,n\}}\big\{\mathrm{IC}_\pi(s_0) > \max_{1 \le s \le s_0 - 1} \mathrm{IC}_\pi(s)\big\}\Big) \to 1. \tag{68}$$

Similar to (46) and (49), we can show

$$\begin{aligned}
&\Pr\Big(\bigcup_{\pi: \{1,\ldots,n\} \to \{1,\ldots,n\}}\big\{(\{\hat{a}_{i,\pi}^{(s)}\}_i, \{\hat{R}_{k,\pi}^{(s)}\}_k) \in S_m^{(s)}\big\}\Big)\\
&\quad\le \Pr\Bigg(\sup_{(\{a_i^{(s)}\}_i, \{R_k^{(s)}\}_k) \in S_m^{(s)}}\Big|\sum_{i,j,k}(Y_{ijk} - \pi_{ijk}^*)\big\{\theta_{ijk}^* - (a_i^{(s)})^T R_k^{(s)} a_j^{(s)}\big\}\Big| \ge \frac{(m-1)^2 s^2 \exp(\omega_a^2\omega_r)\, r_{n,K}^2}{8}\Bigg)\\
&\quad\le 2\exp\big(-K_0\, m^2 s^2 r_{n,K}^2\big),
\end{aligned}$$

for some constant $K_0 > 0$. Using arguments similar to those in the proof of Lemma 2, this yields, with probability tending to 1,

$$\max_{\substack{s \in \{s_0,\ldots,s_{\max}\} \\ \pi: \{1,\ldots,n\} \to \{1,\ldots,n\}}} \frac{1}{s^2}\, d^2\big(\{\hat{a}_{i,\pi}^{(s)}\}_i, \{\hat{R}_{k,\pi}^{(s)}\}_k; \{a_i^*\}_i, \{R_k^*\}_k\big) \lesssim \frac{\exp(2\omega_a^2\omega_r)(n+K)(\log n + \log K)}{n^2 K}.$$

This together with (62) gives

$$\max_{\substack{s \in \{s_0,\ldots,s_{\max}\} \\ \pi: \{1,\ldots,n\} \to \{1,\ldots,n\}}} \frac{1}{s}\Big|\sum_{i,j,k}(Y_{ijk} - \pi_{ijk}^*)\big\{\theta_{ijk}^* - (\hat{a}_{i,\pi}^{(s)})^T \hat{R}_{k,\pi}^{(s)} \hat{a}_{j,\pi}^{(s)}\big\}\Big| = O\big(s_{\max}\exp(\omega_a^2\omega_r)\, r_{n,K}^2\big),$$

with probability tending to 1. Using arguments similar to those in the proof of Theorem 1, we can show

$$\Pr\Big(\bigcap_{\pi: \{1,\ldots,n\} \to \{1,\ldots,n\}}\big\{\mathrm{IC}_\pi(s_0) > \max_{s_0 < s \le s_{\max}} \mathrm{IC}_\pi(s)\big\}\Big) \to 1.$$

This together with (68) yields (16). The proof is hence completed.
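Corollary 1 states that the selection consistency in (16) holds uniformly over all permutations $\pi$ of the entity indices. A minimal sketch of what this means in practice is given below, under the assumption that $\mathrm{IC}_\pi$ denotes the criterion computed after relabeling the entities by $\pi$; the routine fit_loglik(Y, s), which should return the maximized log-likelihood of the $s$-dimensional model, is a placeholder supplied by the user and is not part of the paper.

import numpy as np

def selections_under_permutations(Y, s_grid, kappa, fit_loglik, n_perm=5, seed=0):
    """Recompute the IC-selected dimension after randomly relabeling entities.

    Y: binary array of shape (n, n, K); s_grid: candidate dimensions;
    kappa: penalty kappa(n, K); fit_loglik(Y, s): user-supplied fitting routine.
    """
    rng = np.random.default_rng(seed)
    selected = []
    for _ in range(n_perm):
        perm = rng.permutation(Y.shape[0])
        Y_perm = Y[perm][:, perm, :]           # relabel entities in the first two modes
        ic = {s: 2.0 * fit_loglik(Y_perm, s) - s * kappa for s in s_grid}
        selected.append(max(ic, key=ic.get))
    return selected                            # under Corollary 1, all entries equal s_0 with high probability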

Appendix D. Additional Simulation Results

We simulate the responses $\{Y_{ijk}\}_{i,j,k}$ from the following model:

$$\Pr(Y_{ijk} = 1 \mid \{a_i\}_i, \{R_k\}_k) = \frac{\exp(a_i^T R_k a_j)}{1 + \exp(a_i^T R_k a_j)}, \qquad a_1, a_2, \ldots, a_n \overset{iid}{\sim} N\big(0,\, \{0.5^{|i-j|}\}_{i,j = 1,\ldots,s_0}\big), \qquad R_1 = R_2 = \cdots = R_K = \mathrm{diag}(\underbrace{1, -1, 1, -1, \ldots, 1, -1}_{s_0}).$$

We use the same six settings as described in Section 4.2. In each setting, we further consider three scenarios by setting $s_0$ = 2, 4 and 6. Reported in Tables 4 and 5 are the percentage of replications in which the true model is selected (TP) and the average of $\hat{s}$ selected by IC0, IC0.5, IC1 and BIC over 100 replications.
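For concreteness, the following minimal NumPy sketch simulates one data set from the model displayed above; the alternating-sign diagonal used for $R_k$ reflects our reading of that display, and the function and variable names are ours rather than the paper's.

import numpy as np

def simulate_rescal_logistic(n, K, s0, seed=0):
    """Simulate binary responses Y_{ijk} from the logistic RESCAL model above."""
    rng = np.random.default_rng(seed)
    # AR(1)-type covariance: Sigma_{uv} = 0.5^{|u - v|}
    idx = np.arange(s0)
    Sigma = 0.5 ** np.abs(np.subtract.outer(idx, idx))
    A = rng.multivariate_normal(np.zeros(s0), Sigma, size=n)    # n x s0 latent factors a_i
    R = np.diag([(-1.0) ** m for m in range(s0)])               # diag(1, -1, 1, -1, ...)
    theta = A @ R @ A.T                                         # theta_{ij} = a_i^T R a_j
    prob = 1.0 / (1.0 + np.exp(-theta))                         # logistic link
    Y = (rng.random((n, n, K)) < prob[:, :, None]).astype(int)  # same R_k for all K relations
    return Y, A, R

# e.g., Y, A, R = simulate_rescal_logistic(n=100, K=3, s0=4)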

Table 4:

Simulation results for Settings I, II and III (standard errors in parentheses)

s0 = 2   s0 = 4   s0 = 6
n = 100, K = 3   TP   ŝ   TP   ŝ   TP   ŝ
IC0 1.00(0.00) 2.00(0.00) 0.96(0.02) 3.98(0.02) 0.88(0.03) 5.87(0.04)
IC0.5 1.00(0.00) 2.00(0.00) 0.96(0.02) 3.98(0.02) 0.88(0.03) 5.87(0.04)
IC1 1.00(0.00) 2.00(0.00) 0.96(0.02) 3.98(0.02) 0.85(0.04) 5.81(0.05)
BIC 0.00(0.00) 11.98(0.01) 0.00(0.00) 11.99(0.01) 0.00(0.00) 12.00(0.00)

n = 150, K = 3   TP   ŝ   TP   ŝ   TP   ŝ
IC0 1.00(0.00) 2.00(0.00) 0.97(0.02) 4.03(0.02) 0.94(0.02) 6.04(0.02)
IC0.5 1.00(0.00) 2.00(0.00) 0.97(0.02) 4.03(0.02) 0.94(0.02) 6.04(0.02)
IC1 1.00(0.00) 2.00(0.00) 0.97(0.02) 4.03(0.02) 0.94(0.02) 6.04(0.02)
BIC 0.00(0.00) 12.00(0.00) 0.00(0.00) 12.00(0.00) 0.00(0.00) 11.99(0.01)

n = 200, K = 3   TP   ŝ   TP   ŝ   TP   ŝ
IC0 1.00(0.00) 2.00(0.00) 0.97(0.02) 4.03(0.02) 0.98(0.01) 6.02(0.01)
IC0.5 1.00(0.00) 2.00(0.00) 0.97(0.02) 4.03(0.02) 0.98(0.01) 6.02(0.01)
IC1 1.00(0.00) 2.00(0.00) 0.97(0.02) 4.03(0.02) 0.98(0.01) 6.02(0.01)
BIC 0.00(0.00) 12.00(0.00) 0.00(0.00) 12.00(0.00) 0.00(0.00) 11.99(0.01)

It can be seen from Tables 4 and 5 that our proposed information criteria are consistent. In contrast, BIC fails in all settings. In addition, in the last two settings, the TPs of IC0.5 and IC1 are larger than those of IC0 in most cases. As commented before, these differences are due to the finite-sample correction term $\tau_\alpha(n, K)$.

Table 5:

Simulation results for Settings IV, V and VI (standard errors in parentheses)

s0 = 2   s0 = 4   s0 = 6
n = 50, K = 10   TP   ŝ   TP   ŝ   TP   ŝ
IC0 1.00(0.00) 2.00(0.00) 0.96(0.02) 3.98(0.02) 0.73(0.04) 5.83(0.06)
IC0.5 1.00(0.00) 2.00(0.00) 0.95(0.02) 3.97(0.02) 0.69(0.05) 5.77(0.06)
IC1 1.00(0.00) 2.00(0.00) 0.93(0.03) 3.93(0.03) 0.63(0.05) 5.57(0.07)
BIC 0.00(0.00) 11.83(0.05) 0.00(0.00) 11.82(0.04) 0.00(0.00) 11.86(0.04)

n = 50, K = 20   TP   ŝ   TP   ŝ   TP   ŝ
IC0 0.98(0.01) 2.02(0.01) 0.90(0.03) 4.10(0.03) 0.76(0.04) 6.06(0.05)
IC0.5 0.98(0.01) 2.02(0.01) 0.94(0.02) 3.98(0.02) 0.81(0.04) 5.99(0.04)
IC1 0.98(0.01) 2.02(0.01) 0.94(0.02) 3.94(0.02) 0.74(0.04) 5.81(0.05)
BIC 0.00(0.00) 12.00(0.00) 0.00(0.00) 12.00(0.00) 0.00(0.00) 11.99(0.01)

n = 50, K = 50   TP   ŝ   TP   ŝ   TP   ŝ
IC0 0.96(0.02) 2.04(0.02) 0.88(0.03) 4.12(0.03) 0.68(0.05) 6.57(0.13)
IC0.5 0.98(0.01) 2.02(0.01) 0.94(0.02) 4.04(0.02) 0.82(0.04) 6.06(0.04)
IC1 0.98(0.01) 2.02(0.01) 0.94(0.02) 4.02(0.02) 0.74(0.04) 5.75(0.05)
BIC 0.00(0.00) 12.00(0.00) 0.00(0.00) 12.00(0.00) 0.00(0.00) 12.00(0.00)
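The entries of Tables 4 and 5 can be reproduced from the dimensions selected across replications: TP is the proportion of replications in which $\hat{s} = s_0$, and the second column is the average of $\hat{s}$, each reported with its standard error. A minimal sketch of this bookkeeping step follows, assuming the array of selected dimensions is available from the simulation and selection sketches above.

import numpy as np

def summarize_selections(s_hat, s0):
    """Return (TP, SE of TP, mean of s_hat, SE of mean) over the replications."""
    s_hat = np.asarray(s_hat, dtype=float)
    N = s_hat.size
    tp = float(np.mean(s_hat == s0))               # proportion of correct selections
    tp_se = np.sqrt(tp * (1.0 - tp) / N)           # binomial standard error
    s_mean = s_hat.mean()
    s_se = s_hat.std(ddof=1) / np.sqrt(N)          # standard error of the average
    return tp, tp_se, s_mean, s_se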

