Abstract
Statistical relational learning is primarily concerned with learning and inferring relationships between entities in large-scale knowledge graphs. Nickel et al. (2011) proposed a RESCAL tensor factorization model for statistical relational learning, which achieves better or at least comparable results on common benchmark data sets when compared to other state-of-the-art methods. Given a positive integer s, RESCAL computes an s-dimensional latent vector for each entity. The latent factors can be further used for solving relational learning tasks, such as collective classification, collective entity resolution and link-based clustering.
The focus of this paper is to determine the number of latent factors in the RESCAL model. Due to the structure of the RESCAL model, its log-likelihood function is not concave. As a result, the corresponding maximum likelihood estimators (MLEs) may not be consistent. Nonetheless, we design a specific pseudometric, prove the consistency of the MLEs under this pseudometric and establish their rate of convergence. Based on these results, we propose a general class of information criteria and prove their model selection consistency when the number of relations is either bounded or diverges at a proper rate with the number of entities. Simulations and real data examples show that our proposed information criteria have good finite sample properties.
Keywords: Information criteria, Knowledge graph, Model selection consistency, RESCAL model, Statistical relational learning, Tensor factorization
1. Introduction
Relational data is becoming ubiquitous in artificial intelligence and social network analysis. These data sets are in the form of graphs, with nodes and edges representing entities and relationships, respectively. Recently, a number of companies have developed and released their knowledge graphs, including the Google Knowledge Graph, Microsoft Bing’s Satori Knowledge Base, Yandex’s Object Answer, the LinkedIn Knowledge Graph, etc. These knowledge graphs are graph-structured knowledge bases that store factual information as relationships between entities. They are created via the automatic extraction of semantic relationships from semi-structured or unstructured text (see Section II.C in Nickel et al., 2016). The data may be incomplete, noisy and contain false information. It is therefore of great importance to infer the existence of a particular relationship in order to improve the quality of the extracted information.
Statistical relational learning is primarily concerned with learning from relational data sets, and solving tasks such as predicting whether two entities are related (link prediction), identifying equivalent entities (entity resolution), and grouping similar entities based on their relationships (link-based clustering). Statistical relational models can be roughly divided into three categories: relational graphical models, latent class models and tensor factorization models. Relational graphical models include probabilistic relational models (Getoor and Mihalkova, 2011) and Markov logic networks (MLN, Richardson and Domingos, 2006). These models are constructed via Bayesian or Markov networks. In latent class models, each entity is assigned to one of the latent classes and the probability of a relationship between entities depends on their corresponding classes. Two important examples include the stochastic block model (SBM, Nowicki and Snijders, 2001) and the infinite relational model (IRM, Kemp et al., 2006). IRM can be viewed as a nonparametric extension of SBM where the total number of clusters is not prespecified. Both models have received considerable attention in the statistics and machine learning literature for community detection in networks.
Tensors are multidimensional arrays. Tensor factorization methods such as CANDECOMP/PARAFAC (CP, Harshman and Lundy, 1994), Tucker (Tucker, 1966) and their extensions have found applications in a variety of fields. Kolda and Bader (2009) presented a thorough overview of tensor decompositions and their applications. Recently, tensor factorizations have been actively studied in the statistics literature and have become an emerging area of statistics. To name a few, Chi and Kolda (2012) developed a Poisson tensor factorization model for sparse count data. Yang and Dunson (2016) proposed a conditional tensor factorization model for high-dimensional classification with categorical predictors. Sun et al. (2017) proposed a sparse tensor decomposition method by incorporating a truncation step into the tensor power iteration step.
Relational data sets are typically expressed as (subject, predicate, object) triples and can be grouped as a third-order tensor. As a result, tensor factorization methods can be naturally applied to these data sets. Nickel (2013) proposed a RESCAL factorization model for statistical relational learning. Compared to other tensor factorization approaches such as CP and Tucker methods, RESCAL is more capable of detecting the correlations produced between multiple interconnected nodes. For relational data consisting of n entities, K types of relations, and a positive integer s, RESCAL computes an n × s factor matrix and an s × s × K core tensor. The factor matrix and the core tensor can be further used for link prediction, entity resolution and link-based clustering. Nickel et al. (2011) showed that a linear RESCAL model achieved better or comparable results on common benchmark data sets when compared to other existing methods such as MLN, DEDICOM (Harshman, 1978), IRM, CP, MRC (Kok and Domingos, 2007), etc. It was shown in Nickel and Tresp (2013) that a logistic RESCAL model could further improve the link prediction results.
Central to the empirical validity of RESCAL is the correct specification of the number of latent factors. Nickel et al. (2011) proposed to select this parameter via cross-validation. As is commonly known for cross-validation, there is no theoretical guarantee against overestimation. Besides, cross-validation can be computationally expensive, especially for large n and K. In the literature, model selection is less studied for tensor factorization methods. Allen (2012) and Sun et al. (2017) proposed to use Bayesian information criteria (BIC, Schwarz, 1978) for sparse CP decomposition. However, no theoretical results were provided for BIC. Indeed, we show in this paper that a BIC-type criterion may fail for the RESCAL model.
The contribution of this paper is twofold. First, we propose a general class of information criteria for the RESCAL model and prove their model selection consistency. Although we focus on the RESCAL model, our information criteria can be extended to select models for general tensor factorization methods with slight modification. The problem is nonstandard and challenging since both the factor matrix and the core tensor are not observed and need to be estimated. Besides, the model parameters are non-identifiable. Moreover, the derivation of model/tuning parameter selection consistency of information criteria usually relies on the (uniform) consistency of estimated parameters. For example, Fan and Tang (2013) derived the uniform consistency of the maximum likelihood estimators (MLEs) to prove the consistency of GIC (see Proposition 2 in that paper). Zhang et al. (2016) established the uniform consistency of the support vector machine solutions to prove the consistency of SVMICh (see Lemma 2 in that paper). The consistency of these estimators is due to the concavity (convexity) of the likelihood (or the empirical loss) functions. In contrast, for most tensor decomposition models including RESCAL, the likelihood (or the empirical loss) function is usually non-concave (non-convex) and may have multiple local solutions. As a result, the corresponding global maximizer (minimizer) may not be consistent even with the identifiability constraints. It remains unknown how to establish the consistency of the information criterion without consistency of the estimator. A key innovation in our analysis is to design a “proper” pseudometric and show that the global optimum is consistent under this specific pseudometric. We further establish the rate of convergence of the global optimum under this pseudometric as a function of n and K. Based on these results, we establish the consistency of our information criteria when K is either bounded or diverges at a proper rate of n. No parametric assumptions are imposed on the latent factors. Second, we introduce a scalable algorithm for estimating the parameters in the logistic RESCAL model. Despite the fact that a linear RESCAL model can be conveniently solved by an alternating least squares algorithm (Nickel et al., 2011), there is a lack of optimization algorithms for solving general RESCAL models. The proposed algorithm is based on the alternating direction method of multipliers (ADMM, Boyd et al., 2011) and can be implemented in a parallelized fashion.
The rest of the paper is organized as follows. We formally introduce the RESCAL model and study the parameter identifiability in Section 2. Our information criteria are presented in Section 3 and their model selection properties are investigated. Numerical examples are presented in Section 4 to examine the finite sample performance of the proposed information criteria. Section 5 concludes with a summary and discussion of future extensions. All the proofs are given in the Appendix.
2. The RESCAL Model
This section is structured as follows. We introduce the RESCAL model in Section 2.1. In Section 2.2, we study the identifiability of parameters in the model.
2.1. Model Setup
In knowledge graphs, facts can be expressed in the form of (subject, predicate, object) triples, where subject and object are entities and predicate is the relation between entities. For example, consider the following sentence from Wikipedia:
Jon Snow is a fictional character in the A Song of Ice and Fire series of fantasy novels by American author George R. R. Martin, and its television adaptation Game of Thrones.
The information contained in this sentence can be summarized into the following set of (subject, predicate, object) triples:
| Subject | Predicate | Object |
|---|---|---|
| Jon Snow | character in | A Song of Ice and Fire |
| Jon Snow | character in | Game of Thrones |
| A Song of Ice and Fire | genre | novel |
| Game of Thrones | genre | television series |
| George R.R. Martin | author of | A Song of Ice and Fire |
| George R.R. Martin | profession | novelist |
In this example, we have a total of 7 entities, 4 types of relations and 6 triples. More generally, let {e1, … , en} denote the set of all entities and {r1, … , rK} denote the set of all relation types. The number of relations K is either bounded or diverges with n. Assuming non-existing triples indicate false relationships, we can construct a third-order binary tensor Y of dimension n × n × K such that Yijk = 1 if the triple (ei, rk, ej) holds and Yijk = 0 otherwise.
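As an illustration, the R sketch below encodes the six example triples as such a binary tensor. It is not part of the paper's software; the entity and relation orderings are arbitrary choices made here for illustration.

```r
# Encode the six example triples as a binary n x n x K tensor Y with
# Y[i, j, k] = 1 iff relation k holds between subject i and object j.
entities  <- c("Jon Snow", "A Song of Ice and Fire", "Game of Thrones",
               "novel", "television series", "George R.R. Martin", "novelist")
relations <- c("character in", "genre", "author of", "profession")
triples <- rbind(
  c("Jon Snow", "character in", "A Song of Ice and Fire"),
  c("Jon Snow", "character in", "Game of Thrones"),
  c("A Song of Ice and Fire", "genre", "novel"),
  c("Game of Thrones", "genre", "television series"),
  c("George R.R. Martin", "author of", "A Song of Ice and Fire"),
  c("George R.R. Martin", "profession", "novelist")
)
n <- length(entities); K <- length(relations)
Y <- array(0L, dim = c(n, n, K),
           dimnames = list(entities, entities, relations))
for (t in seq_len(nrow(triples))) {
  Y[triples[t, 1], triples[t, 3], triples[t, 2]] <- 1L  # subject, object, relation
}
sum(Y)  # 6 observed triples; all non-existing triples are treated as 0
```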
The RESCAL model is defined as follows. For each entity ei, an s0-dimensional latent vector ai,0 is generated. The Yijk’s are assumed to be conditionally independent given all latent factors a1,0, … , an,0. Besides, it is assumed that

Pr(Yijk = 1 | a1,0, … , an,0) = g(ai,0⊤Rk,0aj,0),   (1)

for some strictly monotone link function g and s0 × s0 matrices R1,0, … , RK,0. In the above model, ai,0 corresponds to the latent representation of the ith entity and Rk,0 specifies how these ai,0’s interact for the kth relation. To account for asymmetric relations, we do not restrict the Rk,0’s to symmetric matrices. When the relations are symmetric, i.e., Yijk = Yjik for all i, j and k, one can impose the symmetry constraints Rk,0 = Rk,0⊤ and obtain a similar derivation.
For continuous Yijk, a related tensor factorization model is the TUCKER-2 decomposition, which decomposes the tensor into

Yijk = ai⊤Rkbj + εijk,   (2)

for some s1-dimensional vectors ai, s2-dimensional vectors bj, s1 × s2 matrices Rk and some (random) errors εijk. By Equation 1, RESCAL can be interpreted as a “nonlinear” TUCKER-2 model with the additional constraints that s1 = s2 = s0 and ai = bi for all i.
CP decomposition is another important tensor factorization method that decomposes a tensor into a sum of rank-1 tensors. It assumes that

Yijk = Σs=1,…,s0 ai,sbj,srk,s + εijk,

for some ai = (ai,1, … , ai,s0)⊤, bj = (bj,1, … , bj,s0)⊤ and rk = (rk,1, … , rk,s0)⊤. Define A = (a1, … , an)⊤ and B = (b1, … , bn)⊤. In view of Equation 2, CP is a special TUCKER-2 model with the constraints that s1 = s2 = s0 and Rk = diag(rk,1, … , rk,s0), where diag(rk,1, … , rk,s0) is a diagonal matrix with the sth diagonal element being rk,s.
In this paper, the proposed information criteria are designed in particular for the RESCAL model. However, they can be extended to estimate s0 in a more general tensor factorization framework including CP and TUCKER-2 models. We discuss this further in Section 5.
2.2. Identifiability
The parameterization in Equation 1 is not identifiable. To see this, for any nonsingular s0 × s0 matrix G, we define ãi,0 = Gai,0 and R̃k,0 = (G⊤)⁻¹Rk,0G⁻¹. Observe that

ãi,0⊤R̃k,0ãj,0 = ai,0⊤G⊤(G⊤)⁻¹Rk,0G⁻¹Gaj,0 = ai,0⊤Rk,0aj,0,

and hence we have

g(ãi,0⊤R̃k,0ãj,0) = g(ai,0⊤Rk,0aj,0),   for all i, j and k.

Let A0 = (a1,0, … , an,0)⊤ denote the n × s0 matrix of latent factors. We impose the following condition.
(A0) (i) Assume A0 has full column rank. (ii) Assume the s0 × Ks0 matrix (R1,0, … , RK,0) has full row rank.
(A0)(i) requires the latent factors to be linearly independent. (A0)(ii) holds when at least one of the Rk,0’s has full rank. Under Condition (A0), the following lemma states that the RESCAL model is identifiable up to a nonsingular linear transformation. In Section B.1 of the Appendix, we show (A0) is also necessary to guarantee such identifiability when the Rk,0’s are symmetric.
Lemma 1 (Identifiability).
Assume (A0) holds. Assume there exist some ãi,0’s and R̃k,0’s that also satisfy (A0), and

g(ãi,0⊤R̃k,0ãj,0) = g(ai,0⊤Rk,0aj,0),   for all i, j and k.

Then, there exists some invertible s0 × s0 matrix G such that ãi,0 = Gai,0 and R̃k,0 = (G⊤)⁻¹Rk,0G⁻¹ for all i and k.
To fix the nonsingular transformation indeterminacy, we adopt a specific constrained parameterization and focus on estimating a*i,0 and R*k,0, where

a*i,0 = (A0(1)⊤)⁻¹ai,0   and   R*k,0 = A0(1)Rk,0A0(1)⊤,

where A0(1) denotes the s0 × s0 submatrix formed by the first s0 rows of A0. Observe that

(a*1,0, … , a*s0,0)⊤ = A0(1)A0(1)⁻¹ = Is0,

where Is0 stands for an s0 × s0 identity matrix. Therefore, the first s0 a*i,0’s are fixed as long as A0(1) is nonsingular. By Lemma 1, the parameters a*i,0 and R*k,0 are estimable.
From now on, we only consider the logistic link function for simplicity, i.e., g(x) = 1/{1 + exp(−x)}. Results for other link functions can be similarly discussed.
3. Model Selection
Parameters a*i,0 and R*k,0 can be estimated by maximizing the (conditional) log-likelihood function. Since we use the logistic link function, the log-likelihood of (A, R1, … , RK) is equal to

Σi,j,k log Pr(Yijk | a1, … , an, R1, … , RK) = Σi,j,k [ Yijk ai⊤Rkaj − log{1 + exp(ai⊤Rkaj)} ],

where the first equality is due to the conditional independence assumption.
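For concreteness, the following R sketch evaluates this log-likelihood directly from the expression above. The function name rescal_loglik and the representation of the relational matrices as a list are our own illustrative choices, not notation from the paper.

```r
# Logistic RESCAL log-likelihood: theta_ijk = a_i' R_k a_j and
# log Pr(Y_ijk | .) = Y_ijk * theta_ijk - log(1 + exp(theta_ijk)).
# A is an n x s matrix whose ith row is a_i; R_list is a list of K s x s matrices.
rescal_loglik <- function(Y, A, R_list) {
  ll <- 0
  for (k in seq_along(R_list)) {
    Theta <- A %*% R_list[[k]] %*% t(A)     # n x n matrix of a_i' R_k a_j
    ll <- ll + sum(Y[, , k] * Theta - log1p(exp(Theta)))
  }
  ll
}
```

A numerically safer evaluation of log(1 + exp(θ)) can be substituted for large |θ|, but the simple form above suffices to convey the structure.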
We assume the number of latent factors s0 is fixed. For any 1 ≤ s ≤ smax, where smax is allowed to diverge with n and satisfies smax ≥ s0, we define the following constrained maximum likelihood estimator
| (3) |
| (4) |
for some ωa, ωr > 0, where the vec(·) operator stacks the entries of a matrix into a column vector. To estimate the number of latent factors, we define a class of likelihood-based information criteria IC(s), 1 ≤ s ≤ smax, each obtained by penalizing the maximized constrained log-likelihood of the s-factor model with a model complexity term built from penalty functions κ(·, ·). The estimated number of latent factors is given by

ŝ = argmin1≤s≤smax IC(s).   (5)
In addition to the constraint in Equation 4, there exist many other constraints that would make the estimators identifiable. The choice of the identifiability constraints might affect the value of IC. However, it would not affect the value of ŝ. Detailed discussions can be found in Section A of the Appendix.
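The selection rule in Equation 5 can be sketched as the simple loop below, written in R. Here fit_rescal and penalty are hypothetical placeholders (a fitting routine returning the constrained MLEs and maximized log-likelihood, and a user-supplied model complexity penalty), and the generic penalized-likelihood form is used only as a stand-in for the exact criteria defined in this section.

```r
# Schematic of Equation 5: fit the model for each candidate s and pick the
# minimizer of the information criterion. fit_rescal(Y, s) and penalty(n, K, s)
# are assumed, hypothetical interfaces.
select_s <- function(Y, s_max, fit_rescal, penalty) {
  n <- dim(Y)[1]; K <- dim(Y)[3]
  ic <- sapply(seq_len(s_max), function(s) {
    fit <- fit_rescal(Y, s)              # assumed to return fit$loglik, fit$A, fit$R_list
    -2 * fit$loglik + penalty(n, K, s)   # generic penalized-likelihood form
  })
  which.min(ic)                          # estimated number of latent factors
}
```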
A major technical difficulty in establishing the consistency of IC is due to the nonconcavity of the objective function given in Equation 3. For any , let
be the set of parameters.
For any , we define
With some calculations, we can show that
where . Here, I1 is nonnegative. However, I2 can be negative for some β and ζ. Therefore, the negative Hessian matrix is not positive semidefinite and the likelihood function is not concave. As a result, and may not be consistent to and , even with the identifiability constraints in Equation 4. Here, the presence of I2 is due to the bilinear formulation of the RESCAL model.
Let θijk = ai⊤Rkaj. Notice that the log-likelihood is concave in θijk, ∀i, j, k. This motivates us to consider a pseudometric d(·, ·) defined on the scale of the θijk’s,
for any integers s1, s2 > 0 and any pairs of latent factor matrices and relational matrices of the corresponding dimensions. Apparently, d(·, ·) is nonnegative, symmetric and satisfies the triangle inequality. Below, we establish the convergence rate of the constrained MLEs under this pseudometric.
We first introduce some notation. For any s > s0, we define
and
where 0q denotes a q-dimensional zero vector and Op,q is an p × q zero matrix. With a slight abuse of notation, we write and . Clearly, for any s ≥ s0, we have
and hence
Let
where . When is invertible, As,0’s are invertible for all s > s0. The defined ’s satisfy the identifiability constraints in Equation 4 for all s ≥ s0. We make the following assumption.
(A1) Assume and , and s0 ≤ s ≤ smax. In addition, assume , for some .
Lemma 2.
Assume (A1) holds, . Then there exists some constant C0 > 0 such that the following event occurs with probability tending to 1,
Under the condition , we have that
When ωa and ωr are bounded, it follows that
Hence, the constrained MLEs are consistent under the pseudometric d for all overfitted models. In contrast, for underfitted models, we require the following conditions.
(A2) Assume there exists some constant such that .
(A3) Let . Assume .
Lemma 3.
Assume (A2) and (A3) hold. Then for any 1 ≤ s < s0, we have
where and are defined in (A2) and (A3), respectively.
Assumption (A3) holds if there exists some such that
When for some constant c′ > 0, it follows from Lemma 3 that
Based on these results, we establish the consistency of defined in Equation 5 below. For any sequences {an} and {bn}, we write an ~ bn if there exist some universal constants c1, c2 > 0 such that c1an ≤ bn ≤ c2an.
Theorem 1.
Assume (A1)–(A3) hold. Assume κ(n, K) satisfies
| (6) |
Then, we have Pr(ŝ = s0) → 1, where ŝ is defined in Equation 5.
Let . When are bounded, it follows from Theorem 1 that IC is consistent provided that and . Define
for some α ≥ 0. Note that
| (7) |
Consider the following criteria:
| (8) |
Note that the term τα(n, K) satisfies the conditions required in Theorem 1. It follows from Equation 7 and Theorem 1 that ICα is consistent for all α ≥ 0. When α > 0, the term τα(n, K) adjusts the model complexity penalty upwards. We notice that Bai and Ng (2002) used a similar finite sample correction term in their proposed information criteria for approximate factor models. Our simulation studies show that such an adjustment is essential to achieve selection consistency for large K.
Conditions (A1) and (A2) are directly imposed on the realizations of the latent factors. In Sections B.2 and B.3, we consider an asymptotic framework where the latent factors are i.i.d. according to some distribution function and show (A1) and (A2) hold with probability tending to 1. Therefore, under this framework, the same conclusions continue to hold, and the consistency of our information criterion remains unchanged.
Observe that we have a total of n × n × K = n2K observations. Consider the following BIC-type criterion:
| (9) |
The model complexity penalty in BIC satisfies
Hence, it does not meet Condition (6) in Theorem 1. As a result, BIC may fail to identify the true model. As shown in our simulation studies, BIC will choose overfitted models and is not selection consistent.
4. Numerical Experiments
This section is organized as follows. In Section 4.1, we introduce our algorithm for computing the maximum likelihood estimators of a logistic RESCAL model. Simulation studies are presented in Section 4.2. In Section 4.3, we apply the proposed information criteria to a real dataset.
4.1. Implementation
In this section, we propose an algorithm for computing Âs and the R̂k,s’s. The algorithm is based upon a 3-block alternating direction method of multipliers (ADMM). After introducing auxiliary splitting variables, the estimators are defined by
| (10) |
where
For any , define
Fix , the optimization problem in Equation 10 is equivalent to
We then derive its augmented Lagrangian, which gives us
where ρ > 0 is a penalty parameter.
Applying the dual descent method yields the following steps, where l denotes the iteration number:
| (11) |
| (12) |
| (13) |
Let us examine Equations 11–13 in more detail. In Equation 11, we rewrite the objective function as
Note that Lρ can be represented as a separable sum of functions. As a result, the ai’s can be solved in parallel. More specifically, we have
Hence, each ai can be computed by solving a ridge-type logistic regression with the corresponding responses and covariates.
In Equation 12, each Rk can be independently updated by solving a logistic regression with the corresponding responses and covariates, i.e.,
where ⊗ denotes the Kronecker product.
Similar to Equation 11, each latent vector in Equation 13 can be independently computed by solving a ridge-type regression with the corresponding responses and covariates.
Using arguments similar to those in Theorem 2 of Wang et al. (2017), we can show that the proposed 3-block ADMM algorithm converges for any sufficiently large ρ. In our implementation, we set ρ = nK/2. To reduce the risk of converging to a poor local solution, we randomly generate multiple initial estimators and solve the optimization problem multiple times based on these initial values.
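For readers who only need a rough reference implementation, the sketch below fits the logistic RESCAL model by plain joint gradient ascent on the log-likelihood. It is not the 3-block ADMM algorithm described above (no variable splitting or dual updates), it ignores the identifiability constraint in Equation 4, and the initialization, step size and iteration count are arbitrary illustrative choices.

```r
# Illustrative (non-ADMM) fitting sketch: gradient ascent on the logistic
# RESCAL log-likelihood. With E_k = Y_k - sigmoid(A R_k A'), the gradients
# follow from theta_ijk = a_i' R_k a_j:
#   d/dA   = sum_k ( E_k A R_k' + E_k' A R_k ),   d/dR_k = A' E_k A.
fit_rescal_gd <- function(Y, s, n_iter = 500, step = 1e-3) {
  n <- dim(Y)[1]; K <- dim(Y)[3]
  A <- matrix(rnorm(n * s, sd = 0.1), n, s)
  R_list <- replicate(K, matrix(rnorm(s * s, sd = 0.1), s, s), simplify = FALSE)
  sigmoid <- function(x) 1 / (1 + exp(-x))
  for (it in seq_len(n_iter)) {
    grad_A <- matrix(0, n, s)
    for (k in seq_len(K)) {
      E_k <- Y[, , k] - sigmoid(A %*% R_list[[k]] %*% t(A))   # residual matrix
      grad_A <- grad_A + E_k %*% A %*% t(R_list[[k]]) + t(E_k) %*% A %*% R_list[[k]]
      R_list[[k]] <- R_list[[k]] + step * t(A) %*% E_k %*% A  # update R_k
    }
    A <- A + step * grad_A                                    # update A
  }
  list(A = A, R_list = R_list)
}
```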
4.2. Simulations
We simulate from the following model:
where N(0, 1) stands for a standard normal random variable and diag(v1, … , vq) denotes a q × q diagonal matrix with the jth diagonal element equal to vj.
We consider six simulation settings. In the first three settings, we fix K = 3 and set n = 100, 150 and 200, respectively. In the last three settings, we increase K to 10, 20, 50, and set n = 50. In each setting, we further consider three scenarios, by setting s0 = 2, 4 and 8. Let smax = 12. The ADMM algorithm proposed in Section 4.1 is implemented in R. Some subroutines of the algorithm are written in C with the GNU Scientific Library (GSL, Galassi et al., 2015) to facilitate the computation. We compare the proposed ICα (see Equation 8) with the BIC-type criterion (see Equation 9). In ICα, we set α = 0, 0.5 and 1. Note that when α = 0, the finite sample correction term τα(n, K) leaves the penalty unadjusted.
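As a generic illustration of the simulation mechanism (not the exact design used in the settings above), the following R sketch generates a logistic RESCAL tensor from latent factors with i.i.d. standard normal entries, which is a placeholder choice.

```r
# Generic simulation sketch for a logistic RESCAL tensor; the N(0,1) latent
# factors below are placeholder choices, not the paper's simulation design.
simulate_rescal <- function(n, K, s0, seed = 1) {
  set.seed(seed)
  A0 <- matrix(rnorm(n * s0), n, s0)
  R0 <- replicate(K, matrix(rnorm(s0 * s0), s0, s0), simplify = FALSE)
  Y  <- array(0L, dim = c(n, n, K))
  for (k in seq_len(K)) {
    prob <- 1 / (1 + exp(-A0 %*% R0[[k]] %*% t(A0)))   # Pr(Y_ijk = 1)
    Y[, , k] <- matrix(rbinom(n * n, 1, prob), n, n)
  }
  list(Y = Y, A0 = A0, R0 = R0)
}
dat <- simulate_rescal(n = 100, K = 3, s0 = 2)
```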
Reported in Tables 1 and 2 are the percentage of replications in which the true model is selected (TP) and the average of the selected ŝ for IC0, IC0.5, IC1 and BIC over 100 replications.
Table 1:
Simulation results for Setting I, II and III (standard errors in parenthesis)
| s0 = 2 | s0 = 4 | s0 = 8 | ||||
|---|---|---|---|---|---|---|
| n = 100, K = 3 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.97 (0.02) | 2.03 (0.02) | 0.97 (0.02) | 4.03 (0.02) | 0.90(0.03) | 7.90 (0.03) |
| IC0.5 | 0.97 (0.02) | 2.03 (0.02) | 0.98 (0.01) | 4.02 (0.01) | 0.90(0.03) | 7.90 (0.03) |
| IC1 | 0.97 (0.02) | 2.03 (0.02) | 0.98 (0.01) | 4.02 (0.01) | 0.89(0.03) | 7.89 (0.03) |
| BIC | 0.00 (0.00) | 11.99 (0.01) | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 11.99 (0.01) |
|
| ||||||
| n = 150, K = 3 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.99 (0.01) | 2.01 (0.01) | 0.97 (0.02) | 4.03 (0.02) | 0.96(0.02) | 8.04 (0.02) |
| IC0.5 | 0.99 (0.01) | 2.01 (0.01) | 0.97 (0.02) | 4.03 (0.02) | 0.96(0.02) | 8.04 (0.02) |
| IC1 | 0.99 (0.01) | 2.01 (0.01) | 0.97 (0.02) | 4.03 (0.02) | 0.96(0.02) | 8.04 (0.02) |
| BIC | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 11.98 (0.01) |
|
| ||||||
| n = 200, K = 3 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.99 (0.01) | 2.01 (0.01) | 0.95 (0.02) | 4.05 (0.02) | 0.95(0.02) | 8.05 (0.02) |
| IC0.5 | 0.99 (0.01) | 2.01 (0.01) | 0.95 (0.02) | 4.05 (0.02) | 0.95(0.02) | 8.05 (0.02) |
| IC1 | 0.99 (0.01) | 2.01 (0.01) | 0.95 (0.02) | 4.05 (0.02) | 0.95(0.02) | 8.05 (0.02) |
| BIC | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 11.99 (0.01) | 0.00 (0.00) | 11.98 (0.01) |
Table 2:
Simulation results for Setting IV, V and VI (standard errors in parenthesis)
| s0 = 2 | s0 = 4 | s0 = 8 | ||||
|---|---|---|---|---|---|---|
| n = 50, K = 10 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 1.00 (0.00) | 2.00 (0.00) | 0.97 (0.02) | 4.03 (0.02) | 0.69(0.05) | 7.91 (0.06) |
| IC0.5 | 1.00 (0.00) | 2.00 (0.00) | 0.97 (0.02) | 4.03 (0.02) | 0.66(0.05) | 7.75 (0.06) |
| IC1 | 1.00 (0.00) | 2.00 (0.00) | 0.98 (0.01) | 4.02 (0.01) | 0.60(0.05) | 7.62 (0.06) |
| BIC | 0.00 (0.00) | 11.81 (0.06) | 0.00 (0.00) | 11.60 (0.06) | 0.01 (0.01) | 11.67 (0.07) |
|
| ||||||
| n = 50, K = 20 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.97 (0.02) | 2.03 (0.02) | 0.95 (0.02) | 4.05 (0.02) | 0.73(0.04) | 8.46 (0.10) |
| IC0.5 | 0.97 (0.02) | 2.03 (0.02) | 0.98 (0.01) | 4.02 (0.01) | 0.87(0.03) | 8.09 (0.03) |
| IC1 | 0.98 (0.01) | 2.02 (0.02) | 1.00 (0.00) | 4.00 (0.00) | 0.79(0.04) | 7.99 (0.05) |
| BIC | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 11.92 (0.03) | 0.00 (0.00) | 11.99 (0.01) |
|
| ||||||
| n = 50, K = 50 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.98 (0.01) | 2.02 (0.01) | 0.93 (0.03) | 4.07 (0.03) | 0.17(0.04) | 11.24 (0.15) |
| IC0.5 | 0.99 (0.01) | 2.01 (0.01) | 0.97 (0.02) | 4.03 (0.02) | 0.76(0.04) | 8.24 (0.05) |
| IC1 | 1.00 (0.00) | 2.00 (0.00) | 0.98 (0.01) | 4.02 (0.01) | 0.79(0.04) | 7.99 (0.05) |
| BIC | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 12.00 (0.00) | 0.00 (0.00) | 11.99 (0.01) |
It can be seen from Tables 1 and 2 that BIC fails in all settings: it always selects overfitted models. In contrast, the proposed information criteria select the true model in most of the settings. For example, under settings where s0 = 2 or 4, TPs of IC0, IC0.5 and IC1 are larger than or equal to 93%. When s0 = 8, except for the last setting, TPs of the proposed information criteria are no less than 60% for all cases.
IC0, IC0.5 and IC1 perform very similarly for small K. In the first three settings, TPs of these three information criteria are nearly the same for all cases. However, IC0.5 and IC1 are more robust than IC0 for large K. This can be seen in the last scenario of Setting VI, where the TP of IC0 is no more than 20%. Besides, in the last two settings, the TP of IC0 is no larger than those of IC0.5 and IC1 for all cases. These differences are due to the finite sample correction term τα(n, K). As commented before, the correction terms in IC0.5 and IC1 increase the model complexity penalty to avoid overfitting for large K.
In Section D of the Appendix, we examine the performance of our proposed information criteria under an additional simulation scenario. Results are similar to those presented in Tables 1 and 2.
4.3. Real Data Experiments
In this section, we apply the proposed information criteria to the “Social Evolution” dataset (Madan et al., 2012). This dataset comes from MIT’s Human Dynamics Laboratory. It tracks the everyday life of a whole undergraduate MIT dormitory from October 2008 to May 2009. We use the survey data, resulting in n = 84 participants and K = 5 binary relations. The five relations are: close relationship, political discussion, social interaction and two types of social media interaction.
We compute the constrained MLEs for 1 ≤ s ≤ 12 and select the number of latent factors using the proposed information criteria and BIC. It turns out that IC0, IC0.5 and IC1 all suggest the presence of 9 factors. In contrast, BIC selects 12 factors. To further evaluate the number of latent factors selected by the proposed information criteria, we consider the following cross-validation procedure. For any 1 ≤ s ≤ 12, we randomly select 80% of the observations and estimate the latent factors and relational matrices by maximizing the observed likelihood function based on these training samples. Then we compute the predicted probabilities g(âi⊤R̂kâj) for the held-out entries.
Based on these predicted probabilities, we calculate the area under the precision-recall curve (AUC) on the remaining 20% testing samples.
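For completeness, a minimal R sketch of this evaluation step is given below. It computes the area under the precision-recall curve from held-out labels and predicted probabilities using a standard average-precision-style sum, which is one of several equivalent ways to carry out this computation.

```r
# Area under the precision-recall curve for held-out entries: sort by
# predicted probability and accumulate precision over recall increments.
pr_auc <- function(y_true, y_prob) {
  ord <- order(y_prob, decreasing = TRUE)
  y   <- y_true[ord]
  tp  <- cumsum(y)                      # true positives at each threshold
  fp  <- cumsum(1 - y)                  # false positives at each threshold
  precision <- tp / (tp + fp)
  recall    <- tp / sum(y)
  sum(diff(c(0, recall)) * precision)   # average-precision approximation
}
```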
Reported in Table 3 are the AUC scores averaged over 100 replications. For any s, we denote by AUCs the corresponding AUC score. It can be seen from Table 3 that AUCs first increases and then decreases as s increases. The maximum AUC score is achieved at s = 10. Observe that AUC9 is very close to AUC10, and it is larger than the remaining AUC scores. This demonstrates that the proposed information criteria select fewer latent factors while achieving better or similar link prediction results when compared to BIC.
Table 3:
AUC scores
| s | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUC | 0.7201 | 0.8341 | 0.8952 | 0.9095 | 0.9257 | 0.9364 | 0.9444 | 0.9486 | 0.9513 | 0.9518 | 0.9485 | 0.9467 |
5. Discussion
In this paper, we propose information criteria for selecting the number of latent factors in the RESCAL tensor factorization model and prove their model selection consistency. Although we focus on the logistic RESCAL model, the proposed information criteria can be applied to general tensor factorization models. More specifically, consider the following class of models:
| (14) |
with any of (or without) the following constraints:
(C1) Rk is diagonal for each 1 ≤ k ≤ K;
(C2) ai = bi for each 1 ≤ i ≤ n,
for some strictly increasing function g, and some mean zero random errors .
As commented in Section 2.1, such representation includes the RESCAL, CP and TUCKER-2 models. Specifically, it reduces to the TUCKER-2 model by setting g to be the identity function. If further (C1) holds, then the model in Equation 14 reduces to the CP model. When (C2) holds, it corresponds to the RESCAL model. Consider the following information criteria,
where the likelihood function is evaluated at the corresponding (constrained) MLEs. Similar to Theorem 1, we can show that with some properly chosen κ(n, K), IC is consistent under this general setting.
Currently, we assume the tensor Y is completely observed. When some of the Yijk’s are missing, we can calculate the constrained MLEs by maximizing the following observed likelihood function
where Nobs denotes the set of the observed responses. The above optimization problem can also be solved by a 3-block ADMM algorithm. Define the following class of information criteria,
where denotes the percentage of observed responses. Consistency of ICobs can be similarly studied.
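As a minimal illustration of the observed-likelihood idea (assuming missing responses are coded as NA), the sketch below evaluates the log-likelihood only over the cells in Nobs; it is not the paper's ADMM implementation.

```r
# Observed-data log-likelihood: missing entries of Y are coded as NA and
# dropped from the sum, so only the observed cells contribute.
rescal_loglik_obs <- function(Y, A, R_list) {
  ll <- 0
  for (k in seq_along(R_list)) {
    Theta <- A %*% R_list[[k]] %*% t(A)
    obs   <- !is.na(Y[, , k])
    ll <- ll + sum(Y[, , k][obs] * Theta[obs] - log1p(exp(Theta[obs])))
  }
  ll
}
```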
Acknowledgments
The authors wish to thank the Associate Editor and anonymous referees for their constructive comments, which led to significant improvement of this work.
Appendix A. More on the Identifiability Constraint
Let π(·) be a permutation function of {1, … , n}. As an alternative to our estimator defined in Equation 3, one may consider
and the corresponding information criteria
Since for any invertible matrix is also the maximizer of subject to the constraint that is invertible. Similarly, the estimator is the maximizer of subject to the constraint that is invertible. As a result, we have IC(s) = ICπ(s) as long as
| (15) |
However, it remains unknown whether Equation 15 holds or not. Hence, there is no guarantee that IC(s) = ICπ(s). This means the choice of the identifiability constraint might affect the value of our proposed information criterion.
In the following, we prove
| (16) |
This means with probability tending to 1, all the information criteria with different identifiability constraints will select the true model. Therefore, the choice of the identifiability constraint will not affect the performance of our method. For any , let ,
We need the following condition.
(A4) Assume and and any permutation function π(·). In addition, assume , for some .
Corollary 1.
Assume (A2)–(A4) hold and κ(n, K) satisfies Condition (6). Then, (16) is satisfied.
Appendix B. More on the Technical Conditions
B.1. Discussion of Condition (A0)
In this section, we show the necessity of (A0) when the matrices are symmetric. More specifically, when (A0) doesn’t hold, we show there exist some and such that
| (17) |
Let’s first consider the case where rank(A0) = s for some s < s0. Thus, it follows that
for some and . Set to be the ith row of and , Equation 17 is thus satisfied. In addition, the new matrix shall have full column rank.
Let . Consider the case where rank(R0) = s for some s < s0. It follows from the singular value decomposition that
| (18) |
for some diagonal matrix and some matrices , that satisfy . Denote by Uk,0 the submatrix of U0 formed by the corresponding rows and by the columns in {1, 2, … , s}. It follows from Equation 18 that
Since Rk,0 is symmetric, we have . Notice that . It follows that . Therefore, we have
| (19) |
Define and . In view of Equation 19, it is immediate to see that Equation 17 holds. Since , we have . As a result, are also symmetric. Suppose doesn’t have full column rank. Using the same arguments, we can find some such that
| (20) |
We may repeat this procedure until we find some , ’s satisfy Equation 20 and that the matrix has full column rank.
B.2. Discussion of Condition (A1)
In this section, we consider an asymptotic framework where are i.i.d according to some distribution function and show (A1) holds with probability tending to 1. For any q-dimensional vector , let denote its Euclidean norm. For any m × q matrix Q, stands for the spectral norm of Q while denotes its Frobenius norm. For simplicity, we assume and . Assume is bounded with probability 1. In addition, assume there exist some constants such that
| (21) |
When s0 = 1, the above condition is closely related to the margin assumption (Tsybakov, 2004; Audibert and Tsybakov, 2007) in the classification literature. It automatically holds when a1,0 has a bounded probability density function.
In the following, we show with proper choice of ωa and ωr, (A1) holds with probability tending to 1. By the identifiability constraints, and . When , we have for sufficiently large n,
| (22) |
By the definition of As,0, we have
| (23) |
where . It follows that
Under the given conditions, we have with probability 1 for some constant ω0 > 0. Therefore, we have
By (21), it is immediate to see that
as long as . This together with Equation 22 yields
Combining Equation 23 with the definition of yields
Since we have
and
By (21), we have with probability tending to 1, for any ωr such that .
B.3. Discussion of Condition (A2)
Assume the matrix is positive definite. Since s0 is fixed, it follows from the law of large numbers that
Therefore, (A2) holds with probability tending to 1.
Appendix C. Proofs
In the following, we provide proofs of Lemma 1, Lemma 2 and Theorem 1. We define
for any , and any integer s ≥ 1. Define and
C.1. Proof of Lemma 1
Assume there exist some such that
Since g(·) is strictly monotone, we have
or equivalently,
where . Thus, it follows that
By (A0), the matrix is invertible. As a result, we have
Therefore,
Notice that the matrix is invertible under Condition (A0). It follows that
By Lemma 5.1 in Banerjee and Roy (2014), we have rank(C) ≥ rank(A0) = s0. Therefore, C is invertible. It follows that
| (24) |
or equivalently,
By Equation 24, we obtain and hence
Since is invertible, this further implies or equivalently,
The proof is hence completed.
C.2. Proof of Lemma 2
To prove Lemma 2, we need the following lemma.
Lemma 4 (Mendelson et al. (2008), Lemma 2.3).
Given d ≥ 1, and ε > 0, we have
where is the unit ball in , and the covering number with respect to the Euclidean metric (see Definition 2.2.3 in van der Vaart and Wellner (1996) for details).
Under Condition (A1), we have and hence
| (25) |
Besides, we have
| (26) |
and
| (27) |
Therefore,
| (28) |
Similarly, we can show
| (29) |
We define . It follows from a second-order Taylor expansion that
| (30) |
for some lying on the line segment joining and . By (28) and (29), we have for any i, j, k and . This together with Equation 30 gives that
and hence
| (31) |
In the following, we provide an upper bound for
where .
Let and be a minimal εa-net of the vector space . It follows from Lemma 4 that
| (32) |
Let , and be a minimal εr-net of the vector space . For any s × s matrices Q, we have . Similar to (32), we can show that
| (33) |
Hence, for any satisfying there exist some , such that
| (34) |
This further implies
| (35) |
Therefore, we have
| (36) |
By Bernstein’s inequality (van der Vaart and Wellner, 1996, Lemma 2.2.9), we obtain for any t > 0,
| (37) |
where
With some calculations, we can show that
| (38) |
and
| (39) |
Let , we have
| (40) |
where the first inequality is due to Bonferroni’s inequality, the second inequality follows by (32) and (33), the third inequality is due to (37)-(39).
Under the given conditions, we have and hence
It follows that for any ,
By Bonferroni’s inequality, we have
This together with (36) implies that
| (41) |
with probability tending to 1. Combining this with (26) and (27), we obtain with probability tending to 1,
| (42) |
Therefore, it follows from (31) that
| (43) |
where the third inequality is due to that , for all .
Let . As , we have
Since , it follows from (43) that
| (44) |
For any integer m ≥ 1, define
For any , similar to (31), we can show that
| (45) |
The event implies that
It follows from (45) that
| (46) |
For any {li}i and {tk}k satisfying (34), it follows from (35) that
| (47) |
Let . Similar to (36), we can show
Under the event defined in (46), for any m ≥ 9, we have
Define
By (47), it is immediate to see that
| (48) |
Similar to (37) and (40), we can show there exist some constants J0 > 0, K0 > 0 such that for any m ≥ J0 and any s such that s0 ≤ s ≤ smax,
where the second inequality is due to (38), (48) and the fact that This yields
| (49) |
Let Jn,K be the integer such that In view of (44), we have for any m ≥ J0,
This implies there exists some constant C0 > 0 such that the following event occurs with probability tending to 1,
| (50) |
The proof is hence completed.
C.3. Proof of Lemma 3
For any s < s0, we have
| (51) |
Define
The above minimizers are not unique. Notice that rank . Assume has the following singular value decomposition,
for some and some diagonal matrix
such that Then one solution is given by
where is the submatrix of Un formed by its first s columns.
Since we have
and hence
This together with (51) implies that
| (52) |
| (53) |
where (52) is due to that and the equality in (53) is due to that
To summarize, we’ve shown
| (54) |
In the following, we provide a lower bound for By definition, is the s0-th largest eigenvalue of
We first provide a lower bound. Consider the following eigenvalue decomposition:
for some orthogonal matrix UA and some diagonal matrix Under Assumption (A2), the matrix
is positive semidefinite. As a result, the matrix
is positive semidefinite. Therefore, we have
| (55) |
By the eigenvalue decomposition, we have
for some orthogonal matrix and some diagonal matrix . It follows from (55) that all the diagonal elements in are positive. Let be the diagonal matrix such that Apparently, the diagonal elements in are nonzero. Notice that
The s0 largest eigenvalues in corresponds to the smallest eigenvalue in
Similar to (55), we can show that
Combining this together with (55), we obtain that
It follows from (54) that
This completes the proof.
C.4. Proof of Theorem 1
It suffices to show
| (56) |
and
| (57) |
We first show (56). Combining Lemma 3 with (31), we obtain that
for any Combining this with (42), we have that
| (58) |
with probability tending to 1.
Under the given conditions, we have . This together with the condition yields
| (59) |
By (58), we have with probability tending to 1 that
| (60) |
for all 1 ≤ s < s0. By definition, we have
This together with (60) gives that for all 1 ≤ s < s0,
| (61) |
with probability tending to 1.
Under the given conditions, we have Under the event defined in (61), we have that
since s0 is fixed. This proves (56).
Now we show (57). Similar to (37)-(42), we can show the following event occurs with probability tending to 1,
| (62) |
By Lemma 2, we obtain with probability tending to 1,
and hence
| (63) |
Since under the event defined in (63), we have
Notice that
we have with probability tending to 1 that
| (64) |
for all Under the event defined in (64), we have
Under the condition we have that
This proves (57). The proof is hence completed.
C.5. Proof of Corollary 1
Using similar arguments in Lemma 3, we can show for any permutation function π :
| (65) |
In addition, it follows from (41) and the condition that
and
for some constant C0 > 0. Hence, the following event occurs with probability tending to 1,
| (66) |
Combining (65) with (31) yields
| (67) |
Under the given conditions, we have
This together with (66) and (67) gives
with probability tending to 1.
Under Condition (A4), we have Therefore, the following event occurs with probability tending to 1,
Under the given conditions in Corollary 1, we obtain
| (68) |
Similar to (46) and (49), we can show
for some constant K0 > 0. Using similar arguments in the proof of Lemma 2, this yields with probability tending to 1,
This together with (62) gives
with probability tending to 1. Using similar arguments in the proof of Theorem 1, we can show
This together with (68) yields (16). The proof is hence completed.
Appendix D. Additional Simulation Results
We simulate the response from the following model:
We use the same six settings as described in Section 4.2. In each setting, we further consider three scenarios, by setting s0 = 2, 4 and 6. Reported in Tables 4 and 5 are the percentage of selecting the true model (TP) and the average of the selected ŝ for IC0, IC0.5, IC1 and BIC over 100 replications.
Table 4:
Simulation results for Setting I, II and III (standard errors in parenthesis)
| s0 = 2 | s0 = 4 | s0 = 6 | ||||
|---|---|---|---|---|---|---|
| n = 100, K = 3 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 1.00(0.00) | 2.00(0.00) | 0.96(0.02) | 3.98(0.02) | 0.88(0.03) | 5.87(0.04) |
| IC0.5 | 1.00(0.00) | 2.00(0.00) | 0.96(0.02) | 3.98(0.02) | 0.88(0.03) | 5.87(0.04) |
| IC1 | 1.00(0.00) | 2.00(0.00) | 0.96(0.02) | 3.98(0.02) | 0.85(0.04) | 5.81(0.05) |
| BIC | 0.00(0.00) | 11.98(0.01) | 0.00(0.00) | 11.99(0.01) | 0.00(0.00) | 12.00(0.00) |
|
| ||||||
| n = 150, K = 3 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 1.00(0.00) | 2.00(0.00) | 0.97(0.02) | 4.03(0.02) | 0.94(0.02) | 6.04(0.02) |
| IC0.5 | 1.00(0.00) | 2.00(0.00) | 0.97(0.02) | 4.03(0.02) | 0.94(0.02) | 6.04(0.02) |
| IC1 | 1.00(0.00) | 2.00(0.00) | 0.97(0.02) | 4.03(0.02) | 0.94(0.02) | 6.04(0.02) |
| BIC | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 11.99(0.01) |
|
| ||||||
| n = 200, K = 3 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 1.00(0.00) | 2.00(0.00) | 0.97(0.02) | 4.03(0.02) | 0.98(0.01) | 6.02(0.01) |
| IC0.5 | 1.00(0.00) | 2.00(0.00) | 0.97(0.02) | 4.03(0.02) | 0.98(0.01) | 6.02(0.01) |
| IC1 | 1.00(0.00) | 2.00(0.00) | 0.97(0.02) | 4.03(0.02) | 0.98(0.01) | 6.02(0.01) |
| BIC | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 11.99(0.01) |
It can be seen from Tables 4 and 5 that our proposed information criteria select the true model in most settings. In contrast, BIC fails in all settings. In addition, in the last two settings, TPs of IC0.5 and IC1 are larger than those of IC0 for most of the cases. As commented before, these differences are due to the finite sample correction term τα(n, K).
Table 5:
Simulation results for Setting IV, V and VI (standard errors in parenthesis)
| s0 = 2 | s0 = 4 | s0 = 6 | ||||
|---|---|---|---|---|---|---|
| n = 50, K = 10 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 1.00(0.00) | 2.00(0.00) | 0.96(0.02) | 3.98(0.02) | 0.73(0.04) | 5.83(0.06) |
| IC0.5 | 1.00(0.00) | 2.00(0.00) | 0.95(0.02) | 3.97(0.02) | 0.69(0.05) | 5.77(0.06) |
| IC1 | 1.00(0.00) | 2.00(0.00) | 0.93(0.03) | 3.93(0.03) | 0.63(0.05) | 5.57(0.07) |
| BIC | 0.00(0.00) | 11.83(0.05) | 0.00(0.00) | 11.82(0.04) | 0.00(0.00) | 11.86(0.04) |
|
| ||||||
| n = 50, K = 20 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.98(0.01) | 2.02(0.01) | 0.90(0.03) | 4.10(0.03) | 0.76(0.04) | 6.06(0.05) |
| IC0.5 | 0.98(0.01) | 2.02(0.01) | 0.94(0.02) | 3.98(0.02) | 0.81(0.04) | 5.99(0.04) |
| IC1 | 0.98(0.01) | 2.02(0.01) | 0.94(0.02) | 3.94(0.02) | 0.74(0.04) | 5.81(0.05) |
| BIC | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 11.99(0.01) |
|
| ||||||
| n = 50, K = 50 | TP | Avg. ŝ | TP | Avg. ŝ | TP | Avg. ŝ |
| IC0 | 0.96(0.02) | 2.04(0.02) | 0.88(0.03) | 4.12(0.03) | 0.68(0.05) | 6.57(0.13) |
| IC0.5 | 0.98(0.01) | 2.02(0.01) | 0.94(0.02) | 4.04(0.02) | 0.82(0.04) | 6.06(0.04) |
| IC1 | 0.98(0.01) | 2.02(0.01) | 0.94(0.02) | 4.02(0.02) | 0.74(0.04) | 5.75(0.05) |
| BIC | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 12.00(0.00) | 0.00(0.00) | 12.00(0.00) |
References
- Allen Genevera. Sparse higher-order principal components analysis. In International Conference on Artificial Intelligence and Statistics, pages 27–36, 2012.
- Audibert Jean-Yves and Tsybakov Alexandre B. Fast learning rates for plug-in classifiers. Ann. Statist., 35(2):608–633, 2007. doi: 10.1214/009053606000001217.
- Bai Jushan and Ng Serena. Determining the number of factors in approximate factor models. Econometrica, 70(1):191–221, 2002. doi: 10.1111/1468-0262.00273.
- Banerjee Sudipto and Roy Anindya. Linear Algebra and Matrix Analysis for Statistics. CRC Press, 2014.
- Boyd Stephen, Parikh Neal, Chu Eric, Peleato Borja, and Eckstein Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
- Chi Eric C. and Kolda Tamara G. On tensors, sparsity, and nonnegative factorizations. SIAM J. Matrix Anal. Appl., 33(4):1272–1299, 2012. doi: 10.1137/110859063.
- Fan Yingying and Tang Cheng Yong. Tuning parameter selection in high dimensional penalized likelihood. J. R. Stat. Soc. Ser. B. Stat. Methodol., 75(3):531–552, 2013. doi: 10.1111/rssb.12001.
- Galassi Mark, Davies Jim, Theiler James, Gough Brian, Jungman Gerard, Alken Patrick, Booth Michael, Rossi Fabrice, and Ulerich Rhys. GNU Scientific Library Reference Manual (Version 2.1), 2015. URL http://www.gnu.org/software/gsl/.
- Getoor Lise and Mihalkova Lilyana. Learning statistical models from relational data. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 1195–1198. ACM, 2011.
- Harshman Richard A. Models for analysis of asymmetrical relationships among n objects or stimuli. Paper presented at the First Joint Meeting of the Psychometric Society and the Society for Mathematical Psychology, Hamilton, Ontario, August 1978.
- Harshman Richard A. and Lundy Margaret E. PARAFAC: parallel factor analysis. Computational Statistics & Data Analysis, 18(1):39–72, 1994.
- Kemp Charles, Tenenbaum Joshua B., Griffiths Thomas L., Yamada Takeshi, and Ueda Naonori. Learning systems of concepts with an infinite relational model. In AAAI, volume 3, page 5, 2006.
- Kok Stanley and Domingos Pedro. Statistical predicate invention. In Proceedings of the 24th International Conference on Machine Learning, pages 433–440. ACM, 2007.
- Kolda Tamara G. and Bader Brett W. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, 2009. doi: 10.1137/07070111X.
- Madan Anmol, Cebrian Manuel, Moturu Sai, Farrahi Katayoun, et al. Sensing the "health state" of a community. IEEE Pervasive Computing, 11(4):36–45, 2012.
- Mendelson Shahar, Pajor Alain, and Tomczak-Jaegermann Nicole. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constructive Approximation, 28(3):277–289, 2008.
- Nickel Maximilian. Tensor Factorization for Relational Learning. PhD thesis, Ludwig-Maximilians-University of Munich, 2013.
- Nickel Maximilian and Tresp Volker. Logistic tensor factorization for multi-relational data. arXiv preprint arXiv:1306.2084, 2013.
- Nickel Maximilian, Tresp Volker, and Kriegel Hans-Peter. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 809–816, 2011.
- Nickel Maximilian, Murphy Kevin, Tresp Volker, and Gabrilovich Evgeniy. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
- Nowicki Krzysztof and Snijders Tom A. B. Estimation and prediction for stochastic blockstructures. J. Amer. Statist. Assoc., 96(455):1077–1087, 2001. doi: 10.1198/016214501753208735.
- Richardson Matthew and Domingos Pedro. Markov logic networks. Machine Learning, 62(1):107–136, 2006.
- Schwarz Gideon. Estimating the dimension of a model. Ann. Statist., 6(2):461–464, 1978.
- Sun Will Wei, Lu Junwei, Liu Han, and Cheng Guang. Provable sparse tensor decomposition. J. R. Stat. Soc. Ser. B. Stat. Methodol., 79(3):899–916, 2017. doi: 10.1111/rssb.12190.
- Tsybakov Alexandre B. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32(1):135–166, 2004. doi: 10.1214/aos/1079120131.
- Tucker Ledyard R. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
- van der Vaart Aad W. and Wellner Jon A. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer-Verlag, New York, 1996. doi: 10.1007/978-1-4757-2545-2.
- Wang Yu, Yin Wotao, and Zeng Jinshan. Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324, 2017.
- Yang Yun and Dunson David B. Bayesian conditional tensor factorizations for high-dimensional classification. J. Amer. Statist. Assoc., 111(514):656–669, 2016. doi: 10.1080/01621459.2015.1029129.
- Zhang Xiang, Wu Yichao, Wang Lan, and Li Runze. A consistent information criterion for support vector machines in diverging model spaces. J. Mach. Learn. Res., 17(1):1–26, 2016.