Abstract
A number of real-world networks are heterogeneous information networks, which are composed of different types of nodes and links. Numerical prediction in heterogeneous information networks is a challenging but significant area, because the network-based information available for unlabeled objects is usually too limited to support precise estimates. In this paper, we consider a graph regularized meta-path based transductive regression model (Grempt), which combines the principal philosophies of typical graph-based transductive classification methods and transductive regression models designed for homogeneous networks. Our method is efficient in both time and space, and numerical experiments verify the precision of the model.
1 Introduction
The real world is full of information networks. In these networks, some quantities (or attributes) associated with objects (or nodes) are usually of most interest. A number of information networks are heterogeneous information networks (HIN), for example the IMDb network, which contains movies, actors, directors, writers, and studios as different types of objects. Movies in this network are not linked directly, but they can be linked through shared actors, directors, studios or writers. Links have different types just as nodes have different types. Figure 1 is an example of the IMDb network, which uses different shapes and colors to indicate different object and link types.
Figure 1. An example of a heterogeneous information network: the IMDb network, composed of movies, directors, writers, actors, studios and the relationships among them.
Numeric prediction in HIN is important in real-world cases. For example, people may be interested in predicting the box office sales of an upcoming movie based on the IMDb network, or predicting the total number of citations of an author based on the DBLP plus citation network. Moreover, some real-world inductive regression problems can be transformed into transductive learning problems by constructing a network structure. This network structure regularization offers additional information to overcome the weakness of standard inductive regression. In this paper, we adhere to the transduction setting and develop a numerical prediction method based on HIN.
Numerical prediction in heterogeneous networks has not been thoroughly studied before. However, classification in heterogeneous networks [1–4] and regression in homogeneous networks [5, 6] have been studied. Some homogeneous and heterogeneous graph-based classification methods do provide numerical ‘soft’ predictions before assigning the class labels [1, 2, 7]. However, two challenges arise if we apply these methods directly to the numeric prediction problem: 1) most classification methods arbitrarily set the labels of unlabeled objects to zero in the fitting constraint items, which dramatically shrinks the numeric predictions toward zero; 2) if unlabeled objects are removed from the fitting constraint items, the predictions can have large variance, since numeric prediction is highly sensitive to the global network structure. Cortes and Mohri [5] addressed this problem in homogeneous networks by estimating pseudo-labels of unlabeled objects from local information and adding an item that controls the distance between predictions and pseudo-labels. Following the philosophy of this regularization framework, we combine the idea of locally estimated labels for unlabeled objects with meta-path based HIN modeling techniques to develop a graph regularized meta-path based transductive regression model (Grempt) in HIN. Compared with previous HIN models, the contributions of our model can be summarized as follows:
Our study is the first one to address the numerical prediction problem in heterogeneous information networks;
The response variable is restricted to a single type of object, and meta-paths and PathSim are used to measure the similarity between objects;
Locally estimated labels (pseudo-labels) are used to regularize the numerical predictions and improve their precision;
The contribution of each type of meta-path which corresponds to specific semantic meaning can be automatically obtained from our model.
We briefly introduce related work in Section 2. In Section 3, we introduce the background of heterogeneous information networks. In Section 4, we introduce our Grempt model and the implementation algorithm. Details and results of the experiments are described in Section 5. In Section 6, we provide our conclusion and future directions.
2 Related Work
A straightforward way to predict an unknown attribute of an object in the network is to exploit the information of its neighbors. Relational Neighbor Classifier [8] and Nearest Neighbor Prediction [9] are typical methods that follow this philosophy.
Another well established prediction method in a homogeneous setting is Kernel Regression, which restricts the search for an appropriate estimator of labeled and unlabeled objects ĥ in Reproducing Kernel Hilbert Space ℋk [10]. Transductive Regression in homogeneous networks can be regarded as a generalization of kernel regression, where the idea of exploiting neighborhood information is also included [5,6].
For heterogeneous networks, some graph-based classification models [1–3] have been proposed. The general framework of these methods is based on the similar assumptions of kernel regression, which has a two-item objective function – the global structure smoothness item and the goodness-of-fit item. However, these classification methods either do not include unlabeled objects in the second item or arbitrarily set the labels of unlabeled objects to be zeros in the fitting constraint items, which may not be suitable for our numeric prediction problem.
3 Background
3.1 Problem Definition
In this study, a heterogeneous information network (HIN) is defined as a graph G = (V, E), where V = X1 ∪ X2 ∪ … ∪ Xt is the set of data objects, Xi = {xi1, …, xini}, i = 1, 2, …, t, are the t types of data objects, and E = {links between any two data objects in V}. If link weights are specified, G = (V, E) can be extended to G = (V, E, R), where R = {weights of links in E} and V, E are defined as before. We are interested in a particular type of object and its associated numerical variable.
Suppose χ={X1, X2, …, Xt}. Given some labels of a numerical variable Y associated with a particular type of objects X* ∈ χ, the problem is to predict this variable for unlabeled objects of this type. Different from standard inductive regression which requires a fully labeled training set to derive a specific function, we consider the transductive setting where the unlabeled objects are involved in the learning procedure and the specific function is not of interest. This problem can be formally defined as:
Definition 1. (Transductive Regression on HIN)
For a given HIN G = (V, E), suppose variable Y is associated with X* ∈ χ. Suppose the number of labeled objects is n and the number of unlabeled objects is m. Given the full space of X*, which is composed of n + m objects x1, x2, …, xn, xn+1, …, xn+m, the labeled subspace with Y can be defined as
$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\},$$
and the remaining objects xn+1, …, xn+m are regarded as unlabeled objects. If the purpose of the learning procedure is to infer yn+1, …, yn+m for the unlabeled objects, we call it transductive regression.
3.2 Meta-path and Meta-path Based Similarity
In most cases, it may not be suitable to force the target variable Y to represent the characteristics of all types of objects. For example, among movie, actor, actress, studio, genre, writer and other object types in the IMDb network, box office sales is only suitable to be associated with movie. In addition, because of the diversity of links, HINs usually include a large number of objects and edges. Thus the computational cost is high if all types of objects are considered in the whole learning procedure. Therefore, we need to pre-compute some measures which could represent the type of links and then only focus on our target type of objects in the subsequent procedure.
Meta-path and meta-path based similarity have been studied and applied in several HIN related problems [3,4,11,12]. Our model is to shrink the topology of G = (V, E) based on different types of meta-paths and only keep the objects of interest. Thus we define network schema and topology-shrinking sub-networks in the following paragraphs. Sun et al. defined the network schema as a meta template for a heterogeneous network, and they provided the definition of Meta-path based on this network schema [11]. If 𝒜 denotes object types and ℛ denotes relation types, then a meta-path P can be denoted as A1 → R1 → A2 → R2 → … → Rl → Al+1, where Ai ∈ 𝒜 and Rj ∈ ℛ. This meta-path P indicates a composite relation R = R1 ◦R2 ◦… ◦Rl between types A1 and Al+1, where ◦ denotes the composition operator on relations [11].
For this transductive regression problem, we only consider meta-paths where A1 = Al+1. This is because we are only interested in one certain type T of objects, such as movies in IMDb network and papers in DBLP network. Suppose T is the type of X*. Given one or more meta-paths P1, …, PK in which A1 = Al+1 = T, we can define the topology shrinking sub-network composed of a particular type of objects as the following.
Definition 2. (Topology shrinking Sub-network)
Given a heterogeneous information network G = (V, E) and a type of meta-paths P, the topology shrinking subnetwork of the certain object type T can be denoted as GT = (VT, ET, RT), VT = X*, ET = {euv|pxu⇝xv ∈ P,xu,xv ∈ VT} and RT = {Ruv|Ruv is the weight of euv ∈ ET}, where pxu⇝xv denotes a path instance between xu and xv.
Given a set of meta-paths P1, P2, …, PK, our analysis is based on the corresponding set of associated topology-shrinking sub-networks $G_T^{(k)}$, k = 1, 2, …, K. For the particular IMDb example in Figure 1, Figure 2 shows four sub-graphs extracted based on four different meta-paths: a) movie-actor-movie; b) movie-director-movie; c) movie-writer-movie; d) movie-studio-movie.
Figure 2. Four sub-graphs extracted from the IMDb example shown in Figure 1 based on four different meta-paths: a) movie-actor-movie; b) movie-director-movie; c) movie-writer-movie; d) movie-studio-movie.
Once we have the structure of $G_T^{(k)}$, we need to decide the weight of each link. We therefore introduce the meta-path based similarity measure PathSim [11], which favors objects with strong connectivity and similar visibility, i.e. “peers”, under the given meta-path. Given a symmetric meta-path Pk, PathSim between two objects xu and xv of the same type can be defined as
$$s_k(x_u, x_v) = \frac{2\,\left|\{p_{x_u \rightsquigarrow x_v} : p_{x_u \rightsquigarrow x_v} \in P_k\}\right|}{\left|\{p_{x_u \rightsquigarrow x_u} : p_{x_u \rightsquigarrow x_u} \in P_k\}\right| + \left|\{p_{x_v \rightsquigarrow x_v} : p_{x_v \rightsquigarrow x_v} \in P_k\}\right|},$$
where pxu⇝xv, pxu⇝xu and pxv⇝xv are path instances. Then for a homogeneous sub-graph $G_T^{(k)}$, the weight of the link between xu and xv can be defined as the PathSim measure sk(xu,xv) based on the meta-path Pk. If there is no link between an object xu and itself, the weight will be zero. In this study, we use a relation matrix R(k) to denote the link weights of $G_T^{(k)}$. To simplify computation, we only consider undirected $G_T^{(k)}$ in subsequent sections, and thus R(k) is symmetric. However, the same procedure can be used for numerical prediction on directed graphs as well.
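As an illustration of how PathSim weights can be computed for a length-two symmetric meta-path such as movie-actor-movie, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical biadjacency matrix `W` whose entry `W[u, a]` counts the links between target object u and intermediate object a, so that `W @ W.T` counts path instances. The function name `pathsim_matrix` is introduced here for illustration only.

```python
import numpy as np

def pathsim_matrix(W):
    """PathSim for a symmetric meta-path T -> A -> T.

    W is a (num_targets x num_intermediate) biadjacency matrix
    (hypothetical input): W[u, a] counts links between u and a.
    """
    W = np.asarray(W, dtype=float)
    M = W @ W.T                       # M[u, v]: number of path instances u ~> v
    diag = np.diag(M)                 # M[u, u]: number of path instances u ~> u
    denom = diag[:, None] + diag[None, :]
    S = np.zeros_like(M)
    # Leave the similarity at 0 where both objects have no path instances.
    np.divide(2.0 * M, denom, out=S, where=denom > 0)
    return S

# Toy example: 3 movies and 2 actors.
W = np.array([[1, 1],
              [1, 0],
              [0, 1]])
S = pathsim_matrix(W)
```

In this toy example the first movie shares one actor with each of the other two, while the second and third movies share none, so their similarity is zero. The diagonal of `S` equals 1 for every object with at least one path instance; whether self-similarities are kept in R(k) is a modeling choice that this sketch does not settle.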
4 Model
Our graph regularized meta-path based transductive regression model (Grempt) is based on the consistency among network data. In the context of meta-path and similarity measure PathSim, our model follows three principles: 1) predictions of the target variable of two linked objects are likely to be similar, and the tighter the link is, the more similar the two predictions are; 2) predictions of the target variable of labeled objects should be similar to their labels; 3) predictions of the target variable of unlabeled objects should be similar to their local estimated labels (pseudo-labels).
In particular, the pseudo-label is significant in our model because it introduces local regularization to improve the prediction, which is also the key difference between Grempt and previous HIN models. If only global information is included, the prediction shrinks toward the global mean, which might hurt the performance on some sparse HINs. However, using only local estimates is not sufficiently robust in network prediction problems. Combining global and local information can be regarded as a kind of model averaging that takes advantage of both types of model, so that the prediction power improves based on both global consistency and local consistency.
In this section, we will first introduce the general framework of our model based on these intuitions. Then we will describe the details of estimating pseudo-labels and the algorithm for optimizing the objective function which controls both global and local consistency.
4.1 Global and Local Graph Regularized Framework
The Grempt model is a constrained optimization framework based on the three consistency principles discussed above. Given K meta-paths P1, P2, …, PK, we can obtain a set of topology-shrinking homogeneous sub-networks of type T, denoted by $G_T^{(k)}$, k = 1, 2, …, K. We first introduce some notation as follows:
yL = (y1, …, yn)T denotes a vector of true labels of labeled objects x1, …, xn;
yU = (yn+1, …, yn+m)T denotes a vector of true labels of unlabeled objects xn+1, …, xn+m;
ỹU = (ỹn+1, …, ỹn+m)T denotes a vector of pseudo-labels of unlabeled objects xn+1, …, xn+m;
w = (w1, …, wK) denotes a vector of weights of the sub-networks $G_T^{(k)}$, k = 1, 2, …, K.
Suppose the estimate of yu from our model is denoted by fu, u = 1, …, n, n+1, …, n+m. Then we have the following notation:
fL = (f1, …, fn)T denotes estimations of yL;
fU = (fn+1, …, fn+m)T denotes predictions of yU;
Then the objective function in our optimization framework can be defined as
$$J(w; f) = \Omega(w; f) + \alpha_1 C_1(f_L; y_L) + \alpha_2 C_2(f_U; \tilde{y}_U) \tag{4.1}$$
In this function, α1 and α2 are two given parameters, and Ω(w;f), C1(fL;yL) and C2(fU;ỹU) are three different loss functions to guarantee the previous three principles respectively.
The first item Ω(w; f) in the objective function (4.1) is a composite graph regularization item controlling the global consistency among all the topology-shrinking sub-graphs $G_T^{(k)}$, k = 1, 2, …, K. It can be defined as
$$\Omega(w; f) = \frac{1}{2} \sum_{k=1}^{K} w_k \sum_{u,v=1}^{n+m} R_{uv}^{(k)} \left( \frac{f_u}{\sqrt{D_{uu}^{(k)}}} - \frac{f_v}{\sqrt{D_{vv}^{(k)}}} \right)^2, \tag{4.2}$$
where $D_{uu}^{(k)}$ is the summation of the u-th row in R(k). This item controls not only the global consistency of each graph but also the consistency across different sub-graphs, where w and f are two sets of unknown variables. With the constraint (4.5), the weight vector w reflects the contribution of the structure of each sub-graph $G_T^{(k)}$ to the consistency of the target variable Y.
C1(fL;yL) in (4.1) is a loss function controlling the difference between the predicted values fL and the given labels yL of labeled objects. In the Grempt model, we use a quadratic loss function which can be simply defined as
$$C_1(f_L; y_L) = (f_L - y_L)^T (f_L - y_L) = \sum_{u=1}^{n} (f_u - y_u)^2. \tag{4.3}$$
Similarly, C2(fU; ỹU) in (4.1) is a loss function controlling the difference between the predicted values fU and the pseudo-labels ỹU of unlabeled objects. This pseudo-label estimation usually involves location information and thus can be treated as a local consistency constraint. Since estimating the pseudo-label can also introduce errors, both the raw difference and the reliability of the pseudo-label, represented by its variance, need to be taken into account. Therefore a Mahalanobis-distance-type loss function is used in our Grempt model. Specifically, it can be defined as
$$C_2(f_U; \tilde{y}_U) = (f_U - \tilde{y}_U)^T \Sigma^{-1} (f_U - \tilde{y}_U) = \sum_{v=1}^{m} \frac{(f_{n+v} - \tilde{y}_{n+v})^2}{\sigma_{n+v}^2}, \tag{4.4}$$
where $\sigma_{n+v}^2$ is the variance of xn+v's pseudo-label estimation and Σ is an m × m diagonal matrix, the (v, v)-th element of which is $\sigma_{n+v}^2$. The specific pseudo-label estimation procedure is introduced later.
The parameters α1 and α2 control the trade-off among all three items. These two parameters can be assigned based on prior knowledge or determined by cross-validation.
Our target is to seek f and w that minimize this objective function subject to a constraint δ(w) = 0. Specifically, in our model we use the constraint function $\delta(w) = \sum_{k=1}^{K} \exp(-w_k) - 1$. This constraint ensures that the problem can be converted into a convex optimization problem and that a closed-form global optimum of wk can be obtained. The problem can then be written as
$$\min_{w, f} \; J(w; f)$$
subject to
$$\sum_{k=1}^{K} \exp(-w_k) = 1. \tag{4.5}$$
We thus conclude that the optimization algorithm can be implemented in two stages:
Estimating pseudo-labels ỹU of unlabeled objects and their associated variance using local information;
Given pseudo-labels, optimizing the objective function (4.1) subject to constraint (4.5).
We introduce the details of each stage in the next two subsections.
4.2 Pseudo-label Estimation
There are several approaches to determining pseudo-labels. In this study, pseudo-labels are estimated based on the position of unlabeled objects. Specifically, we only consider neighborhood information based on the equal combination of all homogeneous sub-networks. The combined relation matrix can be defined as $R = \sum_{k=1}^{K} R^{(k)}$, where the (u,v)-th element is $R_{uv} = \sum_{k=1}^{K} R_{uv}^{(k)}$. Then the labeled q-nearest neighborhood of an unlabeled object xn+v, based on this combination of sub-networks, can be defined as
$$\mathcal{N}_q(x_{n+v}) = \{x_u : 1 \le u \le n,\ R_{u,n+v} \text{ is among the } q \text{ largest values of } R_{1,n+v}, \ldots, R_{n,n+v}\}.$$
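We use a simple relational model to describe this local information and obtain the pseudo-label of xn+v and the variance of this distribution. Given the labeled q-nearest neighborhood of xn+v, suppose yn+v follows a discrete distribution where for xu ∈ 𝒩q(xn+v),
$$P(y_{n+v} = y_u) = \frac{R_{u,n+v}}{\sum_{x_j \in \mathcal{N}_q(x_{n+v})} R_{j,n+v}}.$$
Then ỹn+v can be assigned to the mean of this distribution, i.e.
$$\tilde{y}_{n+v} = \sum_{x_u \in \mathcal{N}_q(x_{n+v})} y_u \, P(y_{n+v} = y_u). \tag{4.6}$$
The variance of this distribution can be calculated as
$$\sigma_{n+v}^2 = \sum_{x_u \in \mathcal{N}_q(x_{n+v})} (y_u - \tilde{y}_{n+v})^2 \, P(y_{n+v} = y_u). \tag{4.7}$$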
From (4.7) we notice that if the labels of xn+v's neighbors tend to be similar, $\sigma_{n+v}^2$ tends to be small. The pseudo-label of xn+v thus tends to be reliable, and vice versa.
Based on the above description, the pseudo-labels ỹU of unlabeled objects and their associated variances can be computed directly.
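The pseudo-label stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the combined relation matrix `R` orders the n labeled objects first, that neighbor weights are proportional to the combined similarity, and that a uniform fallback is used when an unlabeled object has no positive similarity to any labeled object (the fallback is an assumption added here). The function name `pseudo_labels` is hypothetical.

```python
import numpy as np

def pseudo_labels(R, y_labeled, q=5):
    """Estimate pseudo-labels and variances for unlabeled objects.

    R: (n+m) x (n+m) combined similarity matrix, labeled objects first.
    y_labeled: length-n vector of known labels.
    Returns (y_tilde, var), each of length m.
    """
    R = np.asarray(R, dtype=float)
    y = np.asarray(y_labeled, dtype=float)
    n = y.shape[0]
    m = R.shape[0] - n
    y_tilde = np.zeros(m)
    var = np.zeros(m)
    for v in range(m):
        sims = R[n + v, :n]                          # similarity to labeled objects
        nbrs = np.argsort(-sims, kind="stable")[:q]  # q most similar labeled objects
        weights = sims[nbrs]
        total = weights.sum()
        if total > 0:
            probs = weights / total
        else:
            # Assumption: fall back to uniform weights if all similarities are 0.
            probs = np.full(len(nbrs), 1.0 / len(nbrs))
        y_tilde[v] = probs @ y[nbrs]                       # mean of the distribution
        var[v] = probs @ (y[nbrs] - y_tilde[v]) ** 2       # its variance
    return y_tilde, var

# Toy example: 3 labeled objects, 1 unlabeled object.
R = np.array([[1.0, 0.0, 0.0, 0.5],
              [0.0, 1.0, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0],
              [0.5, 0.5, 0.0, 1.0]])
y_tilde, var = pseudo_labels(R, [1.0, 3.0, 10.0], q=2)
```

Here the unlabeled object is equally similar to the first two labeled objects, so its pseudo-label is their average and the variance reflects how far apart their labels are. A variance of exactly zero is possible when all neighbors share one label, so any later division by the variance needs a safeguard.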
4.3 Optimization Procedure
In this section, we will discuss the algorithm used in the optimization procedure.
In the objective function (4.1), the first item can be explained by matrix transformation. For each relation matrix R(k),
D(k) is a diagonal matrix where the (u, u)-th element is the summation of u-th row in R(k);
S(k) =[D(k)]−1/2R(k) [D(k)]−1/2;
ℒ(k) = I − S(k) = I − [D(k)]−1/2R(k) [D(k)]−1/2 is the normalized Laplacian matrix for the topology-shrinking sub-graph $G_T^{(k)}$.
Thus we have
$$\Omega(w; f) = \sum_{k=1}^{K} w_k \, f^T \mathcal{L}^{(k)} f, \tag{4.8}$$
which indicates that Ω(w;f) can be regarded as a linear combination of normalized Laplacian regularizers based on the different sub-graphs $G_T^{(k)}$.
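The matrix construction above can be sketched in a few lines of NumPy. This is an illustrative sketch that assumes each regularizer takes the standard quadratic form f^T ℒ(k) f and that isolated objects (zero row sum) simply contribute no normalized similarity; the function names are hypothetical.

```python
import numpy as np

def normalized_laplacian(R):
    """Return L = I - D^{-1/2} R D^{-1/2} for a symmetric relation matrix R."""
    R = np.asarray(R, dtype=float)
    d = R.sum(axis=1)                              # row sums, the diagonal of D
    inv_sqrt = np.zeros_like(d)
    np.divide(1.0, np.sqrt(d), out=inv_sqrt, where=d > 0)
    S = inv_sqrt[:, None] * R * inv_sqrt[None, :]  # normalized similarity S
    return np.eye(R.shape[0]) - S

def omega(w, laplacians, f):
    """Weighted sum of the quadratic forms f^T L_k f."""
    f = np.asarray(f, dtype=float)
    return sum(wk * (f @ Lk @ f) for wk, Lk in zip(w, laplacians))

# Toy example: two objects joined by a single link of weight 1.
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
L = normalized_laplacian(R)
value = omega([2.0], [L], [1.0, 3.0])
```

For this two-node graph the quadratic form reduces to the squared difference of the two predictions, which makes the smoothness interpretation explicit: the penalty grows as linked objects receive more different values.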
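From (4.8), (4.3) and (4.4), the objective function (4.1) can be re-written as
$$J(w; f) = \sum_{k=1}^{K} w_k \, f^T \mathcal{L}^{(k)} f + \alpha_1 (f_L - y_L)^T (f_L - y_L) + \alpha_2 (f_U - \tilde{y}_U)^T \Sigma^{-1} (f_U - \tilde{y}_U).$$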
Then we can use the block coordinate descent approach [13], which will keep reducing the value of the objective function, to iteratively update f and w. The first step is to fix f and update w to minimize J(w;f); the second step is to fix w and update f to minimize J(w;f). We could iteratively do these two steps until convergence. Here we provide two theorems to help us deduce the algorithm, which can be proved using similar techniques in previous graph regularized regression studies [1, 5].
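Theorem 4.1. Suppose f is fixed. Then the objective problem J(w; f) with constraint function δ(w) = 0 is a convex optimization problem. The global optimal solution is given by
$$w_k = -\log \frac{f^T \mathcal{L}^{(k)} f}{\sum_{j=1}^{K} f^T \mathcal{L}^{(j)} f}, \quad k = 1, 2, \ldots, K. \tag{4.9}$$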
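A short derivation sketch shows how a closed form for w can arise. It rests on two assumptions that are chosen to be consistent with the initialization wk = log(K) used in the iterative algorithm: the regularizer is linear in w, with coefficients $a_k = f^T \mathcal{L}^{(k)} f > 0$, and the constraint is $\sum_k \exp(-w_k) = 1$. The terms C1 and C2 do not depend on w, so they drop out when f is fixed.

```latex
\begin{align*}
\Lambda(w, \lambda) &= \sum_{k=1}^{K} w_k a_k
   + \lambda \Big( \sum_{k=1}^{K} e^{-w_k} - 1 \Big),
   \qquad a_k = f^{T} \mathcal{L}^{(k)} f > 0, \\
\frac{\partial \Lambda}{\partial w_k} &= a_k - \lambda e^{-w_k} = 0
   \;\Longrightarrow\; e^{-w_k} = \frac{a_k}{\lambda}, \\
\sum_{k=1}^{K} e^{-w_k} = 1
   &\;\Longrightarrow\; \lambda = \sum_{j=1}^{K} a_j, \\
w_k &= \log \frac{\sum_{j=1}^{K} a_j}{a_k}
     = -\log \frac{f^{T} \mathcal{L}^{(k)} f}{\sum_{j=1}^{K} f^{T} \mathcal{L}^{(j)} f}.
\end{align*}
```

Under these assumptions, a sub-graph on which f is smoother (smaller $a_k$) receives a larger weight, and when all $a_k$ are equal every weight is log(K).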
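Theorem 4.2. Suppose w is fixed. Then the objective problem J(w; f) is a convex optimization problem. The global optimal solution is given by solving the following linear system:
$$\begin{pmatrix} \sum_{k} w_k \mathcal{L}_{LL}^{(k)} + \alpha_1 I_n & \sum_{k} w_k \mathcal{L}_{LU}^{(k)} \\ \sum_{k} w_k \mathcal{L}_{UL}^{(k)} & \sum_{k} w_k \mathcal{L}_{UU}^{(k)} + \alpha_2 \Sigma^{-1} \end{pmatrix} \begin{pmatrix} f_L \\ f_U \end{pmatrix} = \begin{pmatrix} \alpha_1 y_L \\ \alpha_2 \Sigma^{-1} \tilde{y}_U \end{pmatrix}, \tag{4.10}$$
where
$$\mathcal{L}^{(k)} = \begin{pmatrix} \mathcal{L}_{LL}^{(k)} & \mathcal{L}_{LU}^{(k)} \\ \mathcal{L}_{UL}^{(k)} & \mathcal{L}_{UU}^{(k)} \end{pmatrix}$$
is partitioned according to labeled and unlabeled objects.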
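For small networks, the step with w fixed can also be carried out with a direct solve. The sketch below assumes the objective has the form Σk wk f^T ℒ(k) f + α1‖fL − yL‖² + α2 (fU − ỹU)^T Σ⁻¹ (fU − ỹU), so that setting the gradient to zero gives a symmetric linear system. The function name `solve_f` and the small constant `eps` that guards against zero variances are assumptions added for illustration.

```python
import numpy as np

def solve_f(laplacians, w, y_labeled, y_pseudo, var, alpha1, alpha2, eps=1e-8):
    """Solve (sum_k w_k L_k + A) f = A t, where A is diagonal with alpha1 for
    labeled objects and alpha2 / variance for unlabeled objects, and t stacks
    the labels and the pseudo-labels."""
    y_labeled = np.asarray(y_labeled, dtype=float)
    y_pseudo = np.asarray(y_pseudo, dtype=float)
    var = np.asarray(var, dtype=float)
    n = len(y_labeled)
    L = sum(wk * np.asarray(Lk, dtype=float) for wk, Lk in zip(w, laplacians))
    a = np.concatenate([np.full(n, alpha1), alpha2 / (var + eps)])
    target = np.concatenate([y_labeled, y_pseudo])
    return np.linalg.solve(L + np.diag(a), a * target)

# Toy example: one labeled and one unlabeled object joined by a single link.
L1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
f = solve_f([L1], [1.0], [1.0], [3.0], [1.0], alpha1=1.0, alpha2=1.0, eps=0.0)
```

In this toy case the system is [[2, −1], [−1, 2]] f = [1, 3], so both predictions are pulled from their targets toward each other. A dense solve costs cubic time in n + m, which is one reason an iterative scheme is attractive for larger sub-networks.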
We notice that it is nontrivial to obtain a closed form solution by jointly solving equations (4.9) and (4.10). Although given w, (4.10) can be solved directly, for time and space efficiency, we can provide an iterative algorithm as well, which can be described as
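Determine the pseudo-label ỹn+v and the corresponding variance $\sigma_{n+v}^2$ based on (4.6) and (4.7);
Initialize t = 0, w1(0) = w2(0) = … = wK(0) = log(K), and the initial estimate f(0);
Suppose we have w(t) and f(t); then use f(t) to calculate w(t + 1) based on (4.9), and use w(t + 1) to update f(t + 1) based on the following rules:
$$f_u^{(t+1)} = \frac{\sum_{k=1}^{K} w_k^{(t+1)} \left(S^{(k)} f^{(t)}\right)_u + \alpha_1 y_u}{\sum_{k=1}^{K} w_k^{(t+1)} + \alpha_1}, \quad u = 1, \ldots, n,$$
$$f_{n+v}^{(t+1)} = \frac{\sum_{k=1}^{K} w_k^{(t+1)} \left(S^{(k)} f^{(t)}\right)_{n+v} + \alpha_2 \tilde{y}_{n+v} / \sigma_{n+v}^2}{\sum_{k=1}^{K} w_k^{(t+1)} + \alpha_2 / \sigma_{n+v}^2}, \quad v = 1, \ldots, m,$$
where (·)u indicates the u-th element of a vector.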
Repeat previous procedure until w(t) and f(t) converge.
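The alternating procedure can be sketched as below. This is a hedged illustration rather than the authors' implementation: it assumes the constraint Σk exp(−wk) = 1, a closed-form weight update wk = log(Σj aj / ak) with ak = f^T (I − S(k)) f, and a Jacobi-style fixed-point sweep for f. Starting f from the labels and pseudo-labels, the `eps` safeguards, and the function name `grempt_iterate` are all assumptions introduced here.

```python
import numpy as np

def grempt_iterate(S_list, y_labeled, y_pseudo, var, alpha1, alpha2,
                   max_iter=100, tol=1e-6, eps=1e-8):
    """Alternate between a closed-form update of w and one sweep over f.

    S_list: normalized similarity matrices S_k, labeled objects first.
    """
    y_labeled = np.asarray(y_labeled, dtype=float)
    y_pseudo = np.asarray(y_pseudo, dtype=float)
    var = np.asarray(var, dtype=float)
    n, K = len(y_labeled), len(S_list)
    target = np.concatenate([y_labeled, y_pseudo])
    # Per-object fitting strength: alpha1 (labeled), alpha2 / variance (unlabeled).
    a = np.concatenate([np.full(n, alpha1), alpha2 / (var + eps)])
    f = target.copy()                 # assumption: start from labels and pseudo-labels
    w = np.full(K, np.log(K))         # equal initial weights
    for _ in range(max_iter):
        # Step 1: fix f, update w from the smoothness of f on each sub-graph.
        smooth = np.array([f @ f - f @ (S @ f) for S in S_list])  # f^T (I - S) f
        smooth = np.maximum(smooth, eps)
        w_new = np.log(smooth.sum() / smooth)
        # Step 2: fix w, apply one fixed-point sweep to f.
        Sf = sum(wk * (S @ f) for wk, S in zip(w_new, S_list))
        f_new = (Sf + a * target) / (w_new.sum() + a)
        change = max(np.abs(w_new - w).max(), np.abs(f_new - f).max())
        w, f = w_new, f_new
        if change < tol:
            break
    return f, w

# Toy example: 4 objects (2 labeled, 2 unlabeled) and 2 meta-path sub-graphs.
S1 = 0.5 * np.array([[0, 1, 1, 0], [1, 0, 0, 1],
                     [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
S2 = np.array([[0, 1, 0, 0], [1, 0, 0, 0],
               [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
f, w = grempt_iterate([S1, S2], [0.2, 0.8], [0.3, 0.7], [0.01, 0.04],
                      alpha1=10.0, alpha2=1.0)
```

Each sweep only needs matrix-vector products with the S(k), so with sparse matrices the cost per iteration is proportional to the number of edges plus the number of objects times K. The sketch uses dense arrays for brevity.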
4.4 Time Complexity Analysis
In this section we only consider the computational complexity of the above iterative method for optimizing the objective function. Suppose m + n is the number of objects, K is the number of selected meta-paths, and $|E_T^{(1)}|, \ldots, |E_T^{(K)}|$ are the numbers of edges of the topology-shrinking sub-networks given meta-paths P1, …, PK. Suppose the relation matrix R(k) and its corresponding normalized matrix S(k) are pre-computed before the learning procedure. Then initialization takes O(m+n+K) time. In each iteration, we need to scan each edge in $G_T^{(k)}$ to perform the matrix multiplication, and scan each object and each type of meta-path for the other arithmetic operations. Therefore the objective function optimization costs $O\big(N(\sum_{k=1}^{K} |E_T^{(k)}| + K(m+n))\big)$ time, where N is the number of iterations. Since the number of objects in the topology-shrinking sub-networks is much smaller than the number of objects in the original HIN and the iterative procedure converges rapidly (N < 20 in our experiments), our algorithm is scalable in both time and space.
5 Experiment
5.1 Dataset
We applied our model to two sets of data – data from the Internet Movie Database (IMDb) and data from the DBLP Computer Science Bibliography.
The IMDb data used in this study are extracted from the IMDb interface and from Box Office Mojo, which is affiliated with IMDb. We keep the data related to movies whose names can be exactly matched across these two sources. For the combined dataset, we only keep the movies released in 2000-2013 with at least 1000 user votes on the IMDb website, together with the related actors, actresses, directors, writers, genres and studios. In this dataset, the target variable is log(box office sales), which is associated with movies. The meta-paths we used in the IMDb network are movie-actor-movie (M-A1-M), movie-actress-movie (M-A2-M), movie-director-movie (M-D-M), movie-genre-movie (M-G-M), and movie-writer-movie (M-W-M). Notice that in the experiment, we only keep the actors, actresses, directors, genres and writers that each appear in at least two movies, and the movies that are related to these objects. The final IMDb network used in the experiment contains 3300 movies, 18845 actors, 9065 actresses, 746 directors, 20 genres, 197 studios and 1623 writers. To address the temporal nature of movies, we labeled four different sets of movies based on their release years. The summary of these four datasets is shown in Table 1.
The DBLP data used in this study were collected by ArnetMiner [14] and contain all papers from DBLP and a fraction of the citation relationships between papers. The latest version was updated in Sep. 2013. We keep papers published in 2009-2013 in data mining and machine learning related venues, together with the related authors, venues and citation relationships. In this dataset, the target variable is log(#citation + 1), where for a particular author, #citation is the total citation number of the papers he/she published in 2009-2013. We only consider authors who published papers in 2009 and published at least two papers in 2009-2013. The meta-paths used in the DBLP network are author-paper-author (A-P-A), author-venue-author (A-V-A), and author-paper-(cited by)-paper-(cite)-paper-author (A-P←P→P-A). The final DBLP network used in the experiment contains 3332 authors, 1289 papers, 1046 terms and 27 venues. For the DBLP data, we randomly labeled four different sets of authors according to different label proportions. To address the cases where labels are limited, we labeled only a small portion of the data (10% and 5%) in the last two datasets. The summary of these four datasets is shown in Table 1.
Table 1. Summary of IMDb datasets (numbers in parentheses indicate release years) and DBLP datasets.

| IMDb | Number of labeled objects | Number of unlabeled objects | Percentage of labeled objects |
|---|---|---|---|
| dataset1 | 3067 (2000–2012) | 233 (2013–2013) | 92.94% |
| dataset2 | 2820 (2000–2011) | 480 (2012–2013) | 85.45% |
| dataset3 | 2578 (2000–2010) | 722 (2011–2013) | 78.12% |
| dataset4 | 2345 (2000–2009) | 955 (2010–2013) | 71.06% |

| DBLP | Number of labeled objects | Number of unlabeled objects | Percentage of labeled objects |
|---|---|---|---|
| dataset1 | 3017 | 315 | 90.55% |
| dataset2 | 1666 | 1666 | 50.00% |
| dataset3 | 334 | 2998 | 10.02% |
| dataset4 | 167 | 3165 | 5.01% |
5.2 Preprocessing
For both the IMDb data and the DBLP data, we work with the log transformation of the original response variable (box office sales or the number of citations) because of their wide ranges. In addition, to make the parameters α1 and α2 comparable across the two datasets, we normalize the original label values yu, u = 1, …, n of log(box office sales) and log(#citation + 1) to occupy the unit interval [0, 1] by
$$y_u \leftarrow \frac{y_u - y_{\min}}{y_{\max} - y_{\min}}, \quad y_{\min} = \min_{1 \le j \le n} y_j, \quad y_{\max} = \max_{1 \le j \le n} y_j,$$
and use these values as inputs yu, u = 1, 2, …, n. When the outputs fn+v, v = 1, …, m are obtained, we use the inverse transformation
$$f_{n+v} \leftarrow f_{n+v} \, (y_{\max} - y_{\min}) + y_{\min}$$
to transform them back. These transformed predicted values are used in the model evaluation procedure.
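The normalization and its inverse can be sketched as follows, assuming plain min-max scaling fitted on the labeled values only. The helper names are hypothetical, and the box office figures in the example are invented for illustration.

```python
import numpy as np

def fit_minmax(y_labeled):
    """Return (lo, hi), the minimum and maximum of the labeled values."""
    y = np.asarray(y_labeled, dtype=float)
    return y.min(), y.max()

def to_unit(y, lo, hi):
    """Map values to the unit interval using the labeled range."""
    return (np.asarray(y, dtype=float) - lo) / (hi - lo)

def from_unit(f, lo, hi):
    """Inverse transformation, applied to model outputs before evaluation."""
    return np.asarray(f, dtype=float) * (hi - lo) + lo

# Toy example on a log scale (made-up sales figures).
y = np.log(np.array([1e6, 5e7, 2e8]))
lo, hi = fit_minmax(y)
z = to_unit(y, lo, hi)
back = from_unit(z, lo, hi)
```

Because the range is fitted on labeled objects only, predictions for unlabeled objects can fall slightly outside [0, 1] before the inverse transformation; the inverse mapping handles such values without any change.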
5.3 Models For Comparison
We compare our graph regularized meta-path based transductive regression model (Grempt) with six different models – Lasso, RN_ntp, RN, TRnloc_ntp, TRnloc and Grempt_ntp.
Lasso [15]. To show the necessity of the transduction setting for network data, we compare our model to a state-of-the-art inductive regression model, Lasso, which is also regarded as the baseline method in this study. When applying Lasso regression to the IMDb data, objects other than movies are treated as categorical variables associated with movies. Similarly, for the DBLP data, objects other than authors, as well as citation relationships, are treated as categorical variables.
k-nearest Relational Neighbor Estimation. This relational neighbor prediction model, which only involves locally estimated labels, shares the idea of the Relational Neighbor Classifier (RN) [8]. However, we only use the k-nearest neighbors to estimate the labels of unlabeled objects, which is the same as the pseudo-label estimation method described previously. We consider two different k-nearest RN models: the RN model that disregards the different types of meta-path (RN_ntp) and the one in which the types of meta-path are considered (RN). For RN_ntp, we treat all types of meta-path as a single type so that the input HIN becomes a homogeneous one. We then calculate the relation matrix based on this unified meta-path; the same treatment applies to the other non-type (_ntp) models used for comparison.
Transductive Regression Without Local Estimation. This transductive regression model, which does not use locally estimated labels, is equivalent to our Grempt model without the third item, i.e. α2 ≡ 0. This two-term objective function is similar to two state-of-the-art HIN classification methods, GNetMine [1] and RankClass [2]. As with the previous method, we consider two scenarios: all meta-paths are regarded as the same type (TRnloc_ntp), and different types of meta-paths are distinguished (TRnloc).
Homogeneous Grempt Model. To validate the different contributions of different meta-paths, we compare our standard Grempt model with a Grempt_ntp model where meta-paths are regarded as in the same type.
5.4 Evaluation Measure
All of the models are evaluated by the mean absolute prediction error (MAE), which has the same scale as the data and is relatively insensitive to outliers. For the unlabeled objects xn+v, v = 1, …, m, we have
$$\mathrm{MAE} = \frac{1}{m} \sum_{v=1}^{m} \left| f_{n+v} - y_{n+v} \right|.$$
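The evaluation measure is simple to compute; a minimal sketch follows. The second helper assumes that the "Improvement" columns in the result tables report the relative MAE reduction over the Lasso baseline, which is an interpretation and is not stated explicitly in the text. Both function names are hypothetical.

```python
import numpy as np

def mean_absolute_error(pred, true):
    """Mean absolute difference between predictions and true values."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.mean(np.abs(pred - true)))

def relative_improvement(baseline_mae, model_mae):
    """Assumed definition: fraction by which the model reduces the baseline MAE."""
    return (baseline_mae - model_mae) / baseline_mae

mae = mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
gain = relative_improvement(2.0, 1.5)
```

Under this assumed definition, a baseline MAE of 2.824 and a model MAE of 1.912 give an improvement of about 32.3%, which agrees with the value reported for Grempt on IMDb Dataset1 up to the rounding of the displayed MAE values.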
5.5 Performance
In RN_ntp, RN and the pseudo-label estimation in Grempt_ntp and Grempt, we use 5-nearest neighbors on both the IMDb data and the DBLP data, with an equally weighted combination of the relation matrices of the different homogeneous sub-networks. We notice that label accuracy is very important in our model, since MAE decreases as α1 increases on all the datasets. However, the relative importance of the local estimates, controlled by α2, differs between the IMDb data and the DBLP data. For the DBLP network, the importance of the pseudo-label also varies with the percentage of labeled objects. α2 can be determined by cross-validation on each dataset. For a sparse network like IMDb, local estimates are more important, so we suggest relatively large values as candidates for α2. A dense network like DBLP with a small percentage of labeled objects has a similar property. For a dense network like DBLP with sufficient labeled objects, however, global consistency is more significant and thus relatively small values of α2 are suggested.
Experimental results on the IMDb datasets and the DBLP datasets are shown in Table 2 and Table 3 respectively. Here, in TRnloc_ntp, TRnloc, Grempt_ntp and our model Grempt, we set α1 = 2000. For Grempt_ntp and Grempt, we set α2 = 3 for all four IMDb datasets, α2 = 0.005 for DBLP dataset1 and dataset2, and α2 = 1 for DBLP dataset3 and dataset4. These parameters are not the optimal setting for the Grempt model, but they are enough to show the superiority of our model. From these two tables, we can conclude that our Grempt model has the best performance on both the IMDb datasets and the DBLP datasets. The running times of the Grempt model are shown in Table 4, which indicates that each single experiment of our method can be executed within seconds.
Table 2. Results of Prediction Error on IMDb Datasets.

| Method | Dataset1 (%label=92.94%) MAE | Dataset1 Improvement | Dataset2 (%label=85.45%) MAE | Dataset2 Improvement |
|---|---|---|---|---|
| Lasso | 2.824 | Baseline | 2.878 | Baseline |
| RN_ntp | 2.104 | 25.49% | 2.163 | 24.85% |
| RN | 2.031 | 28.08% | 2.064 | 28.31% |
| TRnloc_ntp | 2.213 | 21.63% | 2.196 | 23.70% |
| TRnloc | 2.858 | -1.20% | 3.059 | -6.26% |
| Grempt_ntp | 2.095 | 25.82% | 2.144 | 25.52% |
| Grempt | 1.912 | 32.28% | 1.941 | 32.57% |

| Method | Dataset3 (%label=78.12%) MAE | Dataset3 Improvement | Dataset4 (%label=71.06%) MAE | Dataset4 Improvement |
|---|---|---|---|---|
| Lasso | 2.929 | Baseline | 2.761 | Baseline |
| RN_ntp | 2.131 | 27.24% | 2.096 | 24.10% |
| RN | 2.079 | 29.02% | 2.025 | 26.65% |
| TRnloc_ntp | 2.230 | 23.85% | 2.232 | 19.19% |
| TRnloc | 3.272 | -11.72% | 3.362 | -21.75% |
| Grempt_ntp | 2.115 | 27.77% | 2.084 | 24.55% |
| Grempt | 1.969 | 32.77% | 1.916 | 30.61% |
Table 3. Results of Prediction Error on DBLP Datasets.

| Method | Dataset1 (%label=90.55%) MAE | Dataset1 Improvement | Dataset2 (%label=50.00%) MAE | Dataset2 Improvement |
|---|---|---|---|---|
| Lasso | 0.7410 | Baseline | 0.8152 | Baseline |
| RN_ntp | 0.6733 | 9.14% | 0.7886 | 3.27% |
| RN | 0.6689 | 9.73% | 0.8196 | -0.54% |
| TRnloc_ntp | 0.8551 | -15.40% | 0.94 | -15.31% |
| TRnloc | 0.6359 | 14.18% | 0.7754 | 4.89% |
| Grempt_ntp | 0.8213 | -10.85% | 0.9139 | -12.11% |
| Grempt | 0.6352 | 14.28% | 0.7745 | 5.00% |

| Method | Dataset3 (%label=10.02%) MAE | Dataset3 Improvement | Dataset4 (%label=5.01%) MAE | Dataset4 Improvement |
|---|---|---|---|---|
| Lasso | 1.1935 | Baseline | 0.9673 | Baseline |
| RN_ntp | 0.9217 | 22.78% | 0.958 | 0.96% |
| RN | 0.9631 | 19.31% | 0.9687 | -0.15% |
| TRnloc_ntp | 1.0143 | 15.02% | 1.0788 | -11.53% |
| TRnloc | 0.9533 | 20.13% | 1.0735 | -10.98% |
| Grempt_ntp | 0.9212 | 22.82% | 0.9531 | 1.47% |
| Grempt | 0.9023 | 24.40% | 0.9342 | 3.42% |
Table 4. Running Time of Single Experiment of Grempt Model on IMDb and DBLP Datasets.

| IMDb | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 |
|---|---|---|---|---|
| Running time (s) | 8.876 | 10.956 | 9.084 | 10.799 |

| DBLP | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 |
|---|---|---|---|---|
| Running time (s) | 5.5640 | 6.345 | 6.095 | 6.392 |

Some representative examples from IMDb dataset4 are selected to show the predictions obtained from our model; they are displayed in Table 5. We thus conclude that the Grempt model has the potential to predict numeric variables in heterogeneous information networks. Objects whose predicted values differ greatly from the true values may need to be analyzed case by case.
Table 5.
Prediction Examples of log(box office sales) from Grempt model applied on IMDb dataset4.
| IMDb dataset 4 | log(box office sales) | - |
|---|---|---|
| Name | Groundtruth | Prediction |
|
| ||
| The Hobbit: An Unexpected Journey | 19.53 | 18.68 |
| The Hobbit: The Desolation of Smaug | 19.37 | 18.80 |
| The Hunger Games | 19.83 | 17.44 |
| The Hunger Games: Catching Fire | 19.87 | 17.67 |
| Kung Fu Panda 2 | 18.92 | 18.55 |
| Nebraska | 16.69 | 16.27 |
| Before Midnight | 15.91 | 14.71 |
| Shahid | 9.41 | 13.05 |
| Udaan | 8.92 | 13.15 |
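To make the size of the errors in Table 5 concrete, the short sketch below computes the absolute error of each example on the log scale and the mean over the nine listed movies. The last step converts a log-scale difference into a ratio of predicted to true sales; it assumes that the logarithm is the natural logarithm, which is not stated explicitly here. The variable names are illustrative only.

```python
import math

# (title, ground truth, prediction) copied from Table 5; values are log(box office sales).
examples = [
    ("The Hobbit: An Unexpected Journey", 19.53, 18.68),
    ("The Hobbit: The Desolation of Smaug", 19.37, 18.80),
    ("The Hunger Games", 19.83, 17.44),
    ("The Hunger Games: Catching Fire", 19.87, 17.67),
    ("Kung Fu Panda 2", 18.92, 18.55),
    ("Nebraska", 16.69, 16.27),
    ("Before Midnight", 15.91, 14.71),
    ("Shahid", 9.41, 13.05),
    ("Udaan", 8.92, 13.15),
]

# Absolute error of each prediction on the log scale.
abs_errors = {title: abs(truth - pred) for title, truth, pred in examples}

# Mean absolute error over these nine examples only (not over the whole dataset).
mae = sum(abs_errors.values()) / len(abs_errors)

# Example with the largest log-scale error.
worst = max(abs_errors, key=abs_errors.get)

# Assumption: natural logarithm, so exp(pred - truth) is predicted sales / true sales.
sales_ratio = {title: math.exp(pred - truth) for title, truth, pred in examples}

print(f"MAE over the listed examples: {mae:.4f}")
print(f"Largest error: {worst} ({abs_errors[worst]:.2f})")
```

In this sample, the high-grossing titles are all underestimated, while the two titles with the smallest ground-truth values are overestimated. These two low-grossing titles are the kind of objects that may need case-by-case analysis.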
We notice that traditional regression methods such as Lasso cannot predict the value of the target variable precisely because they lack the ability to capture the structural information of the network. Methods that use only local information (RN_ntp, RN) and methods that use only global consistency (TRnloc_ntp, TRnloc) perform differently on the IMDb and DBLP datasets because the two networks have different structural characteristics. Since the IMDb network is much sparser than the DBLP network, local information in the IMDb network could be more reliable than global consistency, whereas the reverse holds in the DBLP network. Our model, however, balances these two kinds of consistency and therefore yields a better overall result. In addition, the poor performance of RN_ntp, TRnloc_ntp and Grempt_ntp indicates that heterogeneous structures cannot be ignored in graph-based numerical prediction problems.
The weight vector w over the different meta-paths, obtained from the iterative algorithm on the IMDb and DBLP data, is plotted in Figure 3. For the IMDb network, movie-actor-movie and movie-actress-movie have a more significant influence on the box office sales of a movie than the other meta-paths, and movie-genre-movie is the least important among all selected meta-paths. For the DBLP network, author-paper-author and author-venue-author are more significant than author-paper-(cited by)-paper-(cite)-paper-author with respect to the total citation number of an author. Moreover, Figure 3 shows that the contributions of the important meta-paths increase as the number of labeled objects decreases, while the contributions of relatively unimportant meta-paths decrease.
Figure 3.
Weights for meta-paths of IMDb datasets and DBLP datasets from Grempt Model.
6 Conclusion and Future Work
In this paper, we proposed a meta-path based transductive regression model in HIN which incorporates the ideas of global graph-based consistency and local estimation. We obtained the best performance among all candidate frameworks for box office sales prediction in IMDb network and total citation number prediction in DBLP network.
This initial research on numerical prediction in HIN can be improved in several ways. In many real-world cases, people may need more accurate results for important objects, such as blockbuster movies and highly cited authors. Ranking information and preferences could therefore be introduced into transductive regression models. We also notice that some variables may correlate with each other (e.g. box office and rating score). Another direction is therefore to generalize this model from the univariate case (e.g. predicting box office only) to the multivariate case (e.g. predicting box office and rating score jointly) by exploiting the correlation between variables.
Acknowledgments
Research was sponsored in part by the U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), the Army Research Office under Cooperative Agreement No. W911NF-13-1-0193, National Science Foundation IIS-1017362, IIS-1320617, and IIS-1354329, HDTRA1-10-1-0120, NIH Big Data to Knowledge (BD2K) (U54), and MIAS, a DHS-IDS Center for Multimodal Information Access and Synthesis at UIUC.
Information courtesy of IMDb and Box Office Mojo, used with permission.
Footnotes
The weight of each shrinking homogeneous sub-network can be given by prior knowledge or determined by some validation method. However, these parameters are usually tricky to tune. In this study, we use an equal combination as a straightforward example, and the experiments indicate that it is enough to show the advantage of using pseudo-labels.
AAAI, CIKM, CVPR, ECIR, ECML, EDBT, ICDE, ICDM, ICML, IJCAI, KDD, PAKDD, PKDD, PODS, SDM, SIGIR, SIGMOD, VLDB, WWW, WSDM, SIGMOD record, ACM trans. database syst., data knowl. eng., data min. knowl. discov., IEEE data eng. bull., IEEE trans. knowl. data eng., j. database manag., journal of machine learning research, machine learning, knowl. inf. syst., SIGKDD explorations, VLDB j.
References
- 1. Ji M, Sun Y, Danilevsky M, Han J, Gao J. Graph regularized transductive classification on heterogeneous information networks. Machine Learning and Knowledge Discovery in Databases. Springer; 2010. pp. 570–586.
- 2. Ji M, Han J, Danilevsky M. Ranking-based classification of heterogeneous information networks. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2011. pp. 1298–1306.
- 3. Luo C, Guan R, Wang Z, Lin C. Hetpath-mine: A novel transductive classification algorithm on heterogeneous information networks. Advances in Information Retrieval. 2014:210–221.
- 4. Kong X, Yu PS, Ding Y, Wild DJ. Meta path-based collective classification in heterogeneous information networks. Proceedings of the 21st ACM international conference on Information and knowledge management. ACM; 2012. pp. 1567–1571.
- 5. Cortes C, Mohri M. On transductive regression. NIPS. 2006:305–312.
- 6. Cortes C, Mohri M, Pechyony D, Rastogi A. Stability of transductive regression algorithms. Proceedings of the 25th international conference on Machine learning. ACM; 2008. pp. 176–183.
- 7. Zhu X. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison. 2006;2:3.
- 8. Macskassy SA, Provost F. A simple relational classifier. DTIC Document, Tech Rep. 2003.
- 9. Dudani SA. The distance-weighted k-nearest-neighbor rule. Systems, Man and Cybernetics, IEEE Transactions on. 1976;4:325–327.
- 10. Berlinet A, Thomas-Agnan C. Reproducing kernel Hilbert spaces in probability and statistics. Vol. 3. Springer; 2004.
- 11. Sun Y, Han J, Yan X, Yu PS, Wu T. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. VLDB. 2011;11.
- 12. Sun Y, Han J, Aggarwal CC, Chawla NV. When will it happen?: relationship prediction in heterogeneous information networks. Proceedings of the fifth ACM international conference on Web search and data mining. ACM; 2012. pp. 663–672.
- 13. Bertsekas DP. Nonlinear programming. 1999.
- 14. Tang J, Zhang J, Yao L, Li J, Zhang L, Su Z. Arnetminer: extraction and mining of academic social networks. Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2008. pp. 990–998.
- 15. Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B (Methodological). 1996:267–288.

