Abstract
Traditional regression methods typically consider only covariate information and assume that the observations are mutually independent samples. However, in many modern applications, samples come from individuals connected by a network. We present a risk minimization formulation for learning from both covariates and network structure in the context of graph kernel regularization. The formulation involves a loss function with a penalty term. This penalty can be used not only to encourage similarity between linked nodes but also to improve on traditional regression models. Furthermore, the penalty can be used with many loss-based predictive methods, such as linear regression with squared loss and logistic regression with log-likelihood loss. Simulations evaluating the performance of this model in low and high dimensions show that our proposed approach outperforms all other benchmarks. We verify this for uniform graph, nonuniform graph, balanced-sample, and unbalanced-sample datasets. The approach was applied to predicting the response values on a ‘follow’ social network of Tencent Weibo users and on two citation networks (Cora and CiteSeer). Each instance verifies that the proposed method, which combines covariate information and link structure through graph kernel regularization, can improve predictive performance.
Keywords: Graph regularization, node effect, kernel function, prediction, regression
1. Introduction
Recent decades have seen significant advances in data collection, and many modern applications now record graph data describing the relationships between individuals in a system. This information is often collected along with covariate information on each unit of the system [17]. For example, the graph data we focus on in this paper include social and citation networks. A large body of work on traditional regression methods takes into consideration only information about covariates and assumes that samples are mutually independent. However, samples usually come from individuals connected by a network. Much of the literature also analyzes the graph structure of the underlying network, such as work on community detection [1,6]. If the covariates of each node can be used to analyze the network itself, it becomes more efficient to find the best communities [2,15,25]. In addition, there is no shortage of methods that consider both node attributes and graph link connections simultaneously. For example, Zhang et al. [24] put forward a method based on the idea of collective inference, which belongs to the field of semi-supervised learning and is generally applicable to classification problems. Li et al. [13] proposed the RNC (Regression model with Network Cohesion) model, based on the idea of network cohesion, to solve the problem of network regression. They assume that the nodes in a network have different node effects (varying intercepts) and that the node effects of connected nodes tend to be similar. Based on these assumptions, Li et al. [13] obtained the RNC model by constructing a regularization term on the node effect vector through the network Laplacian matrix. However, the RNC model is only effective in low-dimensional situations because its penalty acts only on the node effects without considering the regression parameters.
Also, Tibshirani and Land [20] proposed their models with the idea of fusion penalties, which have been used to select variables based on the network-constrained variables [8,10–12,16], but this line of work is not directly relevant to ours because what we are really interested in is networks consisting of observations rather than variables.
On the other hand, in many studies the number of covariates is much larger than the number of samples, which is the common high-dimensional, low-sample-size problem. In this case, an ordinary linear regression model cannot be used directly to estimate the regression parameters. To solve this problem, researchers have proposed many regularization methods for identifying key variables in a regression framework, such as Lasso [19], SCAD [5], Elastic-Net [27], fused Lasso [20], Lars [4], adaptive Lasso [26], group Lasso [22] and hybrid regularization methods [7,9,14]. All the above methods have a group effect to some extent: covariates with strong correlation are either selected or eliminated together. However, these methods share a common limitation: they are designed purely from a computational or algorithmic perspective and do not use any prior knowledge such as graph structure information. In recent years, network penalties, usually defined as a quadratic form of the Laplacian matrix, have been used in a large number of practical applications. For example, Li and Li [11], Chen and Zhang [3] and Wang et al. [21] used network-based penalties in regression analysis to select key variables on genomic data. Thus, we expect that a fused penalty on both covariate and network link information will improve regression models.
Our task in this paper is to predict the response value using both covariate and graph link information without considering the reason why the linked nodes behave similarly. Different from the classical regression models, we consider the following interpretation: if nodes u and v are linked, then they are more likely to have similar predictive values, that is . This assumption is helpful to construct the regularization term in our proposed model, and it violates almost all guarantees that could provide good performance for the traditional regression models, while we expect it to be helpful in our predictive tasks since it suggests pooling information from neighbors. The objective function of our proposed method can be formulated as a loss function plus a penalty acting on both covariates and graph Laplacian and its explicit solution can be derived under certain conditions. In contrast to previous work, no information about the potential groups of the underlying network is required, and we verify that our proposed model presents an improvement beyond the state of the art local regularized regression models on graph data through analyzing various cases such as a uniform graph, nonuniform graph, balanced-sample, or unbalanced-sample in low-dimensional and high-dimensional situations, especially in high dimensions. Furthermore, the proposed model is effective for both sparse and dense networks, the latter with an extra graph sparsification while keeping good predictive performance.
The rest of our paper is organized as follows. In Section 2, we derive our model in the setting of linear regression. We frame it as a penalized least squares problem that has a closed-form solution. The methodology based on linear regression can then be extended to predictive models with a log-likelihood loss function, although we only explicitly investigate logistic regression. We discuss relevant theoretical properties of the estimator in Section 3. Section 4 gives a brief comparison of the simulated numerical performance for various cases. Section 5 presents a detailed comparison of our approach with several different methods on the representative ‘follow’ network (in Tencent Weibo) and citation networks (Cora and CiteSeer). We apply our method to predict user activity calculated from known statistics on the ‘follow’ network, and we aim to predict the label on the citation networks. Finally, Section 6 concludes the paper with a concise summary and directions for future research.
2. Regression with graph kernel regularization
We start by introducing the notation. All vectors are treated as column vectors by default. The underlying dataset that we are going to study consists of n observations $(x_i, y_i)$, $i = 1, \dots, n$, where $x_i \in \mathbb{R}^p$ is the vector of covariates and $y_i$ is the response variable for observation $i$. We write $Y = (y_1, \dots, y_n)^T$ for the response variable vector and $X = (x_1, \dots, x_n)^T$ for the $n \times p$ design matrix, which is treated as fixed and whose columns have been standardized to have mean 0 and variance 1. The graph connecting observations is $G = (V, E)$, where $V$ is the node set and $E$ is the edge set. The graph structure can be represented by its adjacency matrix $A$, whose element $A_{uv} = 1$ if nodes u and v are linked by an edge, i.e. $(u, v) \in E$, and 0 otherwise. We assume that there are no loops, so $A_{vv} = 0$ for any node v, and that the underlying graph is undirected, that is, $A_{uv} = A_{vu}$. The network Laplacian of $G$ is given by $L = D - A$, where $D = \mathrm{diag}(d_1, \dots, d_n)$ represents the diagonal degree matrix with node degree defined as $d_u = \sum_{v} A_{uv}$.
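As a small illustration of this notation, the sketch below builds the adjacency matrix, the degree matrix and the Laplacian of an undirected, loop-free graph from an edge list. It assumes the standard definition of the Laplacian as the degree matrix minus the adjacency matrix; the function name `graph_laplacian` and the 4-node path graph are hypothetical and serve only as an example.

```python
import numpy as np

def graph_laplacian(n_nodes, edges):
    """Return (A, D, L) for an undirected, loop-free graph given as an edge list."""
    A = np.zeros((n_nodes, n_nodes))
    for u, v in edges:
        if u == v:
            raise ValueError("self-loops are not allowed")
        A[u, v] = 1.0
        A[v, u] = 1.0  # undirected graph: the adjacency matrix is symmetric
    D = np.diag(A.sum(axis=1))  # node degrees on the diagonal
    L = D - A                   # assumed definition: Laplacian = degree - adjacency
    return A, D, L

# Hypothetical example: the path graph 0 - 1 - 2 - 3
A, D, L = graph_laplacian(4, [(0, 1), (1, 2), (2, 3)])
print(L)
```

Each row of the resulting Laplacian sums to zero, which is why a small positive multiple of the identity is often added when an invertible matrix is needed.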
2.1. Linear regression with graph kernel regularization
We define our problem more abstractly as follows in the context of linear regression. We wish to predict a real-valued output based on the corresponding covariates and the graph link structure. Assume that the graph-based linear regression model can be constructed as
| (1) |
and its matrix form can be written as
| (2) |
where , and . To be more detailed, is a p-dimensional regression coefficients vector of the covariates and , a column vector containing information that a fixed intercept term cannot explain, represents the individual nodes effect. In other words, we can also treat as a special attribute of each node, which behaves similarly whenever two different nodes are linked. As for μ, it is a small tuning parameter greater than zero to make the system strictly positive and thus to obtain numerical stability. Moreover, no assumption is made about the distribution of the error ε, although we do need to assume that , where represents identity matrix.
Including individual node effects rather than a shared intercept is important for capturing network similarity. Generally, and , n + p unknown parameters in total, cannot be estimated from n observations without additional assumptions. Thus, we construct a regularization on including and and obtain the general objective function as follows
| (3) |
where represents general loss function and represents the regularization term on . In particular, if takes the square loss function and represents the norm of , Equation (3) is the ridge regression. Also, by defining a proper kernel function, ridge regression can be re-written as
| (4) |
where λ is a tuning parameter and can be selected by cross-validation, represents the vector of predictive values and is the kernel gram matrix with for . Moreover, we shall replace by in order to keep numerical stability since may not be invertible.
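The kernel rewriting of ridge regression can be checked numerically. The sketch below assumes the common formulation in which ridge regression minimizes the squared error plus λ times the squared norm of the coefficients, and uses the linear kernel (the Gram matrix of the rows of the design matrix); the exact scaling used in Equation (4) may differ. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 8, 3, 0.5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Primal ridge: beta = (X^T X + lam * I_p)^{-1} X^T y
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fitted_primal = X @ beta

# Kernel form with the linear kernel K = X X^T: f = K (K + lam * I_n)^{-1} y
K = X @ X.T
fitted_kernel = K @ np.linalg.solve(K + lam * np.eye(n), y)

print(np.max(np.abs(fitted_primal - fitted_kernel)))
```

Both routes give the same fitted values. The kernel route only needs the n × n Gram matrix, which is what allows a graph-based kernel to be combined with the covariate kernel later on.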
In graph-based learning, we can obtain the predictive values by constructing a regularization condition on and solve the following optimization problem:
| (5) |
where is a hyperparameter. If we have a weighted graph with edges E and weights associated with , according to the property of the Laplacian, we have
| (6) |
We just set in this paper, since the identification of weights is not our main topic. But it should be noted that the regularization operator Laplacian can be treated as an inverse of some defined kernel matrix from the aspect of kernel, and exists as long as there are edges in the network, which could be seen from its definition and the fact that Equation (6) is greater than or equal to zero.
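The Laplacian property used in Equation (6) can be verified on a toy graph. The sketch below assumes unit edge weights and checks that the quadratic form of the Laplacian equals the sum of squared differences across edges, that the Laplacian is positive semidefinite, and that adding a small multiple μ of the identity makes it strictly positive definite. The graph, the vector `f` and the value of `mu` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]  # hypothetical unit-weight graph
n = 4

L = np.zeros((n, n))
for u, v in edges:
    L[u, u] += 1.0
    L[v, v] += 1.0
    L[u, v] -= 1.0
    L[v, u] -= 1.0

f = rng.normal(size=n)                    # arbitrary vector of node values
quad = f @ L @ f                          # quadratic form f^T L f
edge_sum = sum((f[u] - f[v]) ** 2 for u, v in edges)

min_eig = np.linalg.eigvalsh(L).min()     # about 0: L is positive semidefinite
mu = 0.01
min_eig_shifted = np.linalg.eigvalsh(L + mu * np.eye(n)).min()  # about mu > 0

print(quad, edge_sum, min_eig, min_eig_shifted)
```

Because the constant vector lies in the null space of the Laplacian, the unshifted matrix is singular; this is the numerical-stability role played by μ.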
It is natural to combine and using the following two lemmas, which are based on Zhang et al. [24].
Lemma 2.1
Let , then we have
Lemma 2.2
Let for , then
The proofs of Lemmas 2.1 and 2.2 are given in Appendices A and B. By using them, we can obtain the Regression model with Graph Kernel regularization (RGK) as follows
| (7) |
where The detailed derivation process is shown in Appendix C. In particular, if in Equation (7) takes the square loss, we have
| (8) |
where denotes the square of norm and the penalty term can be regarded as the fusion of the ridge regression and network penalties, which can be used not only to reduce the number of covariates but also to control the node effects. Furthermore, note that: 1) we avoid computation related to , greatly reducing the computing cost; and 2) Equation (8) is convex in , implying that there is only one global optimal solution, that is
| (9) |
where
In summary, our proposed RGK model combines both covariates and graph information through a linear kernel combination. The objective function of RGK involves a loss function with a regularization term. This regularization can not only be used to compress the number of covariates but also encourage similarity between linked nodes. We expect that it will contribute to improvement over traditional regression models.
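To make the structure of such an estimator concrete, the following is a minimal sketch of a penalized least squares problem of the same general shape: squared loss, a ridge penalty on the regression coefficients, and a Laplacian quadratic penalty on the node effects. It is not a reproduction of Equations (8) and (9); the separate weights `lam_beta` and `lam_alpha`, the shift `mu`, and the function name are assumptions made for illustration.

```python
import numpy as np

def fit_penalized_ls(X, y, L, lam_beta, lam_alpha, mu=0.01):
    """Minimize ||y - X b - a||^2 + lam_beta ||b||^2 + lam_alpha a^T (L + mu I) a."""
    n, p = X.shape
    Z = np.hstack([X, np.eye(n)])          # y is modelled as Z @ [b; a]
    M = np.zeros((p + n, p + n))           # block-diagonal penalty matrix
    M[:p, :p] = lam_beta * np.eye(p)
    M[p:, p:] = lam_alpha * (L + mu * np.eye(n))
    # Normal equations of the penalized problem: (Z^T Z + M) theta = Z^T y
    theta = np.linalg.solve(Z.T @ Z + M, Z.T @ y)
    return theta[:p], theta[p:]

# Hypothetical data on a ring graph with 12 nodes
rng = np.random.default_rng(0)
n, p = 12, 3
X = rng.normal(size=(n, p))
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

beta_hat, alpha_hat = fit_penalized_ls(X, y, L, lam_beta=0.1, lam_alpha=1.0)
print(beta_hat)
```

With positive penalty weights and μ > 0 the system matrix is positive definite, so the n + p parameters are uniquely determined even though there are only n observations.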
2.2. Model prediction and generalization
2.2.1. Prediction model and tuning parameters selection
To obtain the fitted values of the training samples, we just use . However, our task is to predict the response values of the out-of-training samples based on their available covariates and connection relationships. As mentioned above, there are different nodes effect , so predicted nodes effect are needed before obtaining the response values for testing samples. We assume that there are n training samples and test samples, leading to a network containing a total of nodes. Then the corresponding Laplacian can be written as
where corresponds to the n training samples and corresponds to the testing samples. Similarly, write the nodes effect vector as , where is estimated from training samples and needs to be predicted based on the former. With the assumption that connected nodes have similar node effects, we can obtain the nodes effect of the new samples by solving
After taking the derivative, we have
The optimal tuning parameters λ, and μ can be selected by 10-fold cross-validation (CV) on the training data. In the CV, the training samples are randomly split into 10 folds; nine folds of data and the corresponding induced subgraph are used to train a model, which is validated on the remaining fold, cycling through each of the ten folds in turn. The cross-validation error is computed as the average of the prediction errors on the folds that were left out, and the (λ, , μ) grid point that minimizes the cross-validation error is selected as the optimal parameters. However, a 10-fold cross-validation grid search takes a lot of time when there are many tuning parameters. Therefore, following Zhang et al. [23], we could just optimize the hyperparameter λ by 10-fold CV, simply take , and further make satisfy without additional optimization. Furthermore, Zhang et al. (2006) also show that the parameter setting above would not have a great impact on the numerical results, but we believe that there could be a slightly better result with optimized μ and .
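The node-effect prediction step for test samples can be sketched as follows. The sketch assumes that the quantity being minimized is the Laplacian quadratic form of the stacked node-effect vector, with the training part held fixed; setting the derivative with respect to the test part to zero then gives a linear system in the test block of the Laplacian. The displayed equations of this subsection are not reproduced here and may contain additional terms (for example involving μ). The function name and the 3-node example are hypothetical.

```python
import numpy as np

def predict_node_effects(L_full, alpha_train):
    """Predict test-node effects from training-node effects.

    L_full is the Laplacian of the whole graph with training nodes ordered first.
    Minimizing [a1; a2]^T L_full [a1; a2] over a2 gives L22 a2 = -L21 a1.
    """
    n = alpha_train.shape[0]
    L21 = L_full[n:, :n]   # test-to-training block
    L22 = L_full[n:, n:]   # test-to-test block
    return -np.linalg.solve(L22, L21 @ alpha_train)

# Hypothetical example: training nodes 0 and 1, test node 2 linked to both
L_full = np.array([[ 1.0,  0.0, -1.0],
                   [ 0.0,  1.0, -1.0],
                   [-1.0, -1.0,  2.0]])
alpha_train = np.array([1.0, 3.0])
alpha_test = predict_node_effects(L_full, alpha_train)
print(alpha_test)
```

In this example the test node receives the average of its two neighbours' effects, which matches the assumption that connected nodes have similar node effects. The test block is invertible only when every group of connected test nodes has at least one link to a training node.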
2.2.2. Model generalization
We are going to extend the RGK methodology based on the linear regression model to generalized linear regression model or other regression models. Taking for example the logistic regression, we can take the logit function as its link function, and the general form of the link function can be taken as
for any generalized linear model. Suppose the log-likelihood function is . Then if there are edges in a network, we can fit the model by minimizing
| (10) |
When is concave in and , which is the case for exponential families, we can solve this optimization problem through some convex optimization algorithm like Newton-Raphson. Note that the quadratic approximation in Equation (10) is the quadratic approximation to a log-likelihood plus a penalty. This means that the problem can be efficiently solved by iteratively reweighted linear regression with network node effects, just like the generalized linear model is fitted by iteratively reweighted least squares.
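A Newton-Raphson fit of a logistic model with node effects can be sketched as below. The objective used here (negative Bernoulli log-likelihood plus a ridge penalty on the coefficients and a Laplacian penalty on the node effects) is an assumed stand-in for Equation (10), and the penalty weights, the step-halving safeguard and the function name are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def fit_logistic_node_effects(X, y, L, lam_beta=1.0, lam_alpha=1.0, mu=0.01,
                              max_iter=50, tol=1e-8):
    """Newton-Raphson for a penalized logistic model with one effect per node."""
    n, p = X.shape
    Z = np.hstack([X, np.eye(n)])                 # linear predictor: Z @ [beta; alpha]
    M = np.zeros((p + n, p + n))
    M[:p, :p] = lam_beta * np.eye(p)
    M[p:, p:] = lam_alpha * (L + mu * np.eye(n))

    def objective(theta):
        eta = Z @ theta
        # negative log-likelihood sum(log(1 + exp(eta)) - y * eta) plus penalty
        return np.sum(np.logaddexp(0.0, eta) - y * eta) + theta @ M @ theta

    theta = np.zeros(p + n)
    for _ in range(max_iter):
        prob = 1.0 / (1.0 + np.exp(-(Z @ theta)))
        grad = Z.T @ (prob - y) + 2.0 * M @ theta
        if np.linalg.norm(grad) < tol:
            break
        W = prob * (1.0 - prob)                   # IRLS weights
        H = Z.T @ (Z * W[:, None]) + 2.0 * M      # Hessian of the objective
        step = np.linalg.solve(H, grad)
        t = 1.0                                   # halve the step if it does not descend
        while objective(theta - t * step) > objective(theta) + 1e-12 and t > 1e-8:
            t *= 0.5
        theta = theta - t * step
    return theta[:p], theta[p:]

# Hypothetical data on a ring graph with 20 nodes
rng = np.random.default_rng(0)
n, p = 20, 2
X = rng.normal(size=(n, p))
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([1.5, -1.0]))))).astype(float)

beta_hat, alpha_hat = fit_logistic_node_effects(X, y, L)
print(beta_hat)
```

Each Newton step solves a weighted, penalized least squares system, which is the sense in which the fit resembles iteratively reweighted least squares.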
3. Theoretical properties of the RGK estimator
Given the RGK estimator presented in Equation (9), it is not difficult to see that exists if and only if is invertible. Also, we know is a positive semidefinite matrix from Equation (6). Proposition 3.1 gives an explicit condition for the existence of .
Proposition 3.1
The linear RGK estimator exists whenever
The proof is given in the Appendix D. Theorem 3.2 shows the bias, variance, and mean squared error of the linear RGK estimator based on Proposition 3.1.
Theorem 3.2
The linear RGK estimator defined by Equation (9) satisfies:
The bias of the estimator is given by where denotes the bias of an estimator and ;
The variance of the estimator is given by
The mean square error of the estimator is given by
There is an upper bound on the mean square error of the predicted value. That is, where is the Frobenius norm of the matrix .
Theorem 3.2 applies to any fixed n and its proof is given in Appendix E. Finally, we explore the effect of graph sparsification when the given graph is dense, which means that both its adjacency matrix and Laplacian are dense matrices. We can leverage the sparsification method proposed in Spielman and Teng [18] to reduce the computing cost. For any , which controls the sparsity degree of the given graph, let be the Laplacian of a network on the same nodes satisfying
| (11) |
Besides, let be the minimizer of , where denotes loss function and . Similarly, let be the minimizer of , where . Theorem 3.3 provides theoretical guarantees for the accuracy of the linear RGK estimator on and .
Theorem 3.3
For , given two Laplacian matrices and satisfying Equation (11), assume is twice differentiable and g is strongly convex with , such that for any ,
Then and satisfy
(12)
The proof is given in the Appendix F. Note that the term is expected to be small, and terms , and are much smaller than the first term. Moreover, the second bound in Equation (12) is typically much greater than the first bound. Thus, we can simplify the equation and obtain a theoretical upper bound of the approximation error that is
Clearly, there is a linear relationship between the squared error bound of estimator and ε. Generally, the range of ε is from zero to one, since the current graph may be disconnected when ε is greater than one.
4. Numerical analysis
Simulations are conducted to investigate the effects of the RGK model on graph data in the context of linear and logistic regression. We illustrate it in low-dimensional and high-dimensional situations respectively and subdivide these into uniform and nonuniform graph cases. We end this section with a graph sparsification example.
4.1. Linear RGK
4.1.1. Low dimension
We conducted experiments on a simulated graph with n nodes assigned to K blocks in case of n>p. The nodes were allocated to some block independently by sampling from a multinomial distribution , and then each block was labeled with . Edges were generated as independent Bernoulli random variables with thereby forming a symmetric probability matrix containing the probabilities of within-block and between-block connections. Here we considered two cases: the uniform graph (the probabilities of nodes belonging to a certain block were equal, i.e. and the nonuniform graph (the probabilities were generated from the uniform distribution). We set , the probability of within-block connection, to 0.8 and of between-block connection to 0.2 for . The generated networks can be visualized in Figure 1. In addition, the covariates were generated from , predictor coefficients from and from , where denotes the expectation of the block and corresponds to its variance. It should be pointed out that our experimental settings are almost consistent with that of the RNC model proposed in Li et al. [13], except that the latter only considers the case of low dimensional uniform network. Also, the objective function of RNC is
where is the vector of node effects and λ is a tuning parameter, which can be selected by cross-validation. To highlight the novelty of our approach, we briefly compare the RNC and RGK models: 1) the RNC model is only effective in low-dimensional situations; and 2) the penalty of the RNC model acts only on the node effects, without considering the regression parameters. Our proposed RGK model addresses both limitations.
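The block-structured graph generation described in this subsection can be sketched as follows: nodes are assigned to blocks by multinomial sampling, and each pair of nodes is linked by an independent Bernoulli draw whose probability depends on whether the two nodes share a block. The function name, the seed and the numerical values in the example call are hypothetical and are not the settings used in the experiments.

```python
import numpy as np

def simulate_block_graph(n, block_probs, p_within, p_between, seed=0):
    """Sample block labels and a symmetric 0/1 adjacency matrix without loops."""
    rng = np.random.default_rng(seed)
    K = len(block_probs)
    blocks = rng.choice(K, size=n, p=block_probs)   # block label of each node
    same = blocks[:, None] == blocks[None, :]
    P = np.where(same, p_within, p_between)         # pairwise edge probabilities
    upper = np.triu(rng.random((n, n)) < P, k=1)    # one Bernoulli draw per pair
    A = (upper | upper.T).astype(float)             # symmetrize; diagonal stays 0
    return blocks, A

# Illustrative call: uniform block probabilities with K = 3
blocks, A = simulate_block_graph(60, [1 / 3, 1 / 3, 1 / 3],
                                 p_within=0.3, p_between=0.05)
print(A.sum() / 2, np.bincount(blocks, minlength=3))
```

A nonuniform graph in the sense of this section would be obtained by passing unequal entries in `block_probs`.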
Figure 1.

Graphs for linear low-dimension case (K = 3). (a) Uniform graph and (b) Nonuniform graph.
We compared the performance of our proposed linear RGK with RNC, ridge regression ( ), Lasso and ordinary linear regression (OLS) on each simulated graph in terms of the predictive mean absolute error (MAE). For each model, our simulated data consisted of a training set and a test set with a sample size of 100. Models were fitted on the training set only, and we computed the predictive mean absolute errors on the test data. We repeated the simulations 100 times independently for each model. Moreover, the hyperparameter λ was selected by 10-fold cross-validation, while the tuning parameter μ was simply fixed at 0.01 and satisfied as mentioned earlier. Table 1 shows the predictive mean absolute errors of the above methods.
Table 1.
Comparison of MAE (SE) using different models for uniform graph and nonuniform graph cases in the low-dimensional situation.
| Methods | Uniform graph | Nonuniform graph |
|---|---|---|
| Linear RGK | 0.8463 (0.0382) | 0.8071 (0.0505) |
| RNC | 0.8491 (0.0384) | 0.8107 (0.0493) |
| Ridge regression | 0.9024 (0.0502) | 0.8541 (0.0447) |
| Lasso | 0.8966 (0.0422) | 0.8481 (0.0470) |
| OLS | 0.8968 (0.0415) | 0.8470 (0.0466) |
The results reveal that our proposed linear RGK method gave the smallest MAE among the compared procedures in each case. In particular, both the linear RGK and RNC models are better than the other methods, which indicates that considering both covariates and link information is appropriate. Furthermore, the MAE of linear RGK is smaller than that of RNC with an almost equal standard error, though the difference is not large. Two factors might explain this: (1) the effect of our RGK penalty on covariates is not obvious due to the small number of covariates; (2) the tuning parameters μ and are not optimized. We expect a more significant improvement once these two points are addressed.
4.1.2. High dimension
In the case p>n, we generated graphs, illustrated in Figure 2, as in the low-dimensional case except that each node was accompanied by 1000 covariates. We compared the performance of our linear RGK with ridge regression and Lasso in terms of MAE. Table 2 gives the average results of 50 simulations for each model.
Table 2.
Comparison of MAE (SE) using different models for uniform graph and nonuniform graph cases in the high-dimensional situation.
| Methods | Uniform graph | Nonuniform graph |
|---|---|---|
| Linear RGK | 22.9725 (1.6939) | 23.2285 (2.0949) |
| Ridge regression | 32.9590 (2.5616) | 33.4138 (3.1157) |
| Lasso | 35.8948 (2.8278) | 36.8282 (2.7614) |
Figure 2.

Graphs for linear high-dimension case (K = 3). (a) Uniform graph and (b) Nonuniform graph.
The linear RGK still gives the smallest MAE, about a third less than the other procedures. Ridge regression performed slightly better than Lasso with almost equal standard deviations, which might be a result of multicollinearity in the covariates. This high-dimensional case confirms that incorporating both covariate information and network link structure into graph-based linear predictive models can indeed lead to considerable improvement. Moreover, the penalty imposed on these two information sources contributed to better accuracy. We conclude that the linear RGK method performed better than the other procedures in both low and high dimensions, especially in the latter case.
4.2. Logistic RGK
4.2.1. Low dimension
We conducted simulations to evaluate our proposed logistic RGK model and compared it with RNC, logistic regression (LR), ridge regression, Lasso and OLS in terms of predictive accuracy. We simulated the graphs shown in Figure 3 just as we did for the linear RGK case in the low-dimensional situation. However, the covariates were generated from and the response variable from the Bernoulli distribution with probabilities of success given by the logit function of . Therefore, the results can be naturally scaled between 0 and 1. All models were fitted on the same training set, and we computed the predictive recall scores, F1 scores, and AUC values on the test data with a sample size of 100. For each model, we repeated the simulation 50 times. Note that AUC does not depend on the threshold, while the recall score and F1 score are closely related to the choice of threshold. We set the threshold to 0.5, which means that a sample is classified as positive if its predicted probability is greater than 0.5 and as negative otherwise. Table 3 shows the predictive recall scores and F1 scores, and the corresponding comparison of ROC curves and AUC values for the models above is shown in Figure 4.
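The difference between threshold-dependent and threshold-free metrics can be illustrated with a short sketch. It assumes that the second score reported alongside recall is the F1 score, and it computes AUC as the proportion of positive-negative pairs ranked correctly (ties counted as one half). The function names and the six-sample toy data are hypothetical.

```python
import numpy as np

def recall_f1(y_true, prob, threshold=0.5):
    """Recall and F1 score when a sample is labelled positive if prob > threshold."""
    pred = prob > threshold
    pos = y_true == 1
    tp = np.sum(pred & pos)
    fp = np.sum(pred & ~pos)
    fn = np.sum(~pred & pos)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return recall, f1

def auc(y_true, prob):
    """AUC as the fraction of (positive, negative) pairs ordered correctly."""
    pos = prob[y_true == 1]
    neg = prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

# Hypothetical predicted probabilities for three positives and three negatives
y_true = np.array([1, 1, 1, 0, 0, 0])
prob = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.2])

recall_05, f1_05 = recall_f1(y_true, prob, threshold=0.5)
recall_035, f1_035 = recall_f1(y_true, prob, threshold=0.35)
auc_value = auc(y_true, prob)
print(recall_05, f1_05, recall_035, f1_035, auc_value)
```

Lowering the threshold from 0.5 to 0.35 raises recall to 1 in this toy example while the AUC stays the same, because the AUC depends only on how the samples are ranked.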
Table 3.
Comparison of predictive recall score and F1 score (SE) using different models for uniform graph and nonuniform graph cases in the low-dimensional situation.
| Methods | Uniform graph: recall score | Uniform graph: F1 score | Nonuniform graph: recall score | Nonuniform graph: F1 score |
|---|---|---|---|---|
| Logistic RGK | 0.7829 (0.1091) | 0.7143 (0.0659) | 0.7840 (0.1018) | 0.7239 (0.0699) |
| RNC | 0.7053 (0.0953) | 0.7069 (0.0670) | 0.7193 (0.0898) | 0.7212 (0.0700) |
| LR | 0.6952 (0.0986) | 0.6980 (0.0696) | 0.7041 (0.1010) | 0.7079 (0.0755) |
| Ridge regression | 0.9980 (0.0062) | 0.6727 (0.0389) | 0.9972 (0.0098) | 0.6712 (0.0522) |
| Lasso | 0.9975 (0.0068) | 0.6736 (0.0394) | 0.9952 (0.0117) | 0.6722 (0.0530) |
| OLS | 0.9967 (0.0085) | 0.6736 (0.0393) | 0.9948 (0.0125) | 0.6733 (0.0533) |
Figure 3.

Graphs for logistic low-dimension case (K = 4). (a) Uniform graph and (b) Nonuniform graph.
Figure 4.
ROC curves comparison for logistic low-dimension case (K = 4). (a) ROC curves of several procedures for uniform graph and (b) ROC curves of several procedures for nonuniform graph.
From the comparison in Table 3 and Figure 4, our proposed logistic RGK performs better than the other benchmarks. The logistic RGK gave the largest F1 score and AUC value, which indicates that our logistic RGK procedure had a better classification capability with threshold 0.5 on both the uniform graph and nonuniform graph data when n>p. Moreover, although the F1 scores of ridge regression, Lasso and OLS are comparatively low, their recall scores are much larger than those of the other models, at almost 1, implying that they recognize nearly all positive examples but also produce many misclassifications.
4.2.2. High dimension
Simulations were conducted to evaluate the performance of the logistic RGK when p>n and to compare it with logistic regression, ridge regression and Lasso in terms of predictive classification accuracy. The graphs were generated as in Section 4.2.1, while the 1000 covariates were sampled independently from . We repeated the simulations 50 times for each model; the generated graphs are shown in Figure 5. Additionally, Table 4 shows the predictive recall scores and F1 scores. The corresponding ROC curves and AUC values for the above models are shown in Figure 6.
Figure 5.

Graphs for logistic high-dimension case (K = 4). (a) Uniform graph and (b) Nonuniform graph.
Table 4.
Comparison of predictive recall score and F1 score (SE) using different models for uniform graph and nonuniform graph cases in the high-dimensional situation.
| Methods | Uniform graph: recall score | Uniform graph: F1 score | Nonuniform graph: recall score | Nonuniform graph: F1 score |
|---|---|---|---|---|
| Logistic RGK | 0.9822 (0.0167) | 0.6622 (0.0891) | 0.9783 (0.0191) | 0.6616 (0.0904) |
| LR | 0.5081 (0.0751) | 0.4981 (0.0700) | 0.5033 (0.0624) | 0.4958 (0.0812) |
| Ridge regression | 0.9993 (0.0018) | 0.6540 (0.0924) | 0.9972 (0.0098) | 0.6544 (0.0522) |
| Lasso | 0.9982 (0.0092) | 0.6552 (0.0911) | 0.9969 (0.0107) | 0.6546 (0.0922) |
Figure 6.
ROC curves comparison for logistic high-dimension case (K = 4). (a) ROC curves of several procedures for uniform graph and (b) ROC curves of several procedures for nonuniform graph.
In terms of F1 score and AUC value, we observed that logistic RGK outperformed all benchmark models in both uniform and nonuniform networks. Moreover, the recall scores were extremely similar to those in the low-dimensional case. All in all, the logistic RGK approach was better than all benchmarks on the simulated graph datasets in both the low-dimensional and high-dimensional situations, especially in the latter.
4.3. Linear RGK on the sparsified graph
This subsection uses a simple example to illustrate the graph sparsification approach in linear regression. According to Spielman and Teng [18], we can find , a sparsified graph adjacency matrix, so that a corresponding Laplacian matrix can be found satisfying Equation (11) with a nonnegative constant ε. For further details of the sparsification process, we refer readers to [18]. In this experiment, we generated a network with 1000 nodes. In addition, we set ε to 1, and the other settings were as in the uniform graph case of linear RGK in the low-dimensional situation. The average degrees of the original network and the corresponding sparsified network are 167.498 and 55.114, respectively, and the approximation quality value, a measurable index proposed in [18], is 0.93.
Next, we compared the performance between the linear RGK with original and sparsified using two aspects. Firstly, the time cost of the original adjacency matrix is twice that of the corresponding sparsified version. Secondly, the performance with the latter matrix is almost identical to that with the former in terms of the predictive mean absolute error, which implies that the sparsified captures most of the structural information that the original graph contains. Therefore, when given a dense graph with covariates in the predictive task, we could use the sparsified to replace the original , thereby reducing computation cost while maintaining good performance.
5. Real data analysis
We investigated the performance of our proposed RGK on real graph data, including linear RGK on a ‘follow’ network (in Tencent Weibo users) and logistic RGK on citation networks (Cora and CiteSeer).
5.1. Linear RGK on social networks
The ‘follow’ network contains millions of users, each provided with rich information (e.g. demographics, profile keywords, follow history, action data) that can be used to build a good prediction model. In this application, the follow history is used to form an undirected social network. The action data, including statistics about the ‘at’ (@) actions between the users and the number of tweets, retweets, and comments in a period of 90 days, can be used to describe user behavior. The profile keywords, which are extracted from the tweets, retweets and comments of a user, can be used as that user's features in our prediction model.
5.1.1. Low dimension
Our task is to predict the users' activity from their covariates and the current network. The response value is calculated from the action data of each individual, that is, the sum of the ‘at’ (@) actions between the users and the number of tweets, retweets, and comments, each divided by the length of time. Because the original data is too large, we selected a subgraph as our data in this experiment; this subgraph consisted of 1652 nodes with 8546 covariates. Furthermore, since this experiment concerns the low-dimensional setting, a dimension reduction method was needed to reduce the number of covariates. We used a random forest (RF) for this purpose and retained 30 covariates according to the RF importance ranking. Our analysis therefore focused on this selected network.
We compared the performance of linear RGK with RNC, ridge regression and OLS. We did five runs and reported the averages and standard deviations in terms of MAE. Specifically, we held out 451 samples as the test set and fitted all the models on the remaining data. In addition, the tuning parameters were determined exactly as before. Table 5 shows the predictive MAE of the methods above.
Table 5.
Comparison of MAE (SE) using different models in the low-dimensional situation.
| | Linear RGK | RNC | Ridge | OLS |
|---|---|---|---|---|
| MAE | 5.4313 | 7.9881 | 5.8204 | 6.1457 |
| | (0.3712) | (0.5830) | (0.3615) | (0.9944) |
Table 5 shows that linear RGK outperformed all benchmarks in terms of MAE. Both ridge regression and OLS performed better than RNC, which might be due to two facts: (1) the RNC penalty is imposed only on the node effects, with no constraint on the regression parameters; (2) the ‘follow’ network is so sparse that the nodes' attributes may be more informative than the link structure. Moreover, ridge regression slightly outperformed OLS, suggesting that there may be multicollinearity among the covariates. Taken together, these results indicate that penalizing both the covariates and the graph structure is reasonable.
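The reporting protocol used here (MAE on a held-out test set, averaged over repeated runs) can be sketched as below. The data are made-up toy values, and the use of the sample standard deviation is an assumption, since the text does not state which estimator of spread was used.

```python
import statistics


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted responses."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def summarize_runs(mae_per_run):
    """Return (mean, sample standard deviation) of the MAE across runs."""
    return statistics.mean(mae_per_run), statistics.stdev(mae_per_run)


# Hypothetical held-out responses and predictions for two runs
runs = [
    ([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]),  # absolute errors 0.5, 0, 1
    ([1.0, 2.0, 3.0], [1.0, 3.0, 5.0]),  # absolute errors 0, 1, 2
]
maes = [mean_absolute_error(t, p) for t, p in runs]
avg, sd = summarize_runs(maes)
```

With five runs, as in the experiments, `maes` would simply hold five values, one per run.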
5.1.2. High dimension
In this section, all 8546 covariates were used, and the other settings were the same as in Section 5.1.1. We compared the performance of the linear RGK with Lasso and ridge regression. For each method, we performed five runs and report the results on the test data. Table 6 shows the results for these models.
Table 6.
Comparison of MAE (SE) using different models in the high-dimensional situation.
| | Linear RGK | Lasso | Ridge |
|---|---|---|---|
| MAE | 5.4930 | 5.6967 | 5.7550 |
| | (0.2758) | (0.2869) | (0.3015) |
Clearly, the linear RGK gave the smallest MAE with the smallest standard deviation. It is not surprising that the differences between these models are modest, which could be because (1) we did not optimize the tuning parameters (including μ) by cross-validation, and (2) the attributes of each node often carry more information than the link structure, especially when the network is extremely sparse. So far, we have verified that the proposed linear RGK consistently outperforms the benchmarks on both simulated and real graph data.
5.2. Logistic RGK on citation networks
Cora consists of machine learning papers tagged with one of these labels: case-based, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning, or theory. Similarly, papers in CiteSeer are tagged with Agents, AI, DB, IR, ML, or HCI. In addition, these papers were selected in a way such that each paper was quoted or cited by others in the final corpus, leading to 2708 papers in the whole corpus of Cora and 3312 in CiteSeer. After stemming and removing stopwords, we were left with a vocabulary of 500 unique words for the former and 3703 for the latter. All words with document frequency less than 10 were removed in these two networks.
5.2.1. Low dimension
Because the logistic RGK is suited to binary classification and the labels in Cora are unbalanced, we selected two binary subnetworks, one for the balanced-sample case and one for the unbalanced-sample case. Specifically, we chose the nodes labeled theory and case-based for the former and the nodes labeled rule learning and probabilistic methods for the latter.
We compared the performance of our proposed logistic RGK with RNC and logistic regression in terms of the predictive error and accuracy. The two networks were segmented into a training set and a test set with a sample size proportion of , respectively. Models were fitted on the same training set, and we computed the log losses, recall scores and scores of these models on the same test set. For each model, we performed five trials and reported the averaged predictive results. Table 7 shows the detailed results in balanced-sample and unbalanced-sample cases.
Table 7.
Comparison of predictive log loss, recall score, and score (SE) using different methods for balanced-sample and unbalanced-sample cases in the low-dimensional situation.
| Methods | Balanced samples: Log loss | Balanced samples: Recall score | Balanced samples: score | Unbalanced samples: Log loss | Unbalanced samples: Recall score | Unbalanced samples: score |
|---|---|---|---|---|---|---|
| Logistic RGK | 0.7605 | 0.7955 | 0.5641 | 0.7752 | 0.7701 | 0.5883 |
| | (0.0056) | (0.0393) | (0.0207) | (0.0185) | (0.0264) | (0.0279) |
| RNC | 4.3978 | 0.5169 | 0.4891 | 4.7006 | 0.3876 | 0.3600 |
| | (0.5065) | (0.0360) | (0.0366) | (0.4476) | (0.0588) | (0.0455) |
| LR | 5.4068 | 0.5707 | 0.5378 | 5.6020 | 0.4983 | 0.4461 |
| | (0.3381) | (0.0380) | (0.0226) | (0.2807) | (0.0540) | (0.0398) |
Clearly, the logistic RGK gave a smaller log loss than RNC and logistic regression in both the balanced-sample and unbalanced-sample cases. Comparing the predicted probabilities of each model, we found that the logistic RGK was under-confident while the other two methods were over-confident: the probabilities from the former lay mainly between 0.5 and 0.7, whereas those from the latter were extremely close to 0 or 1. The logistic RGK also performed better in terms of score and recall score, showing that it is the better classifier.
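The link between confidence and log loss can be illustrated with a small sketch. The labels and probabilities below are made up and are not taken from the experiments; they only show that two classifiers making the same hard predictions (and hence having the same recall) can have very different log losses when one of them assigns near-certain probabilities to its mistakes.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels; p_pred is P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def recall(y_true, p_pred, threshold=0.5):
    """Fraction of true positives recovered when thresholding at `threshold`."""
    tp = sum(1 for y, p in zip(y_true, p_pred) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, p_pred) if y == 1 and p < threshold)
    return tp / (tp + fn)


y = [1, 1, 0, 0]                          # hypothetical test labels
moderate = [0.6, 0.6, 0.6, 0.4]           # one mistake, hedged probabilities
extreme = [0.999, 0.999, 0.999, 0.001]    # same mistake, near-certain

loss_moderate = log_loss(y, moderate)
loss_extreme = log_loss(y, extreme)
```

Both probability vectors misclassify only the third sample, but the near-certain mistake contributes a term of about -log(0.001) to the second loss, so `loss_extreme` is much larger than `loss_moderate`.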
5.2.2. High dimension
As with Cora, we selected two binary subnetworks from CiteSeer: the nodes labeled Agents and ML for the balanced-sample case, and the nodes labeled AI and IR for the unbalanced-sample case. The experimental setting is identical to that used for Cora, except that here we compared the logistic RGK only with ordinary logistic regression, because RNC is inappropriate for high-dimensional data. Table 8 gives the detailed comparative results for the balanced-sample and unbalanced-sample cases.
Table 8.
Comparison of predictive log loss, recall score, and score (SE) using logistic RGK and logistic regression (LR) for balanced-sample and unbalanced-sample cases in the high-dimensional situation.
| Methods | Balanced samples: Log loss | Balanced samples: Recall score | Balanced samples: score | Unbalanced samples: Log loss | Unbalanced samples: Recall score | Unbalanced samples: score |
|---|---|---|---|---|---|---|
| Logistic RGK | 0.8403 | 0.9960 | 0.6406 | 0.8992 | 0.9933 | 0.5769 |
| | (0.0087) | (0.0036) | (0.0136) | (0.0368) | (0.0149) | (0.0671) |
| LR | 6.7880 | 0.6098 | 0.5269 | 5.9456 | 0.5275 | 0.4171 |
| | (0.5569) | (0.2199) | (0.0566) | (0.3827) | (0.0901) | (0.0842) |
Table 8 shows that the logistic RGK substantially outperformed logistic regression in terms of the predictive log loss, recall score, and score in both the balanced-sample and unbalanced-sample cases. In summary, the proposed logistic RGK model outperforms the other methods on these two citation networks.
6. Conclusion and future research
This paper introduces a novel method based on the graph regularization formulation of the kernel method for learning from both covariates and graph link structure. The method is applicable to regression problems for graph data. The objective function of our model is a well-formed convex optimization problem, which means that a globally optimal solution can be computed efficiently. In addition, imposing a penalty on both covariates and the graph Laplacian not only leads to a reduction of covariates but also provides a clear trade-off between the bias and variance. Moreover, the proposed RGK model is somewhat similar to a standard kernel method, with an appropriately defined kernel based on the underlying graph. Experimental results on both simulated graph data and real networks in various situations indicate that the RGK model can lead to better performance in graph regression tasks.
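As a reminder of why a graph Laplacian penalty encourages similarity between linked nodes, the sketch below checks the standard identity that the quadratic form of the unnormalized Laplacian equals the sum of squared differences across edges. It illustrates only this generic smoothness term, not the full RGK objective defined earlier in the paper; the graph and the node values are hypothetical.

```python
def laplacian(n, edges):
    """Unnormalized Laplacian L = D - A of an undirected, unweighted graph."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L


def quadratic_form(L, a):
    """Compute a^T L a for a square matrix L and a vector a."""
    n = len(a)
    return sum(a[i] * L[i][j] * a[j] for i in range(n) for j in range(n))


edges = [(0, 1), (1, 2)]   # hypothetical 3-node path graph
alpha = [1.0, 1.5, 4.0]    # hypothetical node-level values

penalty = quadratic_form(laplacian(3, edges), alpha)
edge_sum = sum((alpha[i] - alpha[j]) ** 2 for i, j in edges)
```

Here `penalty` and `edge_sum` coincide, so the penalty is small exactly when connected nodes carry similar values.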
In future research, we plan to extend the RGK methodology to other settings, such as Cox's model, and to investigate its performance on other kinds of networks, such as random networks. The latter differs significantly from the present work, which treats the underlying graph as fixed. We expect that applying our combined graph kernel regularization to random networks could lead to a significant improvement over previous methods.
Funding Statement
This work was supported by the National Natural Science Foundation of China (NSFC) [grant number 71771201], [grant number 71874171], [grant number 71731010], [grant number 71631006], [grant number 71991464], [grant number 71871208] and [grant number 72071193], Anhui Provincial Natural Science Foundation, the Ministry of Education Humanities and Social Science Project of China [grant number 20YJA910006], Natural Science Foundation of Jiangsu Province (No. BK20201396), and Natural Science Foundation of the Higher Education Institutions of Jiangsu Province [grant number 19KJA180003].
Disclosure statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- 1. Abbe E., Community detection and stochastic block models: Recent developments, J. Mach. Learn. Res. 18 (2017), pp. 6446–6531.
- 2. Binkiewicz N., Vogelstein J.T., and Rohe K., Covariate-assisted spectral clustering, Biometrika 104 (2017), pp. 361–377.
- 3. Chen J. and Zhang S., Integrative analysis for identifying joint modular patterns of gene-expression and drug-response data, Bioinformatics 32 (2016), pp. 1724–1732.
- 4. Efron B., Hastie T., Johnstone I., and Tibshirani R., Least angle regression, Ann. Stat. 32 (2004), pp. 407–499.
- 5. Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360.
- 6. Goldenberg A., Zheng A.X., Fienberg S.E., and Airoldi E.M., A Survey of Statistical Network Models, Now Publishers Inc, 2010.
- 7. Huang H.H. and Liang Y., Hybrid method for gene selection in the cox proportional hazards model, Comput. Methods. Programs. Biomed. 164 (2018), pp. 65–73.
- 8. Huang H.H. and Liang Y., An integrative analysis system of gene expression using self-paced learning and scad-net, Expert. Syst. Appl. 135 (2019), pp. 102–112.
- 9. Huang H.H., Liu X.Y., and Liang Y., Feature selection and cancer classification via sparse logistic regression with the hybrid regularization, PLoS. ONE. 11 (2016), pp. e0149675.
- 10. Kim S., Pan W., and Shen X., Network-based penalized regression with application to genomic data, Biometrics 69 (2013), pp. 582–593.
- 11. Li C. and Li H., Network-constrained regularization and variable selection for analysis of genomic data, Bioinformatics 24 (2008), pp. 1175–1182.
- 12. Li C. and Li H., Variable selection and regression analysis for graph-structured covariates with an application to genomics, Ann. Appl. Stat. 4 (2010), pp. 1498.
- 13. Li T., Levina E., and Zhu J., Prediction models for network-linked data, Ann. Appl. Stat. 13 (2019), pp. 132–164.
- 14. Liang Y., Liu C., Luan X.Z., Leung K.S., Chan T.M., Xu Z.B., and Zhang H., Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification, BMC. Bioinformatics. 14 (2013), pp. 1–12.
- 15. Newman M.E. and Clauset A., Structure and inference in annotated networks, Nat. Commun. 7 (2016), pp. 1–11.
- 16. Pan W., Xie B., and Shen X., Incorporating predictor network in penalized regression with application to microarray data, Biometrics 66 (2010), pp. 474–484.
- 17. Pearson M. and West P., Drifting smoke rings, Connections 25 (2003), pp. 59–76.
- 18. Spielman D.A. and Teng S.H., Spectral sparsification of graphs, SIAM J. Comput. 40 (2011), pp. 981–1025.
- 19. Tibshirani R., Regression shrinkage and selection via the lasso, J. R. Statist. Soc. Ser. B (Methodological) 58 (1996), pp. 267–288.
- 20. Tibshirani R., Saunders M., Rosset S., Zhu J., and Knight K., Sparsity and smoothness via the fused lasso, J. R. Statist. Soc. Ser. B (Statist. Methodol.) 67 (2005), pp. 91–108.
- 21. Wang R., Su C., Wang X., Fu Q., Gao X., Zhang C., Yang J., Yang X., and Wei M., Global gene expression analysis combined with a genomics approach for the identification of signal transduction networks involved in postnatal mouse myocardial proliferation and development, Int. J. Mol. Med. 41 (2018), pp. 311–321.
- 22. Yuan M. and Lin Y., Model selection and estimation in regression with grouped variables, J. R. Statist. Soc. Ser. B (Statist. Methodol.) 68 (2006), pp. 49–67.
- 23. Zhang T. and Ando R.K., Analysis of spectral kernel design based semi-supervised learning, in Advances in Neural Information Processing Systems, 2006, pp. 1601–1608.
- 24. Zhang T., Popescul A., and Dom B., Linear prediction models with graph regularization for web-page categorization, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 821–826.
- 25. Zhang Y., Levina E., and Zhu J., Community detection in networks with node features, Electron. J. Stat. 10 (2016), pp. 3153–3178.
- 26. Zou H., The adaptive lasso and its oracle properties, J. Am. Stat. Assoc. 101 (2006), pp. 1418–1429.
- 27. Zou H. and Hastie T., Regularization and variable selection via the elastic net, J. R. Statist. Soc. Ser. B (Statist. Methodol.) 67 (2005), pp. 301–320.