Skip to main content
Journal of Applied Statistics logoLink to Journal of Applied Statistics
. 2022 Jan 31;50(6):1400–1417. doi: 10.1080/02664763.2022.2028745

Prediction models with graph kernel regularization for network data

Jie Liu a, Haojie Chen a, Yang Yang b,CONTACT
PMCID: PMC10071950  PMID: 37025276

Abstract

Traditional regression methods typically consider only covariate information and assume that the observations are mutually independent samples. However, samples usually come from individuals connected by a network in many modern applications. We present a risk minimization formulation for learning from both covariates and network structure in the context of graph kernel regularization. The formulation involves a loss function with a penalty term. This penalty can be used not only to encourage similarity between linked nodes but also lead to improvement over traditional regression models. Furthermore, the penalty can be used with many loss-based predictive methods, such as linear regression with squared loss and logistic regression with log-likelihood loss. Simulations to evaluate the performance of this model in the cases of low dimensions and high dimensions show that our proposed approach outperforms all other benchmarks. We verify this for uniform graph, nonuniform graph, balanced-sample, and unbalanced-sample datasets. The approach was applied to predicting the response values on a ‘follow’ social network of Tencent Weibo users and on two citation networks (Cora and CiteSeer). Each instance verifies that the proposed method combining covariate information and link structure with the graph kernel regularization can improve predictive performance.

Keywords: Graph regularization, node effect, kernel function, prediction, regression

1. Introduction

Recent decades have seen significant advances in data collection, leading to graph data being collected in many modern applications and corresponding graphs being created that record the relationship of individuals in a system. This information is often collected along with covariate information on each unit of the system [17]. For example, the graph data we focus on in this paper include social and citation networks. A large body of work on traditional regression methods typically takes into consideration only information about covariates and assumes that samples are mutually independent. However, samples usually come from individuals connected by a network. Much literature also pays attention to analyzing the graph structure along with the underlying network, such as work on community detecting [1,6]. Obviously, if the covariate of each node can be used to analyze the network itself, it will be more efficient to find the best communities [2,15,25]. In addition, there is no shortage of methods considering both nodes attributes and graph link connections simultaneously. For example, Zhang et al. [24] put forward a method with the idea of collective inference, which belongs to semi-supervised learning field and is generally applicable to the classification problem. Li et al. [13] proposed the RNC (Regression model with Network Cohesion) model with the idea of network cohesion to solve the problem of network regression. They assume that the nodes in a network have different nodes effect (varying intercepts) and the nodes effect of connected nodes in a network tend to be similar. Based on the assumptions above, Li et al. [13] obtained the RNC model by constructing a regularization term on the nodes effect vector through the network Laplacian matrix. However, the RNC model is only effective in low-dimensional situations because its penalty only acts on the nodes effect without considering the regression parameters. Also, Tibshirani and Land [20] proposed their models with the idea of fusion penalties, which have been used to select variables based on the network-constrained variables [8,10–12,16], but this line of work is not directly relevant to ours because what we are really interested in is networks consisting of observations rather than variables.

On the other hand, the sample dimension in many studies is usually much larger than the number of samples, that is the common high-dimensional low sample problem. In this case, an ordinary linear regression model can not be leveraged directly to estimate the regression parameters. To solve this problem, researchers have proposed many regularization methods for identifying key variables in a regression framework, such as Lasso [19], SCAD [5], Elastic-Net [27], fused Lasso [20], Lars [4], adaptive Lasso [26], group Lasso [22] and L12+2 hybrid regularization method [7,9,14]. All the above methods have group effect to some extent, the covariates with strong correlation are either selected or eliminated at the same time. However, the above methods have common limitations, that is, these methods are only from the perspective of calculation or algorithm, and do not use any prior knowledge such as graph structure information. In recent years, network penalty usually defined as the quadratic form of Laplacian matrix has been used in a large number of practical applications. For example, Li and Li [11], Chen and Zhang [3] and Wang et al. [21] used the network-based L1 penalty to conduct regression analysis to select key variables on genomic data. Thus, we expect that considering a fused penalty on both covariate and network link information in regression models will be better.

Our task in this paper is to predict the response value using both covariate and graph link information without considering the reason why the linked nodes behave similarly. Different from the classical regression models, we consider the following interpretation: if nodes u and v are linked, then they are more likely to have similar predictive values, that is p(u)=p(v). This assumption is helpful to construct the regularization term in our proposed model, and it violates almost all guarantees that could provide good performance for the traditional regression models, while we expect it to be helpful in our predictive tasks since it suggests pooling information from neighbors. The objective function of our proposed method can be formulated as a loss function plus a penalty acting on both covariates and graph Laplacian and its explicit solution can be derived under certain conditions. In contrast to previous work, no information about the potential groups of the underlying network is required, and we verify that our proposed model presents an improvement beyond the state of the art local regularized regression models on graph data through analyzing various cases such as a uniform graph, nonuniform graph, balanced-sample, or unbalanced-sample in low-dimensional and high-dimensional situations, especially in high dimensions. Furthermore, the proposed model is effective for both sparse and dense networks, the latter with an extra graph sparsification while keeping good predictive performance.

The rest of our paper is organized as follows. In Section 2, we deduce our model in the setting of linear regression. We frame it as a penalized least squares problem that has a closed-form solution. Then the methodology based on the linear regression can be extended to predictive models with log-likelihood loss function, although we only explicitly investigate the logistic regression. We exhibit a discussion of relevant theoretical properties of the estimator in Section 3. Section 4 gives a brief comparison of the simulated numerical performance for various cases. Section 5 presents a detailed comparison of our approach with several different methods on the representative ‘follow’ network(in Tencent Weibo) and citation networks(Cora and CiteSeer). We apply our method to predict user activity calculated from known statistics on the ‘follow’ network, and we aim to predict the label on the citation networks. Subsequently, a concise summary and future research direction in Section 6 concludes the paper.

2. Regression with graph kernel regularization

We start by introducing the notations. All vectors are treated as column vectors by default. The underlying dataset that we are going to study consists of n observations (x1,y1),(x2,y2),,(xn,yn), where xiRp is the vector of covariates and yiR is the response variable for ith observation. We write Y=(y1,y2,,yn)T for the response variable vector and X=(x1,x2,,xn)T for the n×p design matrix, which is treated as fixed and whose columns have been standardized to have mean 0 and variance 1. The graph connecting observations is G=(V,E), where V=(1,2,,n) is the node set, and EV×V is the edge set.The graph structure can be represented by its adjacency matrix ARn×n, whose element Auv=1 if nodes u and v are linked by an edge, i.e. (u,v)E, and 0 otherwise. We assume that there are no loops, therefore, Avv=0 for any node v and the underlying graph is undirected, that is, Auv=Avu. The network Laplacian of G is measured by L=DA, where D=(d1,d2,,dn) represents the diagonal degree matrix with node degree du defined as du=vVAuv.

2.1. Linear regression with graph kernel regularization

We define our problem more abstractly as follows in the context of linear regression. We wish to predict a real-valued output yi based on the corresponding covariates xi and the graph link structure. Assume that the graph-based linear regression model can be constructed as

yi=xiβ+μαi+ϵi, (1)

and its matrix form can be written as

Y=X~θ+ϵ, (2)

where X~=(μIn,X), θ=(α,β)T and ϵ=(ϵ1,,ϵn)T. To be more detailed, β=(β1,,βp)T is a p-dimensional regression coefficients vector of the covariates and α=(α1,,αn)T, a column vector containing information that a fixed intercept term cannot explain, represents the individual nodes effect. In other words, we can also treat αis as a special attribute of each node, which behaves similarly whenever two different nodes are linked. As for μ, it is a small tuning parameter greater than zero to make the system strictly positive and thus to obtain numerical stability. Moreover, no assumption is made about the distribution of the error ε, although we do need to assume that Eϵ=0,Var(ϵ)=σ2In, where In represents n×n identity matrix.

Including individual nodes effect α rather than a shared intercept is important to considering network similarity. Generally, α and β, n + p unknown parameters totally, cannot be estimated from n observations without additional assumptions. Thus, we are going to construct a regularization on θ including α and β and obtain the general objective function L(θ) as follows

L(θ)=l(θ)+R(θ), (3)

where l() represents general loss function and R(θ) represents the regularization term on θ. In particular, if l() takes the square loss function and R(θ) represents the L2 norm of θ, Equation (3) is the ridge regression. Also, by defining a proper kernel function, ridge regression can be re-written as

fˆ=argminfRn{1ni=1nl(yi,fi)+λ2fTK1f}, (4)

where λ is a tuning parameter and can be selected by cross-validation, f=(f1,f2,,fn) represents the vector of predictive values and K=[k(xi,xj)] is the kernel gram matrix with k(xi,xj)=xiTxj for i,j=1,2,,n. Moreover, we shall replace K by μI+K in order to keep numerical stability since K may not be invertible.

In graph-based learning, we can obtain the predictive values fis by constructing a regularization condition on L and solve the following optimization problem:

fˆ=argminfRn{1ni=1nl(yi,fi)+λ2fTLf}, (5)

where λ is a hyperparameter. If we have a weighted graph with edges E and weights wij associated with (i,j)E, according to the property of the Laplacian, we have

fTLf=(i,j)Ewij(fjfj)2. (6)

We just set wij=1 in this paper, since the identification of weights is not our main topic. But it should be noted that the regularization operator Laplacian L can be treated as an inverse of some defined kernel matrix from the aspect of kernel, and L1 exists as long as there are edges in the network, which could be seen from its definition and the fact that Equation (6) is greater than or equal to zero.

We are naturally to combine K and L with the following two Lemmas based on Zhang et al. [24].

Lemma 2.1

Let K=K1+K2, then we have

fTK1f=minf1,f2Rn{f1TK11f1+f2TK21f2:f=f1+f2}.

Lemma 2.2

Let K=[k(xi,xj)] for i,j=1,2,,n, then

fT(μI+K)1f=minβ{βTβ+μ1i=1n(fiβTxi)2}.

The proofs of Lemmas 2.1 and 2.2 are given in Appendix A and B. By using them, we can obtain the Regression model with Graph Kernel regularization (RGK) as follows

fˆ=argminfRn{1ni=1nl(yi,fi)+λ2θTθ+λ2μαTLα}, (7)

where fi=βTxi+αiμ,(i=1,2,,n). The detailed derivation process is shown in Appendix C. In particular, if l() in Equation (7) takes the square loss, we have

L(θ)=12YX~θ2+λ2θTθ+λ2μαTLα. (8)

Where 2 denotes the square of L2 norm and the penalty term λ2θTθ+λ2μαTLα can be regarded as the fusion of penalties of ridge regression and network, which can not only used to reduce the number of covariates but also control the nodes effect. Furthermore, note that: 1) we avoid computation related to K, reducing the computing cost extremely; and 2) Equation (8) is convex in θ, implying that there is only one global optimal solution, that is

θˆ=(X~TX~+λIn+p+λμM)1X~TY, (9)

where

M=(L0n×p0p×n0p×p).

In summary, our proposed RGK model combines both covariates and graph information through a linear kernel combination. The objective function of RGK involves a loss function with a regularization term. This regularization can not only be used to compress the number of covariates but also encourage similarity between linked nodes. We expect that it will contribute to improvement over traditional regression models.

2.2. Model prediction and generalization

2.2.1. Prediction model and tuning parameters selection

To obtain the fitted values of the training samples, we just use Xβˆ+μαˆ. However, our task is to predict the response values of the out-of-training samples based on their available covariates and connection relationships. As mentioned above, there are different nodes effect αis, so predicted nodes effect are needed before obtaining the response values for testing samples. We assume that there are n training samples and n test samples, leading to a network containing a total of n+n nodes. Then the corresponding Laplacian can be written as

L=[L11L12L21L22],

where L11 corresponds to the n training samples and L22 corresponds to the n testing samples. Similarly, write the nodes effect vector as (α1,α2), where α1=αˆ is estimated from training samples and α2 needs to be predicted based on the former. With the assumption that connected nodes have similar node effects, we can obtain the nodes effect of the new samples by solving

αˆ2=argmin(αˆ,α2)TL(αˆ,α2).

After taking the derivative, we have αˆ2=L221L21αˆ.

The optimal tuning parameters λ, λ and μ can be selected by 10-fold cross-validation(CV) on the training data. In the CV, the training samples are randomly split into 10 folds, where nine folds of data and the corresponding induced subgraph are leveraged to train a model for validation on the remaining tenth fold, cycling through each of the ten folds in turn. The cross-validation error is computed as the average of the prediction errors on the fold that was left out, and then the (λ, λ, μ) grid that minimizes the cross-validation error is selected as the optimal parameters. However, it will take a lot of time to pick the optimal parameters by 10-folds cross-validation grid search if there are many tuning parameters. Therefore, according to Zhang et al. [23], we could just optimize the hyperparameter λ by 10-folds CV, simply take μ=0.01, and further make λ satisfy λn=0.01 without additional optimization. Furthermore, Zhang et al. (2006) also shows that the parameter setting above would not have a great impact on the numerical results, but we believe that there could be a slightly better result with optimized μ and λ.

2.2.2. Model generalization

We are going to extend the RGK methodology based on the linear regression model to generalized linear regression model or other regression models. Taking for example the logistic regression, we can take the logit function as its link function, and the general form of the link function can be taken as

φ(EY)=Xβ+μα

for any generalized linear model. Suppose the log-likelihood function is L(α,β,X,Y). Then if there are edges in a network, we can fit the model by minimizing

L(α,β,X,Y)+(λ2θTθ+λ2μαTLα). (10)

When L() is concave in α and β, which is the case for exponential families, we can solve this optimization problem through some convex optimization algorithm like Newton-Raphson. Note that the quadratic approximation in Equation (10) is the quadratic approximation to a log-likelihood plus a penalty. This means that the problem can be efficiently solved by iteratively reweighted linear regression with network node effects, just like the generalized linear model is fitted by iteratively reweighted least squares.

3. Theoretical properties of the RGK estimator

Given the RGK estimator presented in Equation (9), it is not difficult to find that θˆ exists iff P=X~TX~+λIn+p+λμM is invertible. Also, we know L is a semi-positive matrix from Equation (6). Proposition 3.1 gives an explicit condition for the existence of θˆ.

Proposition 3.1

The linear RGK estimator θˆ exists whenever λ0.

The proof is given in the Appendix D. Theorem 3.2 shows the bias, variance, and mean squared error of the linear RGK estimator based on Proposition 3.1.

Theorem 3.2

The linear RGK estimator θˆ=(αˆ,βˆ) defined by Equation (9) satisfies:

  1. The bias of the estimator is given by
    b(θˆ)=P1(λIn+p+λμM)θ=P1(X~TX~P)θ,
    where b() denotes the bias of an estimator and P=X~TX~+λIn+p+λμM;
  2. The variance of the estimator is given by
    Var(θˆ)=σ2P1X~TX~P1σ2P1;
  3. The mean square error of the estimator is given by
    MSE(θˆ)=P1(X~TX~P)θ2+σ2tr(P1X~TX~P1)<θ2+σ2tr(P1);
  4. There is an upper bound on the mean square error of the predicted value. That is,
    EX~θˆEY2X~θF2+σ2SF2,
    where SF2 is the Frobenius norm of the matrix S=X~P1X~T.

Theorem 3.2 applies to any fixed n and its proof is given in the Appendix E. Finally, we explore the effect of the graph sparsification when the given graph is dense, which means that both its adjacent matrix and Laplacian are dense matrices. We can leverage the sparsification method proposed in Spielman and Teng [18] to reduce the computing cost. For any ϵ0, which controls the sparsity degree of the given graph, let L be the Laplacian of a network on the same nodes satisfying

(1ϵ)LL(1+ϵ)L. (11)

Besides, let θˆ be the minimizer of g(θ)=(θ)+θTWθ, where (θ) denotes loss function and W=(λIn+p+λμM). Similarly, let θˆ be the minimizer of g(θ)=(θ)+θTWθ, where W=(λIn+p+λμM). Theorem 3.3 provides theoretical guarantees for the accuracy of the linear RGK estimator on L and L.

Theorem 3.3

For 0ϵ12, given two Laplacian matrices L and L satisfying Equation (11), assume (θ) is twice differentiable and g is strongly convex with m0, such that for any θRn+p,

2g(θ)mIn+p.

Then θˆ and θˆ satisfy

θˆθˆ22mmin{λμϵ[2θˆTMθ+|θˆTMθˆθˆTMθ|+2ϵθˆTMθˆ]2(λμϵ)2m+λϵ(1+2ϵ)θˆTθˆ,2(λμϵ)2mMθˆ2}. (12)

The proof is given in the Appendix F. Note that the term θˆTMθ=αˆTLαˆ is expected to be small, and terms |θˆTMθˆθˆTMθ|, αˆTLαˆ and θˆTθˆ are much smaller than the first term. Moreover, the second bound in Equation (12) is typically much greater than the first bound. Thus, we can simplify the equation and obtain a theoretical upper bound of the approximation error θˆθˆ2, that is

θˆθˆ24λμϵmαˆTLαˆ.

Clearly, there is a linear relationship between the squared error bound of estimator and ε. Generally, the range of ε is from zero to one, since the current graph may be disconnected when ε is greater than one.

4. Numerical analysis

Simulations are conducted to investigate the effects of the RGK model on graph data in the context of linear and logistic regression. We illustrate it in low-dimensional and high-dimensional situations respectively and subdivide these into uniform and nonuniform graph cases. We end this section with a graph sparsification example.

4.1. Linear RGK

4.1.1. Low dimension

We conducted experiments on a simulated graph with n nodes assigned to K blocks in case of n>p. The nodes were allocated to some block independently by sampling from a multinomial distribution PN(π1,π2,,πK), and then each block was labeled with ci(i=1,2,,K). Edges eijs were generated as independent Bernoulli random variables with p(eij=1)=pij, thereby forming a symmetric probability matrix containing the probabilities of within-block and between-block connections. Here we considered two cases: the uniform graph (the probabilities of nodes belonging to a certain block were equal, i.e. π1=π2==πK) and the nonuniform graph (the probabilities were generated from the uniform distribution). We set pii, the probability of within-block connection, to 0.8 and pij of between-block connection to 0.2 for ij. The generated networks can be visualized in Figure 1. In addition, the covariates were generated from Pois(1), predictor coefficients β from N(0,1) and αis from N(τi,s2), where τi denotes the expectation of the ith block and s2 corresponds to its variance. It should be pointed out that our experimental settings are almost consistent with that of the RNC model proposed in Li et al. [13], except that the latter only considers the case of low dimensional uniform network. Also, the objective function of RNC is

L(α,β)=yXβα2+λαTLα,

where α is the vector of nodes effect and λ is a tuning parameter, which can be identified by cross-validation. Therefore, to highlight the novelty more easily, a brief comparison between RNC and RGK models is conducted, including that 1) the RNC model is only effective in low-dimensional situation because of its limited scope of application; and 2) the penalty of RNC model only acts on the nodes effect without considering the regression parameters. However, our proposed RGK model could negate the above two points.

Figure 1.

Figure 1.

Graphs for linear low-dimension case (K = 3). (a) Uniform graph and (b) Nonuniform graph.

We compared the performance of our proposed linear RGK with RNC, ridge regression ( RI_REG), Lasso and ordinary linear regression (OLS) on each simulated graph in terms of the predictive mean absolute error(MAE). For each model, our simulated data consisted of a training set and a test set with a sample size of 100. Models were fitted on the training set only, and we computed the predictive mean absolute errors on the test data. We repeated the simulations 100 times independently for each model. Moreover, the hyperparameter λ was identified by 10-fold cross-validation, while the tuning parameter μ was just simply fixed as 0.01 and λ satisfied λn=0.01 as mentioned early. Table 1 shows the predictive mean absolute errors of the above methods.

Table 1.

Comparison of MAE (SE) using different models for uniform graph and nonuniform graph cases in the low-dimensional situation.

Methods Uniform graph Nonuniform graph
Linear RGK 0.8463 0.8071
  (0.0382) (0.0505)
RNC 0.8491 0.8107
  (0.0384) (0.0493)
RID_REG 0.9024 0.8541
  (0.0502) (0.0447)
Lasso 0.8966 0.8481
  (0.0422) (0.0470)
OLS 0.8968 0.8470
  (0.0415) (0.0466)

The results reveal that our proposed linear RGK method gave the smallest MAE among the compared procedures above for each case. Specially, both the linear RGK and RNC models are better than other methods, which indicates considering both covariates and link information is proper. Furthermore, the MAE of linear RGK is indeed smaller than that of RNC with almost equal variance, though it is not remarkable, which might be caused by that: (1) the effect of our RGK penalty on covariates is not obvious due to the small number of covariates; (2) the tuning parameters μ and λ are not optimized. We expect that there will be a significant improvement if we take steps to negate the above two points.

4.1.2. High dimension

In case of p>n, we generated graphs, illustrated in Figure 2, as in the low-dimensional case except each node was accompanied with 1000 covariates. We compared the performance of our linear RGK with ridge regression and Lasso in terms of MAE. Table 2 gives the average results of 50 simulations for each model.

Table 2.

Comparison of MAE (SE) using different models for uniform graph and nonuniform graph cases in the high-dimensional situation.

Methods Uniform graph Nonuniform graph
Linear RGK 22.9725 23.2285
  (1.6939) (2.0949)
RID_REG 32.9590 33.4138
  (2.5616) (3.1157)
Lasso 35.8948 36.8282
  (2.8278) (2.7614)
Figure 2.

Figure 2.

Graphs for linear high-dimension case (K = 3). (a) Uniform graph and (b) Nonuniform graph.

We can obtain a direct insight on the result that the linear RGK still gives the smallest MAE, about a third less than the other procedures. Ridge regression performed a little bit better than Lasso with almost equal standard deviations, which might be a result of multicollinearity in the covariates. This high-dimensional case confirms that incorporating both covariate information and network link structure into graph-based linear predictive models can exactly lead to considerable improvement. Moreover, the penalty imposed on these two information sources contributed to better accuracy. We conclude that the linear RGK method performed better than other procedures in low and high dimensions, especially in the latter case.

4.2. Logistic RGK

4.2.1. Low dimension

We conducted simulations to evaluate our proposed logistic RGK model and compared it with RNC, logistic regression(LR), ridge regression, Lasso and OLS in terms of predictive accuracy. We simulated the graphs seen in Figure 3 just as we did for the linear RGK case in the low-dimensional situation. However, the covariates were generated from N(0,1) and the response variable yis from the Bernoulli distribution with probabilities of success given by the logit function of XTβ+α. Therefore, the results can be naturally scaled between 0 and 1. All models were fitted on the same training set, and we computed the predictive recall scores, F1 scores, and AUC values on the test data with a sample size of 100. For each model, we repeated the simulation 50 times. Moreover, it is important to point out that AUC is irrelevant to the threshold, while recall score and F1 score are closely related to the selection of the threshold. We set the threshold to 0.5, which means that a sample is a positive example if its predictive probability is greater than 0.5 and is a negative example otherwise. Table 3 shows the predictive recall scores and F1 scores, and corresponding comparison of ROC curves and AUC values for the models above is shown in Figure 4.

Table 3.

Comparison of predictive recall score and F1 score (SE) using different models for uniform graph and nonuniform graph cases in the low-dimensional situation.

  Uniform graph Nonuniform graph
Methods recall score F1 score recall score F1 score
Logistic RGK 0.7829 0.7143 0.7840 0.7239
  (0.1091) (0.0659) (0.1018) (0.0699)
RNC 0.7053 0.7069 0.7193 0.7212
  (0.0953) (0.0670) (0.0898) (0.0700)
LR 0.6952 0.6980 0.7041 0.7079
  (0.0986) (0.0696) (0.1010) (0.0755)
RID_REG 0.9980 0.6727 0.9972 0.6712
  (0.0062) (0.0389) (0.0098) (0.0522)
Lasso 0.9975 0.6736 0.9952 0.6722
  (0.0068) (0.0394) (0.0117) (0.0530)
OLS 0.9967 0.6736 0.9948 0.6733
  (0.0085) (0.0393) (0.0125) (0.0533)
Figure 3.

Figure 3.

Graphs for logistic low-dimension case (K = 4). (a) Uniform graph and (b) Nonuniform graph.

Figure 4.

Figure 4.

ROC curves comparison for logistic low-dimension case (K = 4). (a) ROC curves of several procedures for uniform graph and (b) ROC curves of several procedures for nonuniform graph.

From the comparison seen in Table 3 and Figure 4, it is easy to see that our proposed logistic RGK can perform better than other benchmarks. The logistic RGK gave the largest F1 score and AUC value, which indicates that our logistic RGK procedure had a better classification capability with threshold 0.5 on both the uniform graph and nonuniform graph data when n>p. Moreover, we can also find that although the F1 scores of ridge regression, Lasso and OLS are not so good, their recall scores are significantly bigger than other models, almost to 1, implying that their ability to recognize the minority class is better but there are also many misclassification cases.

4.2.2. High dimension

Simulations were conducted to evaluate the performance of the logistic RGK when p>n and to compare it with logistic regression, ridge regression and Lasso in terms of predictive classification accuracy. The way to generate graphs was consistent with that in Section 4.2.1, while the covariates with a size of 1000 were sampled independently from N(1,5). We repeated the simulations 50 times for each model, and generated graphs in Figure 5. Additionally, Table 4 shows the predictive recall scores and F1 scores. The corresponding ROC curves and AUC values for the above models are shown in Figure 6.

Figure 5.

Figure 5.

Graphs for logistic high-dimension case (K = 4). (a) Uniform graph and (b) Nonuniform graph.

Table 4.

Comparison of predictive recall score and F1 score (SE) using different models for uniform graph and nonuniform graph cases in the high-dimensional situation.

  Uniform graph Nonuniform graph
Methods recall score F1 score recall score F1 score
Logistic RGK 0.9822 0.6622 0.9783 0.6616
  (0.0167) (0.0891) (0.0191) (0.0904)
LR 0.5081 0.4981 0.5033 0.4958
  (0.0751) (0.0700) (0.0624) (0.0812)
RI_REG 0.9993 0.6540 0.9972 0.6544
  (0.0018) (0.0924) (0.0098) (0.0522)
Lasso 0.9982 0.6552 0.9969 0.6546
  (0.0092) (0.0911) (0.0107) (0.0922)
Figure 6.

Figure 6.

ROC curves comparison for logistic high-dimension case (K = 4). (a) ROC curves of several procedures for uniform graph and (b) ROC curves of several procedures for nonuniform graph.

In terms of F1 score and AUC value, we observed that logistic RGK outperformed all benchmark models in both uniform and nonuniform networks. Moreover, the recall scores were extremely similar to those in the low-dimensional case. All in all, the logistic RGK approach was better than all benchmarks on the simulated graph datasets both in the low-dimensional and high-dimensional situations, especially in the latter.

4.3. Linear RGK on the sparsified graph

This subsection uses a simple example to illustrate the graph sparsification approach in the linear regression. According to Spielman and Teng [18], we can find A, a sparsified graph adjacency matrix, so that a corresponding Laplacian matrix L can be found satisfying Equation (11) with a nonnegative constant ε. To understand the sparsification process further, we refer readers to [18]. In this experiment, we generated a network with 1000 nodes. In addition, we set ε to 1, and the other settings were as in the uniform graph case of linear RGK in the low-dimensional situation. After calculation, we obtained that the average degree of the original network and corresponding sparsified network are 167.498 and 55.114, respectively, and the approximation quality value, a measurable index proposed in [18], is 0.93.

Next, we compared the performance between the linear RGK with original L and sparsified L using two aspects. Firstly, the time cost of the original adjacency matrix is twice that of the corresponding sparsified version. Secondly, the performance with the latter matrix is almost identical to that with the former in terms of the predictive mean absolute error, which implies that the sparsified L captures most of the structural information that the original graph contains. Therefore, when given a dense graph with covariates in the predictive task, we could use the sparsified L to replace the original L, thereby reducing computation cost while maintaining good performance.

5. Real data analysis

We investigated the performance of our proposed RGK on real graph data, including linear RGK on a ‘follow’ network (in Tencent Weibo users) and logistic RGK on citation networks (Cora and CiteSeer).

5.1. Linear RGK on social networks

The users in the ‘follow’ network, numbered in the millions, are provided with rich information (e.g. demographics, profile keywords, follow history, action data) to generate a good prediction model. In this application, the follow history is used to form an undirected social network. The action data, including statistics about the ‘at’ ( @) actions between the users and the number of tweets, retweets, and comments in a period of 90 days, can be used to describe user behavior. The profile keywords containing the keywords extracted from the tweet/retweet/comments of a user can be used as one s features in our prediction model.

5.1.1. Low dimension

Our task is to predict the users activity from their covariates and the current network. The response value is calculated based on the action data of each individual, that is, the summation of ‘at’ ( @) actions between the users and the number of tweets, retweets, and comments, each divided by the length of time. Note that we just selected a subgraph as our data in this experiment since the original data is too huge; this subgraph consisted of 1652 nodes with 8546 covariates. Furthermore, what we investigated here was in the low-dimensional context, so some dimension reduction method was needed to reduce the number of covariates. We adopted the random forest(RF) to arrive at this goal, and we were left with 30 covariates according to the importance rank of RF. Therefore, our focus was concentrated on this selected network.

We compared the performance of linear RGK with RNC, ridge regression and OLS. We did five runs and reported the averages and standard deviations in terms of MAE. Specially, we held out 451 samples as the test samples and fitted all the models on the rest data. In addition, the way to determining the tuning parameters is exactly the same as before. Table 5 shows the predictive MAE of the methods above.

Table 5.

Comparison of MAE (SE) using different models in the low-dimensional situation.

  Linear RGK RNC RID_REG OLS
MAE 5.4313 7.9881 5.8204 6.1457
  (0.3712) (0.5830) (0.3615) (0.9944)

Table 5 shows that our linear RGK outperformed all benchmarks in terms of the MAE. Note that both ridge regression and OLS performed better than RNC, which might be due to the following two facts: (1) the penalty of RNC is only imposed on the nodes effect without constraints on the regression parameters; (2) the ‘follow’ network is so sparsified that the nodes' attributes may be more useful than the graph link information for model effect. Moreover, ridge regression outperformed OLS slightly, implying that there could be multicollinearity in the covariates. All the analysis above indicates that imposing the penalty on covariates and graph structure is reasonable.

5.1.2. High dimension

In this section, all 8546 covariates were taken into consideration and other settings were like that in Section 5.1.1. We compared the performance of the linear RGK with Lasso and ridge regression. For each method, we performed five runs and reported the results on the test data. Table 6 shows the results of these different models.

Table 6.

Comparison of MAE (SE) using different models in the high-dimensional situation.

  Linear RGK RID_REG Lasso
MAE 5.4930 5.6967 5.7550
  (0.2758) (0.2869) (0.3015)

Clearly, the linear RGK gave the smallest MAE with the smallest standard deviation. It is not surprising that the performance difference between these models above is not explicitly remarkable, which could be caused by: (1) We did not optimize the tuning parameters μ and λ using cross-validation. (2) The attributes of each node often carry more information than the graph link structure, especially when the network is extremely sparsified. So far, we have verified that our proposed linear RGK outperforms the benchmarks consistently on both simulated and real graph data.

5.2. Logistic RGK on citation networks

Cora consists of machine learning papers tagged with one of these labels: case-based, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning, or theory. Similarly, papers in CiteSeer are tagged with Agents, AI, DB, IR, ML, or HCI. In addition, these papers were selected in a way such that each paper was quoted or cited by others in the final corpus, leading to 2708 papers in the whole corpus of Cora and 3312 in CiteSeer. After stemming and removing stopwords, we were left with a vocabulary of 500 unique words for the former and 3703 for the latter. All words with document frequency less than 10 were removed in these two networks.

5.2.1. Low dimension

Considering the logistic RGK is suitable for binary classification and the labels in Cora are unbalanced, we selected binary subnetworks, respectively, in cases of the balanced-sample and unbalanced-sample. Specially, we chose nodes labeled theory and case-based for the former and nodes tagged rule learnings and probabilistic methods for the latter.

We compared the performance of our proposed logistic RGK with RNC and logistic regression in terms of the predictive error and accuracy. The two networks were segmented into a training set and a test set with a sample size proportion of 30%, respectively. Models were fitted on the same training set, and we computed the log losses, recall scores and Fβ(β=1,2) scores of these models on the same test set. For each model, we performed five trials and reported the averaged predictive results. Table 7 shows the detailed results in balanced-sample and unbalanced-sample cases.

Table 7.

Comparison of predictive log loss, recall score, and Fβ score (SE) using different methods for balanced-sample and unbalanced-sample cases in the low-dimensional situation.

  Balanced samples Unbalanced samples
Methods Log loss Recall score F1 score Log loss Recall score F2 score
Logistic RGK 0.7605 0.7955 0.5641 0.7752 0.7701 0.5883
  (0.0056) (0.0393) (0.0207) (0.0185) (0.0264) (0.0279)
RNC 4.3978 0.5169 0.4891 4.7006 0.3876 0.3600
  (0.5065) (0.0360) (0.0366) (0.4476) (0.0588) (0.0455)
LR 5.4068 0.5707 0.5378 5.6020 0.4983 0.4461
  (0.3381) (0.0380) (0.0226) (0.2807) (0.0540) (0.0398)

Obviously, the logistic RGK gave a smaller log loss than RNC and logistic regression in both the balanced-sample and unbalanced-sample cases. Moreover, we found that the logistic RGK was under-confident, while the other two methods were over-confident through comparing the predictive probabilities of each model. It is because the probabilities of the former were mainly between 0.5 and 0.7, while the latter were extremely close to 0 or 1. Moreover, the logistic RGK also performed better in terms of Fβ score and recall score, showing that the logistic RGK is a better classifier.

5.2.2. High dimension

Like the case in Cora, We selected nodes labeled Agents and ML as the final dataset in the balanced-sample case and nodes labeled AI and IR for the unbalanced-sample case. The experimental setting is completely identical to that in the Cora except that here we only compared the performance of the logistic RGK with ordinary logistic regression, due to RNC being inappropriate for high-dimensional data. Table 8 gives the detailed comparative results for the balanced-sample and unbalanced-sample cases.

Table 8.

Comparison of predictive log loss, recall score, and Fβ score (SE) using Logistic RGK and logistic regression(LR) for balanced-sample and unbalanced-sample cases in the high-dimensional situation.

  Balanced samples Unbalanced samples
Methods Log loss Recall score F1 score Log loss Recall score F2 score
Logistic RGK 0.8403 0.9960 0.6406 0.8992 0.9933 0.5769
  (0.0087) (0.0036) (0.0136) (0.0368) (0.0149) (0.0671)
LR 6.7880 0.6098 0.5269 5.9456 0.5275 0.4171
  (0.5569) (0.2199) (0.0566) (0.3827) (0.0901) (0.0842)

Table 8 shows that the logistic RGK outperformed logistic regression remarkably in terms of the predictive log loss, recall score, and Fβ score both in the balanced-sample and unbalanced-sample cases. To sum up, we conclude that our proposed logistic RGK model outperforms other methods on these two citation networks.

6. Conclusion and future research

This paper introduces a novel method based on the graph regularization formulation of the kernel method for learning from both covariates and graph link structure. The method is applicable to regression problems for graph data. The objective function of our model is a well-formed convex optimization problem, which means that a globally optimal solution can be computed efficiently. In addition, imposing a penalty on both covariates and the graph Laplacian not only leads to a reduction of covariates but also provides a clear trade-off between the bias and variance. Moreover, the proposed RGK model is somewhat similar to a standard kernel method, with an appropriately defined kernel based on the underlying graph. Experimental results on both simulated graph data and real networks in various situations indicate that the RGK model can lead to better performance in graph regression tasks.

In the future research, we are going to extend the RGK methodology to other situations such as Cox s model and to investigate the performance of RGK on other kinds of networks such as random networks, which is significantly different from what we did here, treating the underlying graph as fixed. We would expect that applying our combined graph kernel regularization to random networks could lead to a significant improvement compared to the previous methods.

Supplementary Material

Supplemental Material

Funding Statement

This work was supported by the National Natural Science Foundation of China (NSFC) [grant number 71771201], [grant number 71874171], [grant number 71731010], [grant number 71631006], [grant number 71991464], [grant number 71871208] and [grant number 72071193], Anhui Provincial Natural Science Foundation, the Ministry of Education Humanities and Social Science Project of China [grant number 20YJA910006], Natural Science Foundation of Jiangsu Province (No. BK20201396), and Natural Science Foundation of the Higher Education Institutions of Jiangsu Province [grant number 19KJA180003].

Disclosure statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  • 1.Abbe E., Community detection and stochastic block models: Recent developments, J. Mach. Learn. Res. 18 (2017), pp. 6446–6531. [Google Scholar]
  • 2.Binkiewicz N., Vogelstein J.T., and Rohe K., Covariate-assisted spectral clustering, Biometrika 104 (2017), pp. 361–377. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Chen J. and Zhang S., Integrative analysis for identifying joint modular patterns of gene-expression and drug-response data, Bioinformatics 32 (2016), pp. 1724–1732. [DOI] [PubMed] [Google Scholar]
  • 4.Efron B., Hastie T., Johnstone I., and Tibshirani R., Least angle regression, Ann. Stat. 32 (2004), pp. 407–499. [Google Scholar]
  • 5.Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360. [Google Scholar]
  • 6.Goldenberg A., Zheng A.X., Fienberg S.E., and Airoldi E.M., A Survey of Statistical Network Models, Now Publishers Inc, 2010. [Google Scholar]
  • 7.Huang H.H. and Liang Y., Hybrid l1/2+2 method for gene selection in the cox proportional hazards model, Comput. Methods. Programs. Biomed. 164 (2018), pp. 65–73. [DOI] [PubMed] [Google Scholar]
  • 8.Huang H.H. and Liang Y., An integrative analysis system of gene expression using self-paced learning and scad-net, Expert. Syst. Appl. 135 (2019), pp. 102–112. [Google Scholar]
  • 9.Huang H.H., Liu X.Y., and Liang Y., Feature selection and cancer classification via sparse logistic regression with the hybrid l1/2+2 regularization, PLoS. ONE. 11 (2016), pp. e0149675. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Kim S., Pan W., and Shen X., Network-based penalized regression with application to genomic data, Biometrics 69 (2013), pp. 582–593. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Li C. and Li H., Network-constrained regularization and variable selection for analysis of genomic data, Bioinformatics 24 (2008), pp. 1175–1182. [DOI] [PubMed] [Google Scholar]
  • 12.Li C. and Li H., Variable selection and regression analysis for graph-structured covariates with an application to genomics, Ann. Appl. Stat. 4 (2010), pp. 1498. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Li T., Levina E., and Zhu J., Prediction models for network-linked data, Ann. Appl. Stat. 13 (2019), pp. 132–164. [Google Scholar]
  • 14.Liang Y., Liu C., Luan X.Z., Leung K.S., Chan T.M., Xu Z.B., and Zhang H., Sparse logistic regression with a l 1/2 penalty for gene selection in cancer classification, BMC. Bioinformatics. 14 (2013), pp. 1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Newman M.E. and Clauset A., Structure and inference in annotated networks, Nat. Commun. 7 (2016), pp. 1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Pan W., Xie B., and Shen X., Incorporating predictor network in penalized regression with application to microarray data, Biometrics 66 (2010), pp. 474–484. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Pearson M. and West P., Drifting smoke rings, Connections 25 (2003), pp. 59–76. [Google Scholar]
  • 18.Spielman D.A. and Teng S.H., Spectral sparsification of graphs, SIAM J. Comput. 40 (2011), pp. 981–1025. [Google Scholar]
  • 19.Tibshirani R., Regression shrinkage and selection via the lasso, J. R. Statist. Soc. Ser. B (Methodological) 58 (1996), pp. 267–288. [Google Scholar]
  • 20.Tibshirani R., Saunders M., Rosset S., Zhu J., and Knight K., Sparsity and smoothness via the fused lasso, J. R. Statist. Soc. Ser. B (Statist. Methodol.) 67 (2005), pp. 91–108. [Google Scholar]
  • 21.Wang R., Su C., Wang X., Fu Q., Gao X., Zhang C., Yang J., Yang X., and Wei M., Global gene expression analysis combined with a genomics approach for the identification of signal transduction networks involved in postnatal mouse myocardial proliferation and development, Int. J. Mol. Med. 41 (2018), pp. 311–321. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Yuan M. and Lin Y., Model selection and estimation in regression with grouped variables, J. R. Statist. Soc. Ser. B (Statist. Methodol.) 68 (2006), pp. 49–67. [Google Scholar]
  • 23.Zhang T. and Ando R.K., Analysis of spectral kernel design based semi-supervised learning, in Advances in Neural Information Processing Systems, 2006, pp. 1601–1608.
  • 24.Zhang T., Popescul A., and Dom B., Linear prediction models with graph regularization for web-page categorization, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 821–826.
  • 25.Zhang Y., Levina E., and Zhu J., Community detection in networks with node features, Electron. J. Stat. 10 (2016), pp. 3153–3178. [Google Scholar]
  • 26.Zou H., The adaptive lasso and its oracle properties, J. Am. Stat. Assoc. 101 (2006), pp. 1418–1429. [Google Scholar]
  • 27.Zou H. and Hastie T., Regularization and variable selection via the elastic net, J. R. Statist. Soc. Ser. B (Statist. Methodol.) 67 (2005), pp. 301–320. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental Material

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis

RESOURCES