PLoS One. 2021 Aug 13;16(8):e0256187. doi: 10.1371/journal.pone.0256187

Compressing deep graph convolution network with multi-staged knowledge distillation

Junghun Kim 1, Jinhong Jung 2, U Kang 1,*
Editor: Yuchen Qiu3
PMCID: PMC8363007  PMID: 34388224

Abstract

Given a trained deep graph convolution network (GCN), how can we effectively compress it into a compact network without significant loss of accuracy? Compressing a trained deep GCN into a compact GCN is of great importance for deploying the model to environments with limited computing resources, such as mobile or embedded systems. However, previous works for compressing deep GCNs do not consider the multi-hop aggregation of deep GCNs, even though such aggregation is the main purpose of their multiple GCN layers. In this work, we propose MustaD (Multi-staged knowledge Distillation), a novel approach for compressing deep GCNs to single-layered GCNs through multi-staged knowledge distillation (KD). MustaD distills the knowledge of 1) the aggregation from multiple GCN layers as well as 2) task prediction, while preserving the multi-hop feature aggregation of deep GCNs with a single effective layer. Extensive experiments on four real-world datasets show that MustaD provides state-of-the-art performance compared to other KD-based methods. Specifically, MustaD improves accuracy by up to 4.21%p compared to the second-best KD models.

Introduction

Given a trained deep graph convolution network, how can we compress it into a compact network without a significant drop in accuracy? Graph Convolution Network (GCN) [1] learns latent node representations in graph data, and plays a crucial role as a feature extractor when a model is jointly trained to learn node features and perform a specific task. GCN has attracted considerable attention from the research community because it enables researchers to easily and effectively analyze graphs. Various GCN models [2–4] have been proposed to boost the performance of tasks on real-world graphs such as node and graph classification [1], link prediction [5], relation reasoning [6], etc.

Recently, research on deep-layered GCNs has been actively conducted to extract sophisticated node features from large and complicated graphs [7–13]. These deep GCN models stack many layers to better understand the patterns of large graphs and improve their performance. However, as the number of layers increases, the number of parameters to be trained also increases, which leads to a non-negligible increase in model size. Therefore, it is difficult to use such large models in environments with limited computing resources such as mobile or embedded systems.

Model compression aims to learn compressed and lightweight deep networks for low-powered and resource-limited devices without a significant loss of predictive accuracy. For this purpose, many researchers have proposed various strategies such as parameter pruning [14], low-rank factorization [15], weight quantization [16], and knowledge distillation [17]. Among them, Knowledge Distillation (KD) has been popular due to its simplicity based on a student-teacher model; KD distills the knowledge from a large teacher model into a smaller student model so that the student performs as well as the teacher [18–20]. In this context, Yang et al. [21] have recently proposed a KD method called LSP (Local Structure Preserving) for compressing GCN models. However, LSP deals with rather shallow models, and only distills limited knowledge on feature aggregation of a teacher while disregarding various aspects that should be considered when a network becomes deep. Specifically, LSP does not consider the teacher's knowledge on multi-hop feature aggregation although this process is essential in a deep-layered GCN; thus, its ability to preserve accuracy is limited, especially for compressing a deep GCN.

In this paper, we propose MustaD (Multi-staged knowledge Distillation), a novel approach for compressing deep GCNs to single-layered GCNs through multi-staged knowledge distillation (KD) while preserving the multi-hop feature aggregation of deep GCNs. Based on the concept of knowledge distillation, MustaD aims to train a single-layered student GCN with the same or even a lower feature dimension than that of a trained teacher GCN. The framework of MustaD is illustrated in Fig 1. Our main idea is to distill the knowledge of multi-hop feature aggregation from multiple GCN layers as well as that of task prediction. Specifically, the single-layered student learns the knowledge of multi-hop feature aggregation of the teacher by 1) matching hidden feature embeddings from the teacher, and by 2) imitating the multiple GCN layers of the teacher with a single effective layer. The knowledge of task prediction is distilled to the student by transferring the probabilistic prediction vector of the teacher. These multi-staged knowledge distillations guide the student to obtain aggregated features and predictions similar to those of the deep-layered teacher with significantly fewer parameters.

Fig 1. Framework of MustaD.


MustaD preserves the multi-hop feature aggregation of a teacher with a single effective layer in a student. Furthermore, MustaD distills knowledge of 1) aggregation from multi-staged GCN layers as well as 2) task prediction. h_{i;t} represents the teacher's last hidden embedding of node i, and \tilde{h}_{i;s} corresponds to the student's last hidden embedding of node i whose hidden dimension is matched to the teacher. p_{i;t} and \tilde{p}_{i;s} denote the prediction probability vectors of node i of the teacher and the student, respectively.

Fig 2 depicts the overall performance of our MustaD compared to other KD-based methods. Our proposed method Student_MustaD shows the best performance among KD methods, especially for deep teachers.

Fig 2. Accuracy of student models for different numbers of GCN layers in a teacher model.


Student_KD and Student_LSP represent the students trained by distilling the knowledge of classes, and knowledge of the embedded topological structure of a teacher, respectively. Student_Base corresponds to a model trained with the ground truth labels without the teacher. Note that our proposed MustaD (denoted as Student_MustaD) provides the highest accuracy in most cases. We also observe that MustaD provides much better performance for deep GCN with many layers, unlike competitors whose performances do not improve with more layers.

Our contributions are summarized as follows:

  • Method. We propose MustaD, a novel approach for compressing deep-layered GCNs through distilling the knowledge of both the feature aggregation and the feature representation. We propose a simple but powerful method to preserve the multi-hop feature aggregation of the teacher with significantly fewer parameters.

  • Theory. We provide theoretical analysis of the proposed MustaD, and show that the expressiveness of the student from MustaD is similar to that of a deep-layered GCN on a spectral domain.

  • Experiment. We validate MustaD on two trained deep GCN models in four datasets, compared to other distillation-based GCN compression methods. In particular, we improve the accuracy by 3.95%p, 3.77%p, and 4.21%p compared to the second-best KD models on Cora, Citeseer, and Pubmed, respectively. On ogbn-proteins, MustaD presents a 1.55%p improvement in terms of AUC-ROC over the second-best KD model.

The code and the datasets are available at https://github.com/snudatalab/MustaD.

Related work

Many complex and deep networks have been proposed to solve real-world tasks such as text classification [22], malware detection [23], in-vehicle intrusion attack detection [24], and web document classification [25]. In particular, several deep graph convolutional networks (GCNs) have been proposed to handle real-world graphs [7–13, 26, 27]. However, it is difficult to use these models in environments with limited computing resources. Therefore, many Knowledge Distillation (KD) methods have been studied to compress a large teacher model into a smaller student model by extracting compact and useful information [17, 18, 28–30]. Although those methods improve the efficiency of compression, they are designed only for data in a grid domain; it is hard to directly apply them to data in a non-grid domain such as graphs.

In this section, we discuss related works on deep GCN and KD methods. Table 1 summarizes the symbols used in this paper.

Table 1. Table of symbols.

Symbol | Definition
G = (V, E) | input graph; V: node set, E: edge set
N | number of nodes
d | input feature dimension
X ∈ R^{N×d} | input feature matrix
x_i ∈ R^d | input feature vector of node i
H^{(l)} ∈ R^{N×d} | hidden feature embedding matrix of the l-th GCN layer; its i-th row h_i^{(l)} ∈ R^d is the embedding of node i
h_{i;t} ∈ R^d | teacher's hidden embedding of node i
h_{i;s} ∈ R^d | student's hidden embedding of node i
N_i | set of one-hop neighbors of node i in G
Emb(⋅) | learnable function that maps a given feature onto a new embedding space
Aggregation(⋅) | aggregation function that aggregates hidden features from one-hop neighbors
K | number of layers
GCN_s(⋅) | single effective GCN layer in MustaD; shared in the student model
K(⋅) | kernel function
D_KL(⋅) | Kullback–Leibler divergence
p_{i;s} | prediction probability vector of the student
p_{i;t} | prediction probability vector of the teacher
λ_emb | hyperparameter for the embedding loss
λ_pred | hyperparameter for the prediction loss
α | hyperparameter for the initial residual
β | hyperparameter for the identity mapping

Deep graph convolution network

Since the first GCN was proposed in [1], many convolution-based graph neural networks have been proposed [2–4]. In GCNs, a convolution layer aggregates feature information from one-hop neighbors, and multiple convolution layers aggregate feature information from multi-hop neighbors. Recently, many deep GCNs have been studied to consider the multi-hop feature information [9–12, 27].

ResGCN [7] borrows residual/dense connections and dilated convolutions from CNNs, and adapts them to GCN architectures. GEN [8] is a complementary version of [7]. The model uses a modified graph skip connection which is a pre-activation version of residual connections in ResGCN.

GCNII [13] extends the vanilla GCN model to overcome the over-smoothing problem analyzed in [31]; [31] observes that given a renormalized graph convolution matrix P̃ and an input feature matrix X, a K-layer vanilla GCN simulates a fixed K-order polynomial filter P̃^K X, and the over-smoothing problem occurs when P̃^K X converges to a distribution that does not carry the information of X. To overcome the over-smoothing problem, GCNII introduces initial residual and identity mapping techniques to the vanilla GCN. The initial residual constructs a skip connection from the input layer, thus ensuring that the final representation of each node retains at least a fraction of X. The identity mapping merely transfers the aggregated features to the next GCN layer without any parameterized embedding process. Each GCNII layer is characterized as:

H^{(l+1)} = \sigma\Big( \big( (1-\alpha_{l+1}) \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} + \alpha_{l+1} X \big) \big( (1-\beta_{l+1}) I_N + \beta_{l+1} W^{(l+1)} \big) \Big) (1)

where H^{(l)} corresponds to the l-th hidden feature representation, Ã represents the normalized adjacency matrix, D̃ represents the degree matrix of Ã, I_N denotes the identity matrix, and σ denotes the activation function. α_{l+1} and β_{l+1} are two hyperparameters, where α_{l+1} controls the strength of the connection from the initial feature X to the (l + 1)-th GCN layer, and β_{l+1} controls the degree to which the aggregated features are merely transferred to the next GCN layer without any parameterized embedding.
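For illustration, the following is a minimal PyTorch sketch of one GCNII-style propagation step following Eq (1); it is not the authors' released code. Here norm_adj stands for a precomputed dense D̃^{-1/2} Ã D̃^{-1/2}, and the class name, the bias-free linear layer, and the dense matrix product are simplifying assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class GCNIILayer(nn.Module):
    """One GCNII-style propagation step following Eq (1) (illustrative sketch)."""
    def __init__(self, dim, alpha, beta):
        super().__init__()
        self.linear = nn.Linear(dim, dim, bias=False)   # plays the role of W^{(l+1)}
        self.alpha, self.beta = alpha, beta

    def forward(self, h, x0, norm_adj):
        # norm_adj: dense normalized adjacency with self-loops
        # x0: the initial feature matrix X in Eq (1); shares the hidden dimension of h
        support = (1 - self.alpha) * (norm_adj @ h) + self.alpha * x0       # initial residual
        out = (1 - self.beta) * support + self.beta * self.linear(support)  # identity mapping
        return torch.relu(out)
```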

Although many deep GCN models improve their performance by considering multi-hop features in graphs, it is difficult to use them in environments with limited computing resources, such as mobile or embedded systems, due to their large model sizes. In this paper, we concentrate on compressing a deep GCN into a shallow GCN while preserving the multi-hop feature aggregation property of deep GCNs.

Knowledge distillation

Knowledge Distillation (KD) [17] transfers knowledge from a large teacher model into a smaller student model so that the student performs as well as the teacher. In this method, the task predictions of the teacher are smoothed by the softmax function, and the knowledge is distilled by making the task predictions of the student similar to those of the teacher. Several KD methods distill not only the output of teachers but also the information of intermediate hidden layers [18, 19]. [20] introduces intermediate-level hints from hidden layers of a teacher to guide a student to learn intermediate representations of the teacher. However, those methods aim to compress a wide and shallow teacher model into a thin and shallow student model; i.e., they do not focus on compressing a deep teacher GCN model into a shallow GCN model. Thus, they have limitations in compressing multiple GCN layers into a few GCN layers.

Recently, to the best of our knowledge, the first KD method for GCNs, based on a Local Structure Preserving (LSP) module, was proposed in [21]. In the module, topological semantics from both the teacher and the student are extracted as distributions, and topology-aware knowledge transfer is done by minimizing the distance between these distributions. However, LSP transfers only the intermediate knowledge and does not consider task predictions, which are specifically designed for the objective task. Furthermore, LSP does not consider the teacher's knowledge on multi-hop feature aggregation in a student although the process is essential in a deep GCN. Therefore, its ability to preserve accuracy is limited, especially for compressing a deep GCN.

Proposed method

In this section, we propose MustaD (Multi-Staged Knowledge distillation), a novel approach for effectively compressing a deep GCN by distilling multi-staged knowledge from a teacher.

We summarize the challenges and our ideas in developing our distillation method while preserving the multi-hop feature aggregation of the deep-layered teacher.

  1. When compressing a deep teacher GCN model to a small student GCN model by distilling knowledge from the teacher model, it is essential to conserve the multi-hop feature aggregation of the deep model as the aggregation is the key purpose of stacking multiple GCN layers. We propose to use a single effective layer that imitates the K GCN layers in the teacher model by a single GCN layer in the student while preserving the multi-hop feature aggregation process and reducing the model size significantly.

  2. It is also important to decide what knowledge to be distilled to preserve the performance of the teacher model in the student model. We propose multi-staged knowledge distillation that distills not only the knowledge of the teacher model’s task predictions but also its final hidden embeddings to the student model. By distilling the knowledge of the final hidden embeddings, the student model generates its final representation similar to that of the teacher model; thus, the multi-staged knowledge distillation helps the single effective layer imitate the multiple GCN layers.

First, we describe how to preserve the multi-hop feature aggregation of the teacher model in a single effective student network, based on the observation of the fundamental mechanism of deep GCNs. Then we describe the knowledge distillation of embeddings as well as task predictions, followed by the explanation of the final loss function that jointly trains all of them for the node classification task. Finally, we give a spectral analysis of MustaD when distilling the knowledge of a GCNII teacher model to strengthen the theoretical background of our method.

Preserving multi-hop feature aggregation

We describe how MustaD preserves the feature aggregation procedure of deep GCN layers of the teacher in a single GCN layer of the student. The main purpose of deep GCN is to consider multi-hop neighbors using multiple GCN layers. Let G=(V,E) denote an input graph where V and E denote the sets of nodes and edges, respectively. Given the graph G, a GCN layer is expressed by

h_i^{(k+1)} = \mathrm{GCN}^{(k+1)}\big(h_i^{(k)}\big) := \mathrm{Aggregation}_{j \in \mathcal{N}_i \cup \{i\}}\Big( \mathrm{Emb}_{k+1}\big(h_j^{(k)}\big) \Big) (2)

where h_i^{(k)} denotes the hidden feature embedding of node i in the k-th GCN layer, N_i denotes the set of one-hop neighbors of node i in G, and Emb_k(⋅) is a learnable function that maps a given feature onto a new embedding space, which is used in the k-th GCN layer. According to Eq (2), a GCN layer aggregates hidden features from one-hop neighbors to obtain new hidden features by Aggregation(⋅). Thus, when a model uses K GCN layers, it aggregates hidden features from up to K-hop neighbors.

Given a teacher model having K GCN layers, our MustaD preserves the process by imitating the teacher’s multi-hop feature aggregation in a single effective layer which is represented by the following equation:

h_i^{(k+1)} = \mathrm{GCN}_s\big(h_i^{(k)}\big) \quad \text{for } k = 1, 2, \ldots, K (3)

where GCN_s(⋅) indicates a shared GCN layer in the student model, and h_i^{(k)} denotes the hidden embedding of node i at the k-th iteration in the student. In other words, MustaD repeats GCN_s(⋅) K times in the student model to imitate the teacher's multi-hop aggregation as shown in Fig 1. Thus, our model reduces the number of model parameters by compressing multiple GCN layers into a single layer while effectively considering multi-hop feature aggregation.
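To make the single effective layer concrete, the following is a minimal PyTorch sketch of a student that applies one shared GCN-style layer K times over a precomputed normalized adjacency matrix, mirroring Eq (3). The class and argument names are illustrative, and the layer is a plain aggregate-then-transform step rather than the exact layer used in the released code.

```python
import torch
import torch.nn as nn

class SingleEffectiveGCN(nn.Module):
    """Student that imitates K teacher GCN layers with one shared (effective) layer."""
    def __init__(self, in_dim, hidden_dim, num_classes, K):
        super().__init__()
        self.embed = nn.Linear(in_dim, hidden_dim)        # initial feature transform
        self.shared = nn.Linear(hidden_dim, hidden_dim)   # the single effective GCN layer
        self.classify = nn.Linear(hidden_dim, num_classes)
        self.K = K

    def forward(self, x, norm_adj):
        # norm_adj: dense normalized adjacency matrix (with self-loops)
        h = torch.relu(self.embed(x))
        for _ in range(self.K):                           # repeat the same layer K times (Eq 3)
            h = torch.relu(self.shared(norm_adj @ h))
        return self.classify(h), h                        # class logits and last hidden embedding
```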

Distilling knowledge from trained deep GCNs

MustaD distills the teacher’s multi-staged knowledge of embeddings and task predictions to the student as depicted in Fig 1.

Distilling knowledge of embeddings

MustaD distills the last hidden embeddings after K-hop aggregations of the teacher into the student. This distillation guides the student to follow the teacher’s behavior more carefully. The main idea for the distillation is to make embeddings of both the teacher and the student similar by minimizing the following loss function:

\mathcal{L}_{emb} = \mathrm{mean}_{i \in V}\Big( \mathcal{K}\big(\tilde{h}_{i;s}, h_{i;t}\big) \Big) (4)

where h_{i;t} is the teacher's last hidden embedding of node i, \tilde{h}_{i;s} = W_s h_{i;s} where h_{i;s} is the student's last embedding of node i, and W_s is a learnable weight matrix used to match the dimension between the teacher and the student. The matching layer is omitted if they have the same hidden dimension. K(⋅) is a kernel function to measure the distance between the two given embedding vectors, and any distance metric can be used. In this work, we investigate the effect of kernel functions among the following metrics:

\mathcal{K}\big(\tilde{h}_{i;s}, h_{i;t}\big) =
\begin{cases}
\| \tilde{h}_{i;s} - h_{i;t} \|_p & \text{(distance-based kernel)} \\
\tilde{h}_{i;s}^{\top} h_{i;t} & \text{(linear kernel)} \\
\big( \tilde{h}_{i;s}^{\top} h_{i;t} + c \big)^{d} & \text{(polynomial kernel)} \\
\exp\!\big( -\frac{\| \tilde{h}_{i;s} - h_{i;t} \|_2^2}{2\sigma^2} \big) & \text{(RBF kernel)} \\
\sum_j \tilde{h}'_{i,j;s} \log\big( \tilde{h}'_{i,j;s} / h'_{i,j;t} \big) & \text{(KL divergence-based kernel)}
\end{cases} (5)

where h'_{i,j;t} and \tilde{h}'_{i,j;s} denote the j-th elements of h'_{i;t} = Softmax(h_{i;t}) and \tilde{h}'_{i;s} = Softmax(\tilde{h}_{i;s}), respectively.
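As a rough illustration of Eq (4), the sketch below computes L_emb from the student and teacher embeddings using two of the kernels in Eq (5): the distance-based kernel with p = 2 and the KL divergence-based kernel. The function name and the optional dimension-matching matrix W_s are illustrative, not taken from the released code.

```python
import torch.nn.functional as F

def embedding_kd_loss(h_s, h_t, W_s=None, kernel="kl"):
    """L_emb of Eq (4): mean kernel distance between student and teacher last hidden embeddings."""
    h_s = h_s if W_s is None else h_s @ W_s              # optional dimension matching
    if kernel == "l2":                                   # distance-based kernel with p = 2
        return (h_s - h_t).norm(dim=1).mean()
    # KL divergence-based kernel of Eq (5): KL(Softmax(student) || Softmax(teacher))
    log_p_s = F.log_softmax(h_s, dim=1)
    log_p_t = F.log_softmax(h_t, dim=1)
    return (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=1).mean()
```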

Distilling knowledge of predictions

Distilling the knowledge of task predictions follows the process proposed in [17] that minimizes the following loss function:

\mathcal{L}_{pred} = \mathrm{mean}_{i \in V}\Big( D_{KL}\big(p_{i;s} \,\|\, p_{i;t}\big) \Big) (6)

where D_{KL}(⋅) is the Kullback–Leibler divergence, p_{i;s} denotes the prediction probability vector of the student after passing through a softmax function, and p_{i;t} denotes that of the teacher after passing through a softmax function conditioned with temperature T [17]. The distillation of task prediction guides the student to obtain predictive outputs similar to those of the teacher.
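A minimal sketch of Eq (6) follows; the temperature value T = 2 is only an assumed example, since the temperature is a tunable hyperparameter of [17].

```python
import torch.nn.functional as F

def prediction_kd_loss(student_logits, teacher_logits, T=2.0):
    """L_pred of Eq (6): D_KL(p_s || p_t) with the teacher softened by temperature T."""
    log_p_t = F.log_softmax(teacher_logits / T, dim=1)    # teacher prediction with temperature
    log_p_s = F.log_softmax(student_logits, dim=1)        # student prediction
    p_s = log_p_s.exp()
    return (p_s * (log_p_s - log_p_t)).sum(dim=1).mean()  # mean over nodes
```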

Final loss function for node classification

The student model aims to solve the node classification task like the teacher model does. Thus, the student model directly learns the task as well as the aforementioned distillations by minimizing the following cross entropy loss:

\mathcal{L}_{ce} = -\sum_{i \in V^{*}} \sum_{j \in C} y_{ij} \log p_{ij;s} (7)

where V* is the set of labeled nodes, and C is the set of labels. y_{ij} is an indicator that is 1 if node i belongs to label j, and 0 otherwise. p_{ij;s} is the probability that node i belongs to label j, as predicted by the student. Note that Eq (7) assumes that each node belongs to only one class. If a node has multiple labels (i.e., multi-labeled node classification), we use the binary cross entropy loss instead.

To jointly train for all of the aforementioned aspects, MustaD minimizes the following final loss:

\mathcal{L} = \mathcal{L}_{ce} + \lambda_{emb}\mathcal{L}_{emb} + \lambda_{pred}\mathcal{L}_{pred} (8)

where λ_pred and λ_emb are hyperparameters that balance the proposed loss terms.
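Putting the pieces together, the sketch below assembles Eq (8) for one training step of the student, reusing the loss sketches above; the default λ values follow the Cora setting reported later (λ_pred = 1, λ_emb = 0.01), and for multi-labeled classification the cross-entropy term would be replaced with binary cross entropy.

```python
import torch.nn.functional as F

def mustad_loss(student, x, norm_adj, labels, labeled_mask,
                teacher_logits, teacher_hidden,
                lambda_emb=0.01, lambda_pred=1.0):
    """Final loss of Eq (8), combining Eq (7), Eq (4), and Eq (6) (illustrative sketch)."""
    logits, hidden = student(x, norm_adj)                                    # single-effective-layer student
    loss_ce = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])    # Eq (7), labeled nodes only
    loss_emb = embedding_kd_loss(hidden, teacher_hidden, kernel="kl")        # Eq (4)
    loss_pred = prediction_kd_loss(logits, teacher_logits)                   # Eq (6)
    return loss_ce + lambda_emb * loss_emb + lambda_pred * loss_pred         # Eq (8)
```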

Spectral analysis of MustaD

Spectral graph methods have become fundamental tools in the analysis of large networks [32–34]. GCN [1] has attracted a lot of attention due to its successful implementation of graph convolution, defined on a spectral domain, as a simple matrix multiplication, thus achieving superior performance compared to other models. In this section, we first give a brief interpretation of a K-layer GCN on the spectral domain. Then we give a spectral analysis of MustaD when distilling the knowledge of a GCNII teacher model, comparing the expressiveness of our MustaD to that of a K-layer GCN on the spectral domain.

Consider an adjacency matrix Ã ∈ R^{N×N} of a graph with self-loops, and a graph signal x ∈ R^N, which is a set of values residing on the set of nodes, where N is the number of nodes. A polynomial filter of order K on the graph signal x is defined as

\text{$K$-order polynomial filter on } x = \Big( \sum_{k=0}^{K} \theta_k \tilde{L}^{k} \Big) x. (9)

where L̃ = I_N − D̃^{−1/2} Ã D̃^{−1/2} is the normalized Laplacian matrix of Ã, and θ_k ∈ R are the polynomial coefficients. D̃ and I_N ∈ R^{N×N} represent the degree matrix of Ã and the identity matrix, respectively. [31] proves that a K-layer GCN simulates a polynomial filter of order K with dependent coefficients θ_k, which is the interpretation of a K-layer GCN on the spectral domain. We show that the student distilled by our proposed MustaD also simulates the K-order polynomial filter with inter-dependent coefficients using only a linear transformation layer and a single effective layer, and therefore has a similar expressiveness to the K-layer GCN.

Each layer of a teacher that uses GCNII architecture is represented as follows:

H^{(l+1)} = \sigma\Big( \big( (1-\alpha_{l+1}) \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} + \alpha_{l+1} X \big) \big( (1-\beta_{l+1}) I_d + \beta_{l+1} W^{(l+1)} \big) \Big) (10)

where X ∈ R^{N×d} and σ denote the input feature matrix and the activation function (ReLU), respectively. α_{l+1} ∈ R and β_{l+1} ∈ R are two hyperparameters. W^{(l+1)} ∈ R^{d×d} represents a learnable weight matrix in the (l + 1)-th GCN layer. H^{(l)} ∈ R^{N×d} corresponds to the l-th hidden feature representation; i.e., each node has a hidden feature vector of length d. The initial hidden representation H^{(0)} is obtained by a linear transformation of X, expressed by H^{(0)} = X W^{(0)}. Note that the dimensions of the hidden representations of every GCN layer are the same as that of the initial feature vector since there is a residual connection to the input feature matrix X.

As we are dealing with a graph signal x ∈ R^N instead of the input feature matrix X, Eq (10) changes to

h^{(l+1)} = \sigma\Big( \big( (1-\alpha_{l+1}) \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} h^{(l)} + \alpha_{l+1} x \big) \big( (1-\beta_{l+1}) + \beta_{l+1} w_{l+1} \big) \Big) = \sigma\Big( \big( (1-\alpha_{l+1}) \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} h^{(l)} + \alpha_{l+1} x \big) \gamma_{l+1} \Big). (11)

where w_{l+1} ∈ R is a learnable parameter, γ_{l+1} = (1 − β_{l+1}) + β_{l+1} w_{l+1}, and h^{(l)} ∈ R^N represents the l-th hidden feature representation; i.e., each node has a hidden representation of length 1. The initial hidden representation h^{(0)} is obtained by a linear transform of x, expressed by h^{(0)} = x w_0.

Theorem 1. Consider a K-layer GCNII teacher model. A student of the teacher distilled by MustaD expresses a K-order polynomial filter \big( \sum_{k=0}^{K} \theta_k \tilde{L}^{k} \big) with inter-dependent coefficients θ_k for k ∈ {0, ⋯, K} in the following simple form

\theta_k =
\begin{cases}
\gamma (-\gamma)^{k} - \sum_{s=k+1}^{K} \theta_s \binom{s}{k} & \text{where } k \in \{0, 1, \ldots, K-1\} \\
w_0 (-\gamma)^{k} & \text{where } k = K.
\end{cases} (12)

Proof. We consider a weaker version of the teacher model used in [13] by assuming the signal vector x to be non-negative and α_{l+1} = 1/2. Furthermore, we remove the ReLU operation since the input feature x is non-negative, as noted in [13]. Thus, Eq (11) is simplified to the following:

h^{(l+1)} = \sigma\Big( \big( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} h^{(l)} + x \big) \gamma'_{l+1} \Big) = \gamma'_{l+1} \big( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} h^{(l)} + x \big) = \gamma'_{l+1} \big( (I_N - \tilde{L}) h^{(l)} + x \big) (13)

where γ'_{l+1} = γ_{l+1}/2, and L̃ is the normalized Laplacian matrix of the adjacency matrix Ã. Since MustaD uses the repeated single effective layer instead of K distinct GCN layers, we set the γ'_{l+1}'s to a single parameter γ for l ∈ {0, ⋯, K − 1}. Consequently, recursive computation of Eq (13) yields the final representation h^{(K)} of the single effective layer as follows:

h^{(K)} = \Bigg( \sum_{l=0}^{K} \Big( \prod_{k=K-l}^{K} \gamma'_k \Big) (I_N - \tilde{L})^{l} \Bigg) x (14)

where γ'_k = γ for k ∈ {1, ⋯, K}, and γ'_0 = w_0.

On the other hand, a K-order polynomial filter of an adjacency matrix A˜ on a graph signal x is expressed by the equation below:

\Big( \sum_{k=0}^{K} \theta_k \tilde{L}^{k} \Big) x = \Big( \sum_{k=0}^{K} \theta_k \big( I_N - (I_N - \tilde{L}) \big)^{k} \Big) x = \Bigg( \sum_{k=0}^{K} \theta_k \Big( \sum_{l=0}^{k} (-1)^{l} \binom{k}{l} (I_N - \tilde{L})^{l} \Big) \Bigg) x = \Bigg( \sum_{l=0}^{K} \Big( \sum_{k=l}^{K} \theta_k (-1)^{l} \binom{k}{l} \Big) (I_N - \tilde{L})^{l} \Bigg) x. (15)

To show that the student of a K-layer GCNII teacher distilled by MustaD expresses a K-order polynomial filter with inter-dependent coefficients, we prove that all θ_k's for k ∈ {0, 1, ⋯, K} in Eq (15) are expressed by w_0 and γ. Specifically, we show that all θ_k's in the following equation

\prod_{k=K-l}^{K} \gamma'_k = \sum_{k=l}^{K} \theta_k (-1)^{l} \binom{k}{l} (16)

are expressed by w_0 and γ for all k ∈ {0, 1, ⋯, K}, where γ'_k = γ for k ∈ {1, ⋯, K}, γ'_0 = w_0, and l ∈ {0, 1, ⋯, K}. When l = K, θ_K is expressed by w_0 and γ as follows:

\theta_K = w_0 (-\gamma)^{K}. (17)

Recursive computations express all θ_k's by w_0 and γ as follows:

\theta_{K-1} = \gamma(-\gamma)^{K-1} - \theta_K \binom{K}{K-1}, \quad
\theta_{K-2} = \gamma(-\gamma)^{K-2} - \theta_{K-1} \binom{K-1}{K-2} - \theta_K \binom{K}{K-2}, \quad \ldots, \quad
\theta_0 = \gamma - \theta_1 - \theta_2 - \cdots - \theta_K. (18)

In conclusion, a general expression of θk in Eq (16) is expressed by

\theta_k =
\begin{cases}
\gamma (-\gamma)^{k} - \sum_{s=k+1}^{K} \theta_s \binom{s}{k} & \text{where } k \in \{0, 1, \ldots, K-1\} \\
w_0 (-\gamma)^{k} & \text{where } k = K
\end{cases} (19)

which is our desired objective.
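The claim can also be checked numerically. The following NumPy sketch builds a small random graph with assumed toy values of K, γ, and w_0, applies the simplified (linear) effective layer of Eq (13) K times, and compares the result against the K-order polynomial filter whose coefficients follow Eq (19); the two outputs coincide up to floating-point error.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
N, K, gamma, w0 = 8, 5, 0.7, 1.3                        # toy graph size and assumed parameters
A = (rng.random((N, N)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T + np.eye(N)              # symmetric adjacency with self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
P = D_inv_sqrt @ A @ D_inv_sqrt                         # renormalized propagation matrix
L = np.eye(N) - P                                       # normalized Laplacian
x = rng.random(N)                                       # non-negative graph signal

# (a) repeated single effective layer, Eq (13) with a shared parameter gamma
h = w0 * x                                              # h^(0) = x * w_0
for _ in range(K):
    h = gamma * (P @ h + x)

# (b) K-order polynomial filter with coefficients from Eq (19)
theta = np.zeros(K + 1)
theta[K] = w0 * (-gamma) ** K
for k in range(K - 1, -1, -1):
    theta[k] = gamma * (-gamma) ** k - sum(theta[s] * comb(s, k) for s in range(k + 1, K + 1))
h_poly = sum(theta[k] * np.linalg.matrix_power(L, k) for k in range(K + 1)) @ x

print(np.allclose(h, h_poly))                           # True: both computations agree
```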

Experiments

We perform experiments to answer the following questions.

  • Q1. Prediction Accuracy. How well does our MustaD preserve the multi-hop feature aggregation of a deep teacher model compared to other KD methods?

  • Q2. Parameters vs. Performance. What is the trade-off between the number of parameters and the accuracy in student models?

  • Q3. Ablation Study. How effectively do the multi-staged distillation and the single effective layer help a student conserve the teacher's performance?

Experimental setup

Dataset

We use four graph datasets as summarized in Table 2. Cora, Citeseer, and Pubmed are citation datasets where nodes and edges represent documents and citations, respectively. Each node feature indicates whether a word is included in each document. The ogbn-proteins dataset is an undirected and weighted graph where nodes represent proteins and edges represent different types of biological associations between proteins. An edge in the graph has an 8-dimensional feature, and a node has an 8-dimensional one-hot feature indicating which species the corresponding protein comes from.

Table 2. Dataset statistics.
Dataset | Classes | Nodes | Edges | Features
Cora [35] | 7 | 2,708 | 5,429 | 1,433
Citeseer [35] | 6 | 3,327 | 4,732 | 3,703
Pubmed [35] | 3 | 19,717 | 44,338 | 500
ogbn-proteins [36, 37] | 112 | 132,534 | 39,561,252 | 8

Teacher models

We perform the distillation from two different teacher models. The first teacher model, GCNII [13], uses initial residual and identity mapping techniques and achieves state-of-the-art performance on Cora, Citeseer, and Pubmed. We compress the trained GCNII teacher on those three datasets. The second teacher model, GEN [8], proposes generalized message aggregators and pre-activation residual connections; GEN achieves a good performance on the ogbn-proteins dataset. We perform distillation from a trained GEN teacher on the ogbn-proteins dataset. When reproducing the teacher models, experimental settings such as data split, optimizer, regularization, activation functions, and hyperparameters follow those of [8, 13] unless explicitly stated.

Competitors

We compare MustaD with the following competitors:

  • KD [17] is the original knowledge distillation method. It softens the task predictions of the teacher and distills the knowledge of classes to a student. We denote the student distilled by this method as Student_KD. The final loss is:
    \mathcal{L} = \mathcal{L}_{ce} + \lambda_{pred}\mathcal{L}_{pred} (20)
  • LSP [21] distills an embedded topological structure of a teacher and achieved the best performance on graph-structured datasets. We denote the student trained by this method as Student_LSP. The final loss function is computed by:
    \mathcal{L} = \mathcal{L}_{ce} + \lambda_{LSP}\mathcal{L}_{LSP} (21)

All methods are implemented in PyTorch and PyTorch Geometric [38]. We use a machine with an Intel E5-2630 v4 2.2GHz CPU and a GeForce RTX 2080 Ti for the experiments.

Semi-supervised node classification

Cora, Citeseer, and Pubmed

We perform distillation on trained GCNII models in Cora, Citeseer, and Pubmed. In particular, we perform KD from teachers with varying numbers of layers to show how well MustaD preserves the multi-hop feature aggregation of the teacher. When reproducing teacher models, we use the same settings as [13]. When training students, the early stopping patience is increased from 100 epochs to 200 epochs to obtain more stable results. Student_Base is a model trained with the ground truth labels without the teacher. Student_MustaD is our distilled student that has the same hidden feature dimension as the teacher. We train Student_KD with λ_pred of 0.1, 0.1, and 100 on Cora, Citeseer, and Pubmed, respectively. For Student_LSP, we set λ_LSP to 10 on both Cora and Citeseer. Student_LSP fails to be trained on Pubmed since every training node has only one neighbor, which means there is no local structure to be distilled [13]. Other hyperparameters for each competitor are tuned to obtain the best results on the validation set. For our Student_MustaD, we set λ_pred to 1, 0.1, and 100, and λ_emb to 0.01, 0.01, and 10 on Cora, Citeseer, and Pubmed, respectively, and use the KL divergence-based kernel.

Table 3 shows the overall results on node classification in terms of mean accuracy over 50 runs. Note that our MustaD gives the best performance in terms of accuracy. In particular, Student_MustaD presents a 3.77 ∼ 4.21%p improvement over the second-best model with a 3.00 ∼ 6.04× smaller model size than the best teacher. Furthermore, the performance of the proposed MustaD increases as the number of layers in the teacher increases, unlike other KD methods. This indicates that MustaD preserves the aggregation process successfully whereas others do not. It also implies that MustaD gains more knowledge from the given input features when more GCN layers are used in the teacher.

Table 3. Semi-supervised node classification accuracy for Cora, Citeseer, and Pubmed.

We perform the distillation from trained teachers with various numbers of GCN layers: 2, 4, 8, 16, 32, and 64. Student_MustaD is our distilled student that has the same hidden feature dimension as the teacher. Note that MustaD consistently outperforms other KD methods while preserving the multi-hop feature aggregation of the deep teacher.

Data | Model | Number of Parameters | Accuracy by number of GCN layers in the teacher (2 / 4 / 8 / 16 / 32 / 64)
Cora | Teacher [13] | 354K (64 layers) | 81.83 / 82.92 / 84.13 / 84.56 / 85.28 / 85.34
Cora | Student_Base [13] | 96K | 79.71 / 79.71 / 79.71 / 79.71 / 79.71 / 79.71
Cora | Student_KD [17] | 96K | 80.05 / 80.12 / 80.31 / 80.54 / 80.76 / 79.41
Cora | Student_LSP [21] | 96K | 80.02 / 79.88 / 79.96 / 79.99 / 80.02 / 80.33
Cora | Student_MustaD | 96K | 82.35 / 82.33 / 82.92 / 84.58 / 84.52 / 84.71
Citeseer | Teacher [13] | 3,047K (32 layers) | 67.62 / 68.13 / 70.77 / 72.87 / 72.89 / 72.71
Citeseer | Student_Base [13] | 1,015K | 67.82 / 67.82 / 67.82 / 67.82 / 67.82 / 67.82
Citeseer | Student_KD [17] | 1,015K | 68.21 / 68.03 / 68.35 / 68.92 / 68.87 / 69.06
Citeseer | Student_LSP [21] | 1,015K | 68.32 / 68.26 / 68.27 / 68.29 / 68.36 / 68.21
Citeseer | Student_MustaD | 1,015K | 67.10 / 66.72 / 66.45 / 69.55 / 71.79 / 72.83
Pubmed | Teacher [13] | 1,178K (16 layers) | 78.59 / 77.94 / 78.13 / 80.35 / 79.95 / 79.96
Pubmed | Student_Base [13] | 195K | 75.61 / 75.61 / 75.61 / 75.61 / 75.61 / 75.61
Pubmed | Student_KD [17] | 195K | 75.71 / 75.87 / 76.01 / 76.03 / 75.84 / 75.98
Pubmed | Student_LSP [21] | 195K | - / - / - / - / - / -
Pubmed | Student_MustaD | 195K | 76.01 / 78.42 / 78.75 / 79.69 / 79.73 / 80.24

In Citeseer and Pubmed, MustaD achieves the best performance when the student imitates 64 GCN layers of the teacher, although the performance of the teacher decreases when more than 32 and 16 layers are stacked, respectively. This indicates that MustaD enables the student to aggregate information from farther nodes than the teacher does. If the accuracy of the teacher is too low, it is hard for our student to consistently show remarkable performance, since MustaD aims to preserve the accuracy of deep teachers. However, the ability of MustaD to aggregate information from farther nodes than the teacher relieves the student's strong dependence on the performance of the teacher.

ogbn-proteins

We perform knowledge distillation using a trained GEN teacher model on the ogbn-proteins dataset. Since the ogbn-proteins graph is dense and large, full-batch training is not easy. We apply a random node sampler to generate batches for both mini-batch training and testing. Following [8], we set each batch to one subgraph. Thus, as the number of batches increases, the size of the subgraph in each batch decreases, which leads to decreased performance. We increase the number of batches from 10 to 40 to fit the large graph into our GPU (GeForce RTX 2080 Ti with 11GB of memory), whereas [8] uses an NVIDIA V100 with 32GB of memory. As a result, the reproduced teacher achieves its best performance with 28 layers, whereas [8] achieves that with 112 layers. Without loss of generality, we perform the distillation on the reproduced teacher and validate our MustaD against the other methods.
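For reference, a minimal sketch of the random node batching described above: node indices are randomly partitioned into B parts, and each part induces one subgraph that serves as a mini-batch. The helper name is illustrative; the actual implementation may instead rely on the samplers provided by PyTorch Geometric.

```python
import torch

def random_node_batches(num_nodes, num_batches, generator=None):
    """Randomly partition node indices into num_batches parts; each part induces one subgraph batch."""
    perm = torch.randperm(num_nodes, generator=generator)
    return torch.chunk(perm, num_batches)

# Example: 40 batches over the ogbn-proteins nodes
# for node_idx in random_node_batches(132534, 40):
#     ...  # induce the subgraph on node_idx and run one training/evaluation step
```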

We evaluate the performance of each method on a multi-labeled node classification task in a semi-supervised setting. We train Student_KD with λ_pred of 0.1 and Student_LSP with λ_LSP of 10. For the competitors, all hyperparameters are tuned to obtain the best results on the validation set. For our model, we set λ_pred to 0.1, λ_emb to 0.01, and use the KL divergence-based kernel.

Table 4 summarizes the results in terms of AUC-ROC. MustaD presents a 1.55%p improvement in AUC-ROC over the second-best KD model while requiring 11.41× fewer parameters than the teacher. Tables 3 and 4 show that our MustaD achieves the state-of-the-art performance with various teacher models.

Table 4. Multi-labeled node classification performance (AUC-ROC) on ogbn-proteins.

The distillations are done from trained teachers with different numbers of GCN layers: 3, 7, 14, 28, and 56. Note that the proposed method Student_MustaD provides the best performance among the student models.

Model | Number of Parameters | AUC-ROC by number of GCN layers in the teacher (3 / 7 / 14 / 28 / 56)
Teacher [8] | 483K (28 layers) | 0.819 / 0.829 / 0.835 / 0.837 / 0.837
Student_Base [8] | 42K | 0.797 / 0.797 / 0.797 / 0.797 / 0.797
Student_KD [17] | 42K | 0.801 / 0.805 / 0.808 / 0.803 / 0.805
Student_LSP [21] | 42K | 0.798 / 0.799 / 0.798 / 0.799 / 0.798
Student_MustaD | 42K | 0.811 / 0.819 / 0.821 / 0.823 / 0.820

Parameters vs. performance

We perform a parameter study to show the trade-off between the number of parameters and accuracy. We change the number of parameters by varying the hidden feature dimension and the number of effective layers in the student. Furthermore, we vary the kernel function used for distilling the knowledge of multi-hop feature representations and evaluate the performance. We analyze Cora with the trained 64-layered GCNII teacher.

Hidden feature dimension

We set the student’s hidden feature dimensions to be the same as that of the teacher in the previous section; it limits the degree of model compression. We study the trade-off between the hidden feature dimension and the accuracy in Table 5. In particular, we vary the feature dimension from 16 to 128.

Table 5. Trade-off between the hidden feature dimension and the accuracy.

Note that the proposed MustaD with the hidden feature dimension of 64 shows the best performance.

Hidden Feature Dimension 16 32 64 128
Number of Parameters 24K 49K 96K 193K
Accuracy (%) 79.13 83.01 84.71 84.42

The table shows that Student_MustaD with the hidden feature dimension of 64 achieves the best performance. Note that setting the same feature dimension for the student as that of the teacher shows the best performance even when a significantly smaller number of layers is used. It is also noteworthy that Student_MustaD with the hidden feature dimension of 32 still shows the best performance among the KD methods shown in Table 3, while requiring 1.96× fewer parameters than the competitors and 7.22× fewer parameters than the teacher. When the feature dimension is set to 128, MustaD shows a lower performance than MustaD with the dimension of 64, due to overfitting.

Number of the effective layers

Our proposed MustaD compresses the hidden GCN layers of a teacher into a single effective layer in a student. We study how the accuracy of MustaD changes as the number of effective layers increases. To increase the number of effective layers, we have to set the number of teacher layers that each effective layer in the student imitates. Let M denote the number of effective layers in the student. We tune the set L, consisting of the number of teacher layers that each effective layer in the student imitates, over [{1, 63}, {32, 32}, {63, 1}] for M = 2, [{1, 22, 41}, {21, 22, 21}, {41, 22, 1}] for M = 3, and [{1, 11, 21, 31}, {16, 16, 16, 16}, {31, 21, 11, 1}] for M = 4.

Fig 3 shows that students having more than one effective layer show a similar performance to the student having a single effective layer. This indicates that a single effective layer in the student is enough to conserve the multi-hop feature aggregation process of the teacher.

Fig 3. Accuracy of MustaD for different numbers of the effective layers in a student.


M represents the number of effective layers in the student, and L corresponds to the set consisting of the number of teacher layers that each effective layer in the student imitates; e.g., M = 2 and L = {1, 63} denotes that the two effective layers in the student imitate one GCN layer and 63 GCN layers of the teacher, respectively. Note that MustaD, which has a single effective layer in the student, is enough to conserve the multi-hop feature aggregation of the teacher.

Kernel functions

MustaD uses various kernel functions (Eq 5) to distill the knowledge of multi-hop feature representations from the teacher. We compare students with different kernel functions on Cora and show the results in Table 6. We set p to 2 for the distance-based kernel. For the polynomial kernel, c and d are set to 2 and 0, respectively. For the RBF kernel, σ is set to 1. In Table 6, 'None' represents the student model without the multi-staged knowledge distillation; i.e., it distills only the task prediction, not the embedding.

Table 6. Accuracy with different kernel functions in the Cora dataset.

Note that KL divergence-based kernel provides the best accuracy, and the student ‘None’ without the embedding distillation shows a poor performance.

Kernel Function None L2 Norm Linear Poly RBF KL Divergence
Accuracy (%) 84.29 84.61 84.60 84.47 84.40 84.71

Note that the ‘None’ student shows a worse performance than students which distill embeddings with kernel functions. Among the kernel functions, KL divergence shows the best accuracy.

Ablation study

We provide ablation studies for the effect of the multi-staged knowledge distillation of the teacher and the single effective layer in the student. The studies are done on three citation datasets: Cora, Citeseer, and Pubmed.

Multi-staged knowledge distillation

MustaD distills a teacher’s knowledge in a multi-staged manner to conserve the accuracy. We show the effect of multi-staged knowledge distillation in Fig 4. Note that Student_MustaD (Without Multi-staged KD) is trained without distilling the knowledge of multi-hop feature representations; i.e., the teacher distills only the knowledge of task prediction to the student.

Fig 4. Accuracy of MustaD without the multi-staged knowledge distillation.


Note that Student_MustaD (Without Multi-staged KD) is trained without distilling the knowledge of multi-hop feature representations; i.e., the teacher distills only the knowledge of task prediction to the student. MustaD with the distillation consistently shows a better performance compared to MustaD without it.

If we distill only the knowledge of task prediction, the teacher's prediction error directly propagates to the student. However, the distillation of multi-hop features compensates for the error, and thus MustaD with the distillation presents a superior performance compared to MustaD without it, as depicted in Fig 4. In other words, the multi-staged knowledge distillation plays a crucial role in acquiring proper knowledge from the teacher.

Single effective layer

MustaD imitates the multi-hop feature aggregation process of a teacher by a single effective layer. We investigate the effect of the single effective layer by comparing the proposed MustaD to a student with a single naive GCN layer.

Fig 5 shows that MustaD without the single effective layer presents significantly lower performance than the original MustaD. Furthermore, the accuracy of MustaD without the effective layer does not improve as the number K of GCN layers in the teacher increases, whereas the performance of MustaD with it improves as K increases. This is because MustaD preserves the teacher's multi-hop feature aggregation, which is the main purpose of the multiple layers in the teacher, with a single effective layer.

Fig 5. Accuracy of MustaD without the single effective layer.


Note that the student without the single effective layer shows significantly lower performance than the original MustaD.

Conclusion

In this work, we have proposed MustaD, an accurate method for compressing deep graph convolution networks (GCNs) by distilling multi-staged knowledge from a teacher. MustaD distills the teacher's knowledge of multi-hop feature aggregation by imitating the multiple GCN layers with a single effective layer in a student, which reduces the model size significantly, and by transferring the final hidden feature embeddings of the teacher to the student. MustaD also distills the knowledge of task prediction by transferring the prediction of the teacher. We give a theoretical analysis of MustaD, comparing the expressiveness of the proposed method to that of a multi-layered GCN on the spectral domain. MustaD achieves the state-of-the-art performance on four real-world datasets compared to other distillation-based GCN compression methods, while preserving the multi-hop feature aggregation of the teacher. Future work includes extending MustaD to consider the semantics of features.

Data Availability

The source code and the datasets are available at https://github.com/snudatalab/MustaD.

Funding Statement

This work was supported in part by IITP grant funded by the Korea government [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)], in part by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) (No.2020-0-00894, Flexible and Efficient Model Compression Method for Various Applications and Environments), and in part by the ICT R&D program of MSIT/IITP (No.2017-0-01772, Development of QA systems for Video Story Understanding to pass the Video Turing Test). The Institute of Engineering Research and ICT at Seoul National University provided research facilities for this work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Kipf TN, Welling M. Semi-Supervised Classification with Graph Convolutional Networks. CoRR. 2016;.
  • 2.Velickovic P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph Attention Networks. CoRR. 2017;.
  • 3.Hamilton WL, Ying Z, Leskovec J. Inductive Representation Learning on Large Graphs. In: Advances in Neural Information Processing Systems; 2017.
  • 4.Kipf TN, Welling M. Variational Graph Auto-Encoders. CoRR. 2016;.
  • 5.Zhang M, Chen Y. Link Prediction Based on Graph Neural Networks. In: NeurIPS; 2018.
  • 6.Dettmers T, Minervini P, Stenetorp P, Riedel S. Convolutional 2D Knowledge Graph Embeddings. In: AAAI; 2018.
  • 7.Li G, Müller M, Thabet AK, Ghanem B. DeepGCNs: Can GCNs Go As Deep As CNNs? In: ICCV; 2019.
  • 8.Li G, Xiong C, Thabet AK, Ghanem B. Deepergcn: All you need to train deeper gcns. CoRR. 2020;.
  • 9.Henaff M, Bruna J, LeCun Y. Deep Convolutional Networks on Graph-Structured Data. CoRR. 2015;.
  • 10.Sun K, Lin Z, Zhu Z. AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models. CoRR. 2019;.
  • 11.Rong Y, Huang W, Xu T, Huang J. DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. In: ICLR; 2020.
  • 12.Rong Y, Huang W, Xu T, Huang J. The Truly Deep Graph Convolutional Networks for Node Classification. CoRR. 2019;.
  • 13.Chen M, Wei Z, Huang Z, Ding B, Li Y. Simple and Deep Graph Convolutional Networks. CoRR. 2020;.
  • 14.Srinivas S, Babu RV. Data-free Parameter Pruning for Deep Neural Networks. In: BMVC; 2015.
  • 15.Tai C, Xiao T, Wang X, E W. Convolutional neural networks with low-rank regularization. In: ICLR; 2016.
  • 16.Han S, Mao H, Dally WJ. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. In: ICLR; 2016.
  • 17.Hinton GE, Vinyals O, Dean J. Distilling the Knowledge in a Neural Network. CoRR. 2015;.
  • 18.Kim Y, Rush AM. Sequence-Level Knowledge Distillation. In: EMNLP; 2016.
  • 19.Chen G, Choi W, Yu X, Han TX, Chandraker M. Learning Efficient Object Detection Models with Knowledge Distillation. In: NIPS; 2017.
  • 20.Romero A, Ballas N, Kahou SE, Chassang A, Gatta C, Bengio Y. FitNets: Hints for Thin Deep Nets. In: ICLR; 2015.
  • 21.Yang Y, Qiu J, Song M, Tao D, Wang X. Distilling Knowledge From Graph Convolutional Networks. In: CVPR; 2020.
  • 22.Srivastava G, Maddikunta PKR, Gadekallu TR. A Two-stage Text Feature Selection Algorithm for Improving Text Classification. ACM Transactions on Asian and Low-Resource Language Information Processing. 2021;20.
  • 23.Imtiaz SI, ur Rehman S, Javed AR, Jalil Z, Liu X, Alnumay WS. DeepAMD: Detection and identification of Android malware using high-efficient Deep Artificial Neural Network. Future Gener Comput Syst. 2021;115:844–856. doi: 10.1016/j.future.2020.10.008
  • 24.Rehman A, Ur Rehman S, Khan M, Alazab M, G TR. CANintelliIDS: Detecting In-Vehicle Intrusion Attacks on a Controller Area Network using CNN and Attention-based GRU. IEEE Transactions on Network Science and Engineering. 2021; p. 1–1.
  • 25.Shankar GS, Palanivinayagam A, Vinayakumar R, Ghosh U, Mansoor W, Alnumay WS. An Embedded-Based Weighted Feature Selection Algorithm for Classifying Web Document. Wirel Commun Mob Comput. 2020;2020:8879054:1–8879054:10.
  • 26.Battaglia PW, Hamrick JB, Bapst V, Sanchez-Gonzalez A, Zambaldi VF, Malinowski M, et al. Relational inductive biases, deep learning, and graph networks. CoRR. 2018;.
  • 27.Zou D, Hu Z, Wang Y, Jiang S, Sun Y, Gu Q. Layer-Dependent Importance Sampling for Training Deep and Large Graph Convolutional Networks. In: NIPS; 2019.
  • 28.Ba J, Caruana R. Do Deep Nets Really Need to be Deep? In: NIPS; 2014.
  • 29.Yim J, Joo D, Bae J, Kim J. A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning. In: CVPR; 2017.
  • 30.Huang Z, Wang N. Like What You Like: Knowledge Distill via Neuron Selectivity Transfer. CoRR. 2017;.
  • 31.Wu F, Jr AHS, Zhang T, Fifty C, Yu T, Weinberger KQ. Simplifying Graph Convolutional Networks. In: ICML; 2019.
  • 32.Sakumoto Y, Kameyama T, Takano C, Aida M. Information Propagation Analysis of Social Network Using the Universality of Random Matrix. IEICE Trans Commun. 2019;.
  • 33.Hammond DK, Vandergheynst P, Gribonval R. Wavelets on Graphs via Spectral Graph Theory. CoRR. 2009;abs/0912.3848.
  • 34.Ng AY, Jordan MI, Weiss Y. On Spectral Clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems 14 (NIPS 2001). MIT Press; 2001. p. 849–856.
  • 35.Sen P, Namata G, Bilgic M, Getoor L, Gallagher B, Eliassi-Rad T. Collective Classification in Network Data. AI Mag. 2008;.
  • 36.Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;. doi: 10.1093/nar/gky1131
  • 37.Consortium TGO. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 2019;. doi: 10.1093/nar/gky1055
  • 38.Fey M, Lenssen JE. Fast Graph Representation Learning with PyTorch Geometric. CoRR. 2019;.

Decision Letter 0

Thippa Reddy Gadekallu

28 May 2021

PONE-D-21-15818

MustaD: Compressing Deep Graph Convolution Network with Multi-Staged Knowledge Distillation

PLOS ONE

Dear Dr. Kang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

ACADEMIC EDITOR:

Based on the comments received from the reviewers and my own observation, I recommend major revisions for the paper. The authors should carefully address all the comments and suggestions from the reviewers. Also, the authors should proofread to polish the English grammar in the paper.

 ==============================

Please submit your revised manuscript by Jul 12 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Thippa Reddy Gadekallu

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1) Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2) Please amend either the abstract on the online submission form (via Edit Submission) or the abstract in the manuscript so that they are identical.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. Does the proposed model has the ability to gain the full knowledge about all the features when there are multiple layers?

2. Does the author considers the semantic problems of features in the proposed method?

3. What happens to the Knowledge Distillation when the teacher’s accuracy is too low? Does the proposed system helps in solving these issues?

4. In the process of KD, Does the training errors in teacher propagate to student directly? Can the proposed system identify the errors?

Please cite the following papers

1. Ashokkumar P, Siva Shankar G, Gautam Srivastava, Praveen Kumar Reddy Maddikunta, and Thippa Reddy Gadekallu. 2021. A Two-stage Text Feature Selection Algorithm for Improving Text Classification. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 20, 3, Article 49 (April 2021), 19 pages. DOI:https://doi.org/10.1145/3425781

2. G. Siva Shankar, P. Ashokkumar, R. Vinayakumar, Uttam Ghosh, Wathiq Mansoor, Waleed S. Alnumay, "An Embedded-Based Weighted Feature Selection Algorithm for Classifying Web Document", Wireless Communications and Mobile Computing, vol. 2020, Article ID 8879054, 10 pages, 2020. https://doi.org/10.1155/2020/8879054

Reviewer #2: - The quality of the figures can be improved more. Figures should be eye-catching. It will enhance the interest of the reader.

The abstract is long and NOT satisfactory. It should contain the following parts:

i. The importance of or motivation for the research.

ii. The issue/argument of the research.

iii. The methodology.

iv. The result/findings.

v. The implications of the result/findings.

- Please highlight the contribution clearly in the introduction

- In the first four paragraphs of literature review section, the authors have presented a good references, but they need to present the recent and most updated references.

- In the literature review section, you need to be consistent in the use of the verb tense, it is common to use the past tenses.

- The summary at the end of the literature review should be focused on the limitations of related work.

- The discussion is very important in research paper. Nevertheless, this section is short and should be presented completely.

- Major contribution was not clearly mentioned in the conclusion part.

- Make sure the Conclusion succinctly summarizes the paper. It should not repeat phrases from the Introduction!

- Authors should add the most recent reference:

1)  CANintelliIDS: Detecting In-Vehicle Intrusion Attacks on a Controller Area Network using CNN and Attention-based GRU

2) DeepAMD: Detection and identification of Android malware using high-efficient Deep Artificial Neural Network, Future Generation Computer Systems 115, 844-856

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.


Author response to Decision Letter 0


6 Jul 2021

Reviewer #1

• (R1-1) Does the proposed model have the ability to gain full knowledge about all the features when there are multiple layers?

– (A1-1) The accuracy of MustaD improves as the number of GCN layers in the teacher increases. This implies that MustaD gains more knowledge from the given input features when more GCN layers are used in the teacher. In other words, MustaD has the ability to gain full knowledge about all the features when there are multiple layers in the teacher. We reported this in lines 334-336 of the experiments section.

• (R1-2) Does the author consider the semantic problems of features in the proposed method?

– (A1-2) Since the main purpose of MustaD is to compress a deep GCN model into a compact model while preserving the multi-hop feature aggregation of the deep model, MustaD is not designed to address the semantic problems of features. We added a discussion of this as future work in the conclusion section (lines 447-448).

• (R1-3) What happens to the Knowledge Distillation when the teacher's accuracy is too low? Does the proposed system help in solving these issues?

– (A1-3) Through the experiments summarized in Table 3, we have shown that MustaD has the ability to aggregate information from farther nodes than the teacher, thus relieving the student's strong dependence on the performance of the teacher. We reported this in lines 341-345 of the experiments section.
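As a minimal sketch of how a compact student can still see multi-hop neighborhood information with only one weight layer, the example below propagates node features K times with a normalized adjacency matrix before applying a single linear transform. This is a generic illustration only, not the authors' MustaD implementation; the class name, parameters, and propagation scheme are assumptions.

```python
import torch
import torch.nn as nn

class SingleLayerKHopStudent(nn.Module):
    """Illustrative K-hop student: propagate features K times with a
    normalized adjacency matrix, then apply one weight layer.
    A generic sketch, not the authors' MustaD model."""

    def __init__(self, in_dim: int, num_classes: int, k_hops: int):
        super().__init__()
        self.k_hops = k_hops
        self.linear = nn.Linear(in_dim, num_classes)

    def forward(self, features: torch.Tensor, norm_adj: torch.Tensor) -> torch.Tensor:
        # Each multiplication by norm_adj mixes in one additional hop of
        # neighborhood information, so the single linear layer still
        # operates on K-hop aggregated features.
        h = features
        for _ in range(self.k_hops):
            h = norm_adj @ h
        return self.linear(h)
```

Because the number of propagation steps is decoupled from the number of trainable layers, such a student can, in principle, aggregate information from more hops than its teacher uses.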

• (R1-4) In the process of KD, do the training errors in the teacher propagate to the student directly? Can the proposed system identify the errors?

– (A1-4) Fig 4 shows that MustaD compensates for the training error that would result from merely propagating the teacher's prediction. We discussed this in lines 420-425 of the experiments section.
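To illustrate why a distilled student need not inherit the teacher's mistakes wholesale, the sketch below shows a standard KD objective that mixes a ground-truth cross-entropy term with a softened teacher-matching term; the true-label term can counteract errors in the teacher's predictions. The function name and the values of alpha and temperature are placeholders, and this is not the exact loss used in the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=2.0):
    """Generic KD objective: the cross-entropy term on the true labels
    keeps the student from simply copying the teacher's mistakes, while
    the KL term transfers the teacher's softened predictions."""
    # Supervised loss on ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Match the teacher's softened prediction distribution.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * (temperature ** 2)
    return (1.0 - alpha) * hard_loss + alpha * soft_loss
```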

• (R1-5) Please cite the following papers: 1) Ashokkumar P, Siva Shankar G, Gautam Srivastava, Praveen Kumar Reddy Maddikunta, and Thippa Reddy Gadekallu. 2021. A Two-stage Text Feature Selection Algorithm for Improving Text Classification. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 20, 3, Article 49 (April 2021), 19 pages. DOI: https://doi.org/10.1145/3425781, and 2) G. Siva Shankar, P. Ashokkumar, R. Vinayakumar, Uttam Ghosh, Wathiq Mansoor, Waleed S. Alnumay, "An Embedded-Based Weighted Feature Selection Algorithm for Classifying Web Document", Wireless Communications and Mobile Computing, vol. 2020, Article ID 8879054, 10 pages, 2020. https://doi.org/10.1155/2020/8879054

– (A1-5) We added the two references (lines 65-67 of the related work section).

Reviewer #2

• (R2-1) The quality of the figures can be improved further. Figures should be eye-catching; this will enhance the reader's interest.

– (A2-1) We improved Figures 1 and 2. In particular, we redrew Fig 1, which previously looked complicated, to make it more informative, and added markups to Fig 2 to make it more eye-catching.

• (R2-2) The abstract is long and NOT satisfactory. It should contain the following parts: 1) The importance of or motivation for the research, 2) The issue/argument of the research, 3) The methodology, 4) The result/findings, and 5) The implications of the result/findings.

– (A2-2) We rewrote the abstract to contain those five parts.

• (R2-3) Please highlight the contribution clearly in the introduction.

– (A2-3) We revised the introduction section to emphasize the contribution of MustaD (lines 32-43 of the introduction section). Furthermore, the contribution list has also been revised to clearly highlight the contributions (lines 54-56 of the introduction section).

• (R2-4) In the first four paragraphs of the literature review section, the authors have presented good references, but they need to present the most recent and up-to-date references.

– (A2-4) We added four up-to-date references (lines 65-67 of the related work section).

• (R2-5) In the literature review section, you need to be consistent in the use of verb tense; it is common to use the past tense.

– (A2-5) We revised the literature review section to use verb tense consistently.

• (R2-6) The summary at the end of the literature review should be focused on the limitations of related work.

– (A2-6) We added the limitations of deep GCNs at the end of the literature review (lines 104-106 of the related work section).

• (R2-7) The discussion is very important in a research paper. Nevertheless, this section is short and should be presented more completely.

– (A2-7) We included a more complete discussion (lines 334-336, 341-345, 380-387, 420-425, and 430-433 of the experiments section).

• (R2-8) Major contribution was not clearly mentioned in the conclusion part.

– (A2-8) We revised the conclusion section to clearly present our major contribution.

• (R2-9) Make sure the Conclusion succinctly summarizes the paper. It should not repeat phrases from the Introduction!

– (A2-9) We revised the conclusion section so that it succinctly summarizes the paper.

• (R2-10) Authors should add the most recent reference: 1) CANintelliIDS: Detecting In-Vehicle Intrusion Attacks on a Controller Area Network using CNN and Attention-based GRU, and 2) DeepAMD: Detection and identification of Android malware using high-efficient Deep Artificial Neural Network, Future Generation Computer Systems 115, 844-856.

– (A2-10) We added the two references (lines 65-67 of the related work section).

Attachment

Submitted filename: rebuttal_letter.pdf

Decision Letter 1

Yuchen Qiu

2 Aug 2021

MustaD: Compressing Deep Graph Convolution Network with Multi-Staged Knowledge Distillation

PONE-D-21-15818R1

Dear Dr. Kang,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Yuchen Qiu, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: I would like to accept this paper now.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Acceptance letter

Yuchen Qiu

5 Aug 2021

PONE-D-21-15818R1

Compressing Deep Graph Convolution Network with Multi-Staged Knowledge Distillation

Dear Dr. Kang:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Yuchen Qiu

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: rebuttal_letter.pdf

    Data Availability Statement

    The source code and the datasets are available at https://github.com/snudatalab/MustaD.

