PLOS One. 2021 Feb 25;16(2):e0244430. doi: 10.1371/journal.pone.0244430

PFP-WGAN: Protein function prediction by discovering Gene Ontology term correlations with generative adversarial networks

Seyyede Fatemeh Seyyedsalehi 1,2, Mahdieh Soleymani 1,*, Hamid R Rabiee 1,*, Mohammad R K Mofrad 2
Editor: Alexandros Iosifidis
PMCID: PMC7906332  PMID: 33630862

Abstract

Understanding the functionality of proteins has emerged as a critical problem in recent years due to the significant roles of these macro-molecules in biological mechanisms. However, in-laboratory techniques for protein function prediction are not as efficient as the methods developed for protein sequencing. While more than 70 million protein sequences are available today, the functionality of only around one percent of them is known. These facts have encouraged researchers to develop computational methods that infer protein functionality from sequence. Gene Ontology, the most well-known database for protein functions, has a hierarchical structure in which deeper terms are more determinative and specific. However, the lack of experimentally approved annotations for these specific terms limits the performance of computational methods applied to them. In this work, we propose a method to improve sequence-based protein function prediction by deeply extracting relationships between Gene Ontology terms. To this end, we construct a conditional generative adversarial network that helps to effectively discover and incorporate term correlations in the annotation process. In addition to the baseline algorithms, we compare our method with two recently proposed deep techniques that attempt to utilize Gene Ontology term correlations. Our results confirm the superiority of the proposed method over the previous works. Moreover, we demonstrate how our model can effectively help to assign more specific terms to sequences.

Introduction

Proteins are one of the most important macro-molecule families in biology. Each protein is responsible for one or more functions in biological pathways, and discovering these functions leads to a deeper understanding of biological mechanisms; this is also critical in designing treatments for various diseases. However, available in-laboratory techniques for discovering protein functions are expensive and time-consuming. Thus, researchers tend to employ computational techniques that infer protein functionality from other biological data sources, including protein structures [1], protein-protein interaction networks [2-4], protein sequences [5], or any combination of them [6-9]. Nowadays, due to improvements in sequencing technologies, a large number of protein sequences are available. The UniProtKB database [10] stores more than 70 million sequences, while the functionality of only around one percent of them is experimentally approved [5]. Experimental methods for gathering other data sources, like interaction networks, are more costly and noisier than current sequencing technologies, and for the majority of proteins the only available data is their sequence [5]. These facts imply the importance of developing sequence-based computational methods for protein function prediction, which is the focus of the current research.

The previous works on this topic can be divided into three categories [11]. The first category contains alignment-based algorithms, which assume that homologous protein sequences have the same functionalities [12-14]. The second group is based on finding specific motifs in sequences; these motifs are functional sites, considered as signatures of particular functions [11]. The last group includes methods based on machine learning that are able to extract meaningful, high-level features from raw sequences. The well-known benchmarks for computational protein function prediction, like the Critical Assessment of Functional Annotation (CAFA) [15], confirm the superiority of machine learning methods over the other categories, because machine learning techniques are capable of extracting higher-level features from raw protein sequences [11].

The most well-known database for annotating proteins is the Gene Ontology (GO) [16], which was introduced in 1998 to describe the functionality of genes and their products, including proteins. This database includes more than 40000 terms in a Directed Acyclic Graph (DAG). A protein can be assigned to more than one GO term; for example, in SwissProt [17], the most important annotated subset of UniProtKB, around 71 GO terms are assigned to each human protein on average [5]. Moreover, in the GO structure, every term is a more specific version of its parents, and whenever a term is assigned to a protein, all of its parents should also be assigned to it. In this context, protein function prediction can be described as a multi-label classification problem in which the DAG structure of GO imposes a redundancy in the label space. Moreover, there are semantic relations between GO terms which can increase the accuracy of an annotating model that incorporates them.

Several works that consider GO term relations in their protein function prediction methods have been introduced [18]. The work in [19] obtains the principal directions in the GO term space by Singular Value Decomposition (SVD) to filter out noisy annotations. CSSAG [20] proposes a greedy hierarchical multi-label classification algorithm that can be used in both tree- and DAG-structured output spaces; to find the optimal solution, CSSAG searches for the best subgraph in the GO hierarchy. Inspired by topic modeling studies in text analysis, the authors of [21] model GO terms as words drawn from particular topics; in fact, they consider these topics as new representations of functions. In [13], a label space dimensionality reduction (LSDR) method that considers both the GO structure and the label distribution is introduced. By incorporating the label distribution in calculating latent representations for GO terms, it is able to consider semantic similarities that cannot necessarily be derived from the GO DAG. GO2Vec [22] exploits a graph embedding algorithm, node2vec [23], to obtain a vector representation for each GO term based on the structural information of the GO graph. The authors in [22] also apply their method to the GOA graph, which includes both term-term relations (from the GO graph) and term-protein relations (from the annotation information). They use these representations to calculate semantic similarities between GO terms and functional similarities between proteins. Onto2Vec [24] constructs a corpus of axioms based on the GO graph; these axioms describe the hierarchical relations in the GO DAG. It then uses the Word2Vec [25] algorithm on this corpus of sentences to find feature vectors for GO terms. Onto2Vec finds feature vectors for proteins either by adding new axioms describing the annotation relations to the corpus or by a linear combination of the feature vectors of the terms in a protein's GO annotations.

On the other hand, some studies have focused on introducing new methods for extracting features from raw protein sequences. The work in [26] extends classical linear discriminant analysis to multi-label problems; the authors find the best subspace that discriminates samples from different classes and exploit it to obtain feature vectors for protein sequences. The recent success of deep learning algorithms in a large number of applications, including bioinformatics [27], has motivated researchers to adopt them for computational protein function prediction. Inspired by natural language processing concepts, the authors of [28] use the Word2Vec [25] algorithm to extract a vector representation for protein sequences; they also experimentally show how their method successfully captures meaningful chemical and physical properties of proteins. In [29], a Long Short-Term Memory (LSTM) deep network is utilized to extract features from protein sequences and classify them into four functional categories. In addition to the power of deep learning models to extract complex features from input samples, the structure of the LSTM allows the model to keep important features across long stretches of sequence. However, none of the aforementioned deep models considers label correlations during feature extraction.

DeepGO [30] attempts to incorporate the structural information of the Gene Ontology into a deep feature extractor by explicitly enforcing the true-path-rule of the GO graph on the output. Its deep network includes an embedding layer followed by convolutional and fully connected layers. At the final step, it defines an architecture of maximization layers that propagates the GO structural information into the final results. However, using successive max layers in the final part of the network may not provide sufficient gradient (during the training process) to impose the structural constraints of the GO terms on the network. To utilize GO term correlations during the training process, the work in [31] employs multi-task deep neural networks for protein function prediction. In this architecture, some layers of the network are shared across the tasks, i.e. different GO terms; these layers help to extract more generalized and meaningful features from proteins. However, the loss function of this method is a sum of the prediction losses over all the tasks and does not include any information about task correlations. In [32], the authors claim that the transformer [33] model can extract more relevant features from amino acid sequences than convolutional layers, because the transformer is able to model all pairwise interactions between the amino acids of a protein sequence. They also show that, by feeding embeddings of GO terms as input, the model is able to extract co-occurrence relations implied by the true-path-rule and use them in its final prediction.

In this paper, we propose to employ a Generative Adversarial Network (GAN) [34] to improve protein function prediction by simultaneously extracting GO term correlations. GANs were initially introduced for training a deep neural network to produce synthetic samples from a desired distribution [34]. These networks generally have two building blocks, a generator and a discriminator, which are trained in an adversarial paradigm. The generator synthesizes samples of a desired distribution, and the discriminator assesses the generator outputs to distinguish them from real samples of the target distribution. During the training process, these two blocks compete against each other until an equilibrium point where the generator can fool the discriminator with its synthetic products. The considerable performance of GANs in the field of image processing [35, 36] has motivated researchers to exploit them for other data types, including biological ones. The works in [37, 38] use GANs to analyze gene expression profiles, and the works in [39, 40] attempt to synthesize genes and promoters with GANs. Recently, the authors of [41] proposed performing data augmentation with a GAN, generating synthetic training samples to improve classifier accuracy for annotating proteins.

Here we learn the mapping from the input protein to a binary vector of annotated GO terms by utilizing a conditional generator, such that the resulting vector cannot be distinguished from valid annotation vectors by a discriminator. The feedback provided by the discriminator during the training process is imposed as a loss function on the deep neural network used to predict protein functions. By simultaneously training the above networks, we learn a loss function customized to the annotating model from the available training data.

Moreover, considering term correlations helps to overcome noisy annotations that may degrade the performance of a prediction model. We show that the proposed method is able to model co-occurrence relations between GO terms that are not necessarily available in the current DAG model. An important issue in computational prediction of protein functions is the shortage of positive samples for terms at deeper levels of the GO DAG, which are more specific and informative. Thus, considering semantic similarities is more critical for deeper GO terms. The proposed method achieves higher accuracy than existing methods with the same amount of training data and decreases the sample complexity of the problem. We demonstrate that the performance gap between the proposed and previous methods widens when moving toward deeper terms, which confirms the importance of incorporating semantic and architectural similarities for deeper GO terms.

Materials and methods

In the proposed method, the functionalities of a protein are described by a binary assignment vector whose elements show whether or not the protein is responsible for a GO term. Without loss of generality, we can describe all correlations between GO terms as a joint distribution over the assignment vectors of all proteins. For instance, suppose it is impossible for a protein to be responsible for two particular GO terms simultaneously; then the probability of assignment vectors in which the elements corresponding to these two terms are active at the same time is zero. The proposed model learns a joint distribution over the function assignment vectors given the input protein sequence. This helps to extract semantic relations between GO terms for particular sequence patterns. Hence, our model is capable of extracting more complicated relations. In the following subsections, we show how we learn the conditional joint distributions and utilize them to annotate protein sequences.

Let x denote a protein sample that contains either hand-crafted features of a protein sequence or the raw protein sequence itself, and let $y \in \mathbb{R}^c$ denote an assignment vector, where c is the number of GO terms. We denote the conditional distribution over assignment vectors y, given the protein x, by $p_{GO}(y|x)$. We define a distribution $p_m(y|x)$, modeled by a deep neural network, to estimate $p_{GO}(y|x)$.

Wasserstein generative adversarial network

Motivated by the success of GAN networks, many extensions have been introduced. Here, we adopt the Wasserstein Generative Adversarial Network (WGAN) [42]. Indeed, taking the arguments of [42] into account, we believe WGAN performs better for learning a distribution over our discrete space, i.e. the space of all function assignment vectors. The loss function of a WGAN is defined as follows:

$\arg\min_G \arg\max_{D \in \mathcal{L}(1)} \; \mathbb{E}_{y \sim p_r}[D(y)] - \mathbb{E}_{y \sim p_m}[D(y)]$, (1)

where $p_r$ is the desired distribution to be modeled, $p_m$ is the distribution learned by the generator G, and $\mathcal{L}(1)$ denotes the family of 1-Lipschitz functions. Moreover, D and G denote the discriminator and generator networks, respectively. The term inside the argmin implicitly measures the distance between the two distributions $p_r$ and $p_m$ by utilizing the discriminator network. Hence, the model attempts to find a generator whose distribution $p_m$ provides a good estimate of $p_r$.
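To make the adversarial objective concrete, the following minimal sketch (ours, not the authors' released code) computes empirical estimates of the WGAN losses for a batch: the critic (discriminator) maximizes the score gap of Eq (1), while the generator maximizes the critic's score on its own outputs. All names here are illustrative.

```python
import tensorflow as tf

def critic_loss(critic, real_y, fake_y):
    # Critic maximizes E[D(y_real)] - E[D(y_fake)]; we minimize the negation.
    return -(tf.reduce_mean(critic(real_y)) - tf.reduce_mean(critic(fake_y)))

def generator_wloss(critic, fake_y):
    # Generator maximizes the critic's score on generated samples.
    return -tf.reduce_mean(critic(fake_y))
```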

Since we are trying to find the function assignment vector given the input protein sequence, inspired by the idea of conditional GAN [43], we design a conditional generator and a conditional discriminator. The real distribution that we try to learn is the distribution over assignment vectors conditioned on the protein sample, which is denoted by pGO(y|x). The discriminator also takes the protein sequence and a function assignment vector (that can be a generated vector by the generator network or the real target vector for the protein sequence) and distinguishes whether this vector is real or fake (i.e. a generated one). Therefore, the loss function of the conditional WGAN for this problem can be defined as follows:

$\arg\min_G \arg\max_{D \in \mathcal{L}(1)} \; \mathbb{E}_{x \sim p_x}\big[\mathbb{E}_{y \sim p_{GO}(y|x)}[D(x,y)] - \mathbb{E}_{y \sim p_m(y|x)}[D(x,y)]\big]$ (2)

Eq (2) shows the general conditional WGAN loss function. In the following subsections, we explain the structures and detailed loss functions of the generator and the discriminator for the protein function prediction problem. The proposed method is called PFP-WGAN, since Protein Function Prediction is accomplished by a conditional WGAN.

Generator structure and loss function

The generator structure that we use for assigning functions directly to raw protein sequences is depicted in Fig 1. The raw sequence is represented to the model as a sequence of 8000-dimensional one-hot vectors. Since these vectors are extremely sparse, we place an embedding layer at the beginning of the generator, which converts each one-hot input into a dense vector of length 128. To make the generator stochastic, we add a dropout layer with a rate of 0.2, which also reduces the risk of over-fitting.

Fig 1. The proposed method for protein function prediction.

Fig 1

Given the input sequence of amino acids, the generator has an embedding as its first layer, which converts one-hot vectors to more compact representations. Then, one-dimensional convolution filters are employed to explore meaningful sequential patterns, and biochemical and biophysical features of the input are extracted from the obtained activation maps. These features are used to predict GO annotations by the last fully connected layer. A discriminator judges the validity of the obtained annotation for the sequence by observing pairs of protein sequences and their experimentally approved annotations in SwissProt. By observing proteins that are annotated with a common set of GO terms, the discriminator can extract correlations between GO terms.

For the next layer, we use 32 one-dimensional convolution filters to extract meaningful patterns from the sequence of amino acids. After the training process, each filter is responsible for detecting a specific pattern; by patterns we mean the existence of a particular sequence of amino acids at specific positions. Meaningful patterns are those that are correlated with different GO terms. When a sequence is passed through these filters, an activation map is obtained that shows the matching score between the patterns and the input sequence. These filters are followed by a LeakyReLU activation function. To keep the resulting activation maps smaller and more manageable, they are then passed through an average-pooling layer with a filter size of 64 and a stride of 32. We then send the activation maps through two fully connected layers, which learn nonlinear functions of the activation maps. If the model has been trained successfully, the outputs of these functions capture biochemical and biophysical features of a protein that are related to its functionality, and the generator annotates protein sequences according to them. The size of the last layer is equal to the number of GO terms, and this layer yields the resulting assignment vector. A Tanh activation function is utilized in this layer.
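A minimal Keras sketch of this generator is given below. The embedding size (128), dropout rate (0.2), number of convolution filters (32), pooling filter size (64), and stride (32) follow the text; the convolution kernel size and the widths of the two fully connected layers are not specified in the paper and are our assumptions.

```python
from tensorflow.keras import layers, models

def build_generator(seq_len=1000, vocab_size=8000, n_terms=932):
    # Sketch of the generator in Fig 1. Kernel size (16) and dense widths
    # (1024, 512) are assumptions; the rest follows the paper's description.
    inp = layers.Input(shape=(seq_len,), dtype="int32")   # trigram indices
    h = layers.Embedding(vocab_size, 128)(inp)            # dense re-encoding
    h = layers.Dropout(0.2)(h)                            # stochastic generator
    h = layers.Conv1D(32, kernel_size=16)(h)              # sequential patterns
    h = layers.LeakyReLU()(h)
    h = layers.AveragePooling1D(pool_size=64, strides=32)(h)
    h = layers.Flatten()(h)
    h = layers.Dense(1024)(h)
    h = layers.LeakyReLU()(h)
    h = layers.Dense(512)(h)
    h = layers.LeakyReLU()(h)
    out = layers.Dense(n_terms, activation="tanh")(h)     # assignment vector
    return models.Model(inp, out)
```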

We also compare our method with a recent method that does not operate on raw protein sequences. There, we use hand-crafted sequence features obtained by experts (as in the compared method) and extract protein functionalities from them. Therefore, in this scenario, the generator takes hand-crafted features and, without embedding and convolutional layers, yields the function assignment vector. This generator includes three fully connected layers with the LeakyReLU activation function and a final fully connected layer with the Tanh activation function to produce the output.

In order to train the generator, we use the following loss function:

$\arg\min_G \; \mathbb{E}_{x \sim p_x}\big[\mathbb{E}_{y \sim p_{GO}(y|x)}[\mathcal{L}(y, G(x))] - \lambda_1 D(x, G(x))\big]$, (3)

where the first term is the binary cross entropy loss $\mathcal{L}$, which directly compares the generator output for a given sample with its ground truth. The second term is the Wasserstein loss for the generator, equal to the second term in Eq (2): since $y \sim p_m(y|x)$ is the generator output, we can replace it with G(x). Finally, $\lambda_1$ is a hyper-parameter, set to 0.03 in the first experiment and 0.00001 in the second, chosen according to the performance on the validation set.
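The sketch below illustrates how Eq (3) could be assembled from the two terms described above. It assumes a two-input Keras discriminator and that the tanh outputs are rescaled to [0, 1] before the cross-entropy, a detail the paper does not spell out.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def generator_loss(discriminator, x, y_true, y_fake, lambda1=0.03):
    # Eq (3): supervised binary cross-entropy plus the adversarial term.
    # The generator's tanh outputs lie in [-1, 1]; we assume they are
    # rescaled to [0, 1] before the cross-entropy (our assumption).
    y_prob = (y_fake + 1.0) / 2.0
    supervised = bce(y_true, y_prob)
    adversarial = -tf.reduce_mean(discriminator([x, y_fake]))
    return supervised + lambda1 * adversarial
```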

Discriminator structure and loss function

We use two sets of real and fake pairs for training our conditional discriminator. The first is the real set, which includes pairs of an input x and the corresponding assignment vector in the training data. The second set includes fake pairs consisting of x and the corresponding generator output. The discriminator structure is shown in Fig 1. We extract a feature vector from the raw sequence x through embedding, convolutional, average-pooling, and fully connected layers (exactly the same as the first four layers of the generator) to serve as the condition for the discriminator. In the discriminator network, we first use a fully connected layer to extract features from assignment vectors. The discriminator then concatenates this feature vector with the prepared condition (from the input) and sends them through 5 fully connected layers. The last layer is a single neuron that scores (protein, function) pairs to distinguish real pairs from fake ones. The discriminator loss function is formulated as:

$\arg\min_D \; \mathbb{E}_{x \sim p_x}\big[D(x, G(x)) - \mathbb{E}_{y \sim p_{GO}(y|x)}[D(x,y)]\big] + \lambda_2 \, \mathbb{E}_{(\tilde{x},\tilde{y}) \sim \tilde{p}}\big[(\|\nabla_{\tilde{y}} D(\tilde{x},\tilde{y})\|_2 - 1)^2\big]$ (4)

where the first term is the Wasserstein loss for the discriminator, obtained from Eq (2). In Eq (4), we omit the constraint $\mathcal{L}(1)$ from the search space of D(.) and replace it with the second term, the gradient penalty proposed by [44], which keeps the gradient norm of the discriminator around 1. The pairs $(\tilde{x},\tilde{y})$ are produced by weighted averaging (with random weights drawn from a uniform distribution) of real and fake pairs, and $\tilde{p}$ denotes the resulting distribution. Finally, $\lambda_2$ is a hyper-parameter, set to 10 in both experiments, chosen according to the performance on the validation set.
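The gradient penalty term of Eq (4) can be estimated on a batch as in the following sketch (based on [44], not the authors' code). The interpolation weights are drawn uniformly, and the penalty is applied to the gradient with respect to the interpolated annotation vectors.

```python
import tensorflow as tf

def gradient_penalty(discriminator, x, y_real, y_fake):
    # Second term of Eq (4): interpolate real and generated annotation
    # vectors with uniform random weights, then penalize deviations of the
    # critic's gradient norm (w.r.t. the interpolates) from 1.
    eps = tf.random.uniform([tf.shape(y_real)[0], 1], 0.0, 1.0)
    y_hat = eps * y_real + (1.0 - eps) * y_fake
    with tf.GradientTape() as tape:
        tape.watch(y_hat)
        d_hat = discriminator([x, y_hat])
    grads = tape.gradient(d_hat, y_hat)
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=1) + 1e-12)
    return tf.reduce_mean(tf.square(norms - 1.0))
```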

Training and optimization

We train the generator and the discriminator networks alternately with a training ratio of 10: for each iteration of generator training, the discriminator is trained for 10 iterations. The loss functions are optimized by the Adam optimizer with a learning rate of 0.00001, and 20% of the training data is used for validation. Thus, we find the network's weights using 80% of the data and then evaluate the resulting model on the remaining 20% to tune the hyper-parameters. The algorithm is implemented with the Keras deep learning library and trained and tested on an NVIDIA GeForce GTX 1080 Ti GPU.
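Putting the pieces together, one alternating training step might look like the sketch below, reusing the loss and penalty functions sketched earlier. The 10:1 training ratio and the Adam learning rate of 0.00001 follow the text, while the batching details are our assumptions.

```python
import tensorflow as tf

g_opt = tf.keras.optimizers.Adam(learning_rate=1e-5)
d_opt = tf.keras.optimizers.Adam(learning_rate=1e-5)
N_CRITIC = 10  # discriminator iterations per generator iteration

def train_step(generator, discriminator, x, y_true, lambda1, lambda2=10.0):
    # One alternating step: 10 discriminator updates (Eq 4), then one
    # generator update (Eq 3).
    for _ in range(N_CRITIC):
        with tf.GradientTape() as tape:
            y_fake = generator(x, training=True)
            d_loss = (tf.reduce_mean(discriminator([x, y_fake]))
                      - tf.reduce_mean(discriminator([x, y_true]))
                      + lambda2 * gradient_penalty(discriminator, x,
                                                   y_true, y_fake))
        grads = tape.gradient(d_loss, discriminator.trainable_variables)
        d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))
    with tf.GradientTape() as tape:
        y_fake = generator(x, training=True)
        g_loss = generator_loss(discriminator, x, y_true, y_fake, lambda1)
    grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(grads, generator.trainable_variables))
    return d_loss, g_loss
```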

Dataset and data representation

We report results on two different datasets. In the first experiment, we use the data gathered and filtered by [30], in which protein sequences are obtained from SwissProt (downloaded in January 2016). In this dataset, sequences longer than 1002 residues or containing ambiguous amino acids are filtered out, and sequences shorter than 1002 are padded with zeros. In addition, similar to [30], we keep only sequences annotated with an experimental evidence code (EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC). The GO terms [16] are downloaded in OBO format (http://geneontology.org/page/download-ontology) in January 2016 and, similar to [30], terms with fewer than 250, 50, and 50 annotated proteins in the biological process (BP), molecular function (MF), and cellular component (CC) groups, respectively, are omitted. This results in 932, 589, and 436 terms in each group, respectively. Finally, proteins are randomly divided into training (80%) and test (20%) sets. To represent raw sequences to our model (Fig 1), we divide them into trigrams of amino acids with an overlap of two. Considering a dictionary of all possible trigrams, each trigram can be shown as a one-hot vector of length 8000. Therefore, each protein sequence is represented by 1000 vectors of length 8000.
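A sketch of this trigram encoding is shown below: the 20 standard amino acids yield a dictionary of 20^3 = 8000 trigrams, and the resulting indices (rather than explicit one-hot vectors) are what the embedding layer consumes. The padding convention here is our simplification.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
TRIGRAM_INDEX = {"".join(t): i
                 for i, t in enumerate(product(AMINO_ACIDS, repeat=3))}

def encode_sequence(seq, max_len=1002):
    # Overlapping trigrams (overlap 2, i.e. stride 1): a 1002-residue
    # sequence yields 1000 trigrams, each an index into the 8000-entry
    # dictionary. Sequences with ambiguous residues are assumed to have
    # been filtered out beforehand.
    seq = seq[:max_len]
    ids = [TRIGRAM_INDEX[seq[i:i + 3]] for i in range(len(seq) - 2)]
    # Simplified zero-padding; a real implementation would reserve a
    # dedicated padding index rather than reusing index 0 ("AAA").
    return ids + [0] * (max_len - 2 - len(ids))
```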

The second dataset is identical to the one used in FFPred3 [45]. Protein sequences are from SwissProt version 2015-5 and are encoded into 258 features covering 14 structural and functional aspects. The GO terms are downloaded in February 2015 and include 605 BP terms, 158 MF terms, and 102 CC terms.

Evaluation measures

The main measure used to assess the different methods is the protein-centric Fmax, which was used in the CAFA challenge [15] and in many recent related works [30, 31]. To calculate it, we define 100 different thresholds t ∈ [0, 1]. Then, for each protein and threshold t, we obtain the number of labels correctly assigned to the protein (tp), the number of the protein's labels that the model fails to assign (fn), and the number of labels falsely assigned to the protein by the model (fp). The precision and recall are then calculated as follows:

$\mathrm{Precision}_t = \dfrac{tp_t}{tp_t + fp_t}$, (5)
$\mathrm{Recall}_t = \dfrac{tp_t}{tp_t + fn_t}$, (6)

We average the above measures over all proteins to obtain $\mathrm{AvePr}_t$ and $\mathrm{AveRe}_t$ at each threshold. Finally, Fmax is calculated as follows:

$F_{max} = \max_t \left\{ \dfrac{2 \times \mathrm{AvePr}_t \times \mathrm{AveRe}_t}{\mathrm{AvePr}_t + \mathrm{AveRe}_t} \right\}$. (7)
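The following sketch computes the protein-centric Fmax from a matrix of predicted scores. For simplicity it averages precision over all proteins, whereas the CAFA protocol averages precision only over proteins with at least one prediction at threshold t.

```python
import numpy as np

def protein_centric_fmax(y_true, y_score, n_thresholds=100):
    # y_true: (n_proteins, n_terms) 0/1 matrix; y_score: predicted scores.
    # For each threshold, count per-protein tp/fp/fn, average precision and
    # recall over proteins (Eqs 5-6), and maximize F over thresholds (Eq 7).
    y_true = y_true.astype(bool)
    best = 0.0
    for t in np.linspace(0.0, 1.0, n_thresholds):
        pred = y_score >= t
        tp = np.sum(pred & y_true, axis=1)
        fp = np.sum(pred & ~y_true, axis=1)
        fn = np.sum(~pred & y_true, axis=1)
        prec = np.mean(tp / np.maximum(tp + fp, 1))
        rec = np.mean(tp / np.maximum(tp + fn, 1))
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best
```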

The other measure we use is the term-centric F1, which is calculated for each label separately. By defining tp as the number of samples correctly assigned to a label, fn as the number of samples wrongly not assigned to that label, and fp as the number of samples wrongly assigned to it, we use Eqs (5) and (6) to calculate the term-centric F1 as:

$F_1 = \max_t \left\{ \dfrac{2 \times \mathrm{Precision}_t \times \mathrm{Recall}_t}{\mathrm{Precision}_t + \mathrm{Recall}_t} \right\}$. (8)

We also use three other term-centric measures that are useful for evaluating methods on imbalanced classification problems, in which the number of available training samples in one class is much smaller than in another. The first is the Area Under the Precision-Recall curve (AUPR), obtained for each label as follows:

$\mathrm{AUPR} = -\int \mathrm{Precision}_t \times d\,\mathrm{Recall}_t$. (9)

We average these values over all labels and report the result. The second and third measures are the Area Under the ROC Curve (AUC-ROC) and the Matthews Correlation Coefficient (MCC), which are computed as in [30].
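These term-centric measures can be computed per label with standard library routines, as in the sketch below. The 0.5 binarization threshold for the MCC is our assumption, and terms lacking both positive and negative test samples are skipped.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, roc_auc_score,
                             matthews_corrcoef)

def term_centric_scores(y_true, y_score, threshold=0.5):
    # Per-term AUPR (Eq 9), AUC-ROC, and MCC, averaged over evaluable terms.
    auprs, aucs, mccs = [], [], []
    for j in range(y_true.shape[1]):
        t, s = y_true[:, j], y_score[:, j]
        if t.min() == t.max():
            continue  # skip terms with only one class present
        auprs.append(average_precision_score(t, s))
        aucs.append(roc_auc_score(t, s))
        mccs.append(matthews_corrcoef(t, (s >= threshold).astype(int)))
    return np.mean(auprs), np.mean(aucs), np.mean(mccs)
```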

Finally, to check the consistency of results with the true-path-rule, we define the TPR score. We calculate this measure as follows:

$\mathrm{TPR} = \dfrac{1}{N} \sum_{n=1}^{N} \sum_{t_i \in \mathrm{annot}(p_n)} \mathrm{card}\big(\mathrm{anc}(t_i) - [\mathrm{anc}(t_i) \cap \mathrm{annot}(p_n)]\big)$ (10)

where card(.) denotes the cardinality of a set, annot(.) is the set of terms in the annotation of a protein, and anc(.) is the set of ancestors of a GO term. According to the true path rule of GO annotations, when a protein is annotated with a GO term, it should also be annotated with the corresponding ancestor terms. A conflict occurs when a protein is not annotated with one of the ancestors of the terms in its annotation. The TPR score calculates the expected number of conflicts in the annotation of a protein.
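A direct implementation of Eq (10) is straightforward, as sketched below: for each annotated term, count the ancestors missing from the protein's annotation, then average over proteins.

```python
def tpr_score(annotations, ancestors):
    # Eq (10): average number of true-path-rule conflicts per protein.
    # annotations: one set of predicted GO terms per protein;
    # ancestors: dict mapping each GO term to the set of its ancestor terms.
    total = 0
    for annot in annotations:
        for term in annot:
            # Ancestors required by the true-path rule but missing here.
            total += len(ancestors.get(term, set()) - annot)
    return total / len(annotations)
```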

Results and discussions

Experiment 1

In this section, we evaluate PFP-WGAN on the first dataset used in [30] and employ similar settings. The end-to-end structure of the proposed model enables us to extract features from protein sequences and learn GO term correlations simultaneously. Here, we utilize raw amino-acid sequences as the input to the model and compare PFP-WGAN with BLAST [12] and DeepGO-Seq [30]; the settings of these algorithms are exactly as in [30]. DeepGO [30] utilizes a deep network to extract features from protein sequences and protein-protein interaction (PPI) networks and then predicts the proteins' functions. The authors in [30] attempt to incorporate the structural information of GO by adding a maximization layer that explicitly enforces the "true path rule" at the final step of the network. However, for most known protein sequences, protein-protein interaction information is not available. They discuss that in this situation one can use the PPI information of the protein most similar to the query sequence, found by BLAST [12]; however, they did not report their method's performance in this situation. In addition, this approach limits the algorithm to predicting functions only for sequences that have a sufficiently similar protein among those with PPI information. Here, DeepGO-Seq, the version of DeepGO that uses only protein sequences to predict functions [30], is compared with our method, which likewise uses only protein sequences as input. The results of BLAST [12], DeepGO-Seq [30], and PFP-WGAN are compared in Fig 2. In all three branches of GO, PFP-WGAN outperforms BLAST and DeepGO-Seq. Despite the strength of deep networks in feature extraction, DeepGO-Seq performs worse than BLAST in the Biological Process (BP) and Molecular Function (MF) branches. However, as shown in Fig 2, PFP-WGAN obtains a better Fmax value than both competing algorithms. It is worth mentioning that the proposed method, as opposed to [30], does not employ any additional knowledge (such as the GO DAG) and automatically discovers the relations between different labels (i.e. GO terms).

Fig 2. Comparison of BLAST, DeepGO-Seq and PFP-WGAN on dataset 1.

Fig 2

The Fmax measure shows the superiority of PFP-WGAN in all three parts of the GO.

Table 1 compares DeepGO-Seq and PFP-WGAN on three term-centric measures. PFP-WGAN shows better performance in all cases. Considering that both methods utilize a deep network, this result confirms that our discriminator block can effectively extract GO term correlations and impose them on the generator to produce more accurate annotations.

Table 1. Three term-centric measures suitable for imbalanced data.

BP MF CC
Method AUPR AUC MCC AUPR AUC MCC AUPR AUC MCC
DeepGO-Seq 0.232 0.82 0.269 0.28 0.88 0.336 0.522 0.926 0.519
PFP-WGAN 0.241 0.830 0.281 0.302 0.891 0.347 0.535 0.932 0.524

Table 2 compares the average prediction time required by DeepGO-Seq and PFP-WGAN. There is no considerable difference between the prediction times. However, as the number of terms increases, the average prediction time of DeepGO-Seq grows faster than that of PFP-WGAN, so PFP-WGAN is more scalable for predicting a large number of GO terms simultaneously.

Table 2. Average prediction time in seconds for 1000 sequences.

Method BP (932) MF (589) CC (436)
DeepGO-Seq 1.01 0.67 0.45
PFP-WGAN 0.96 0.84 0.73

We also calculate F1 for each GO term and average these values over the terms at each height of the GO graph. The differences between these averages for PFP-WGAN and DeepGO-Seq, as a function of the height of terms in the GO graph, are presented in Fig 3. As a general trend, the difference between the results of the two methods increases when moving toward deeper terms. This confirms our intuition about incorporating structural information between output variables (GO terms) during the learning process. Interestingly, terms at higher levels describe general functions, and positive samples of all their child terms can also be considered positive samples of the terms themselves. In the GO DAG, there is no further valuable correlation between such terms and their child terms that can help to increase the accuracy. Nonetheless, deeper terms can form complicated relations with non-descendant and non-ancestor terms in the GO graph. Thus, extracting correlations between deeper terms is more informative and has a higher impact on the classification accuracy. A main bottleneck for deeper terms in the GO graph is the shortage of positive samples, which limits the performance of prediction models for these important terms. Fig 4 shows the F1 measure for GO terms as a function of the number of positive training samples. The improvement obtained by PFP-WGAN for rare terms is more considerable compared to terms with large numbers of positive samples. This confirms that by incorporating GO term correlations (more general than those encoded in the GO DAG) we can compensate for this shortage and obtain better accuracy. Finally, the TPR score of PFP-WGAN on this dataset is 0.78, 0.3, and 0.11 for the BP, MF, and CC branches, respectively. In addition, the total number of grandchild-ancestor pairs in the tree of each branch is 8323, 3266, and 3106, respectively.

Fig 3. Differences between the average F1 obtained by PFP-WGAN and DeepGO-Seq for the GO terms at each height ($\bar{F}_P$ and $\bar{F}_D$).

Fig 3

In the BP branch (the most important part of GO, with a large number of terms), the differences increase when moving toward deeper terms. This pattern can also be observed in most parts of the charts for the CC and MF branches.

Fig 4. F1 obtained by PFP-WGAN and DeepGO-Seq for GO terms as a function of the number of available positive training samples.

Fig 4

The improvement obtained by PFP-WGAN for rare terms is more considerable compared to terms with large numbers of positive samples.

Experiment 2

Here, we compare PFP-WGAN with a recently proposed multi-task deep neural network for protein function prediction, MTDNN [31], and a shallow, greedy hierarchical multi-label classification strategy called CSSAG [20]. We evaluate the performance of the algorithms on the dataset introduced in FFPred3 [45]. This dataset includes 258 sequence-derived features for each protein sample and maps them to 605 BP, 158 MF, and 102 CC GO terms. The obtained Fmax measures for PFP-WGAN, MTDNN, CSSAG, and three baseline algorithms are shown in Fig 5. BLAST and FFPred are the first two baseline algorithms; STDNN, the third, uses a separate fully connected feedforward deep neural network for each GO term. Results of the baselines and MTDNN are reported from [31]. As shown in Fig 5, the proposed PFP-WGAN achieves the highest score in all three domains (BP, MF, and CC).

Fig 5. Comparison of BLAST, FFPRED, CSSAG, STDNN, MTDNN and PFP-WGAN on dataset 2.

Fig 5

The Fmax measure shows the superiority of PFP-WGAN in all three parts of the GO.

We also calculate a binary heatmap from the results of PFP-WGAN and PFP-S, where PFP-S is obtained by omitting from PFP-WGAN the discriminator block, which is responsible for extracting GO term correlations. For each set of annotation results, this heatmap shows whether each pair of GO terms appears in at least one protein simultaneously. We also calculate this heatmap for the training data as the ground truth, in which two GO terms are considered consistent if at least one protein is annotated with both of them simultaneously. Table 3 compares the heatmaps of PFP-WGAN and PFP-S against the ground truth using the mean squared error (MSE). The MSE of PFP-WGAN is lower than that of PFP-S in all three branches, confirming that PFP-WGAN is better able to discover and utilize such relations from the training data.

Table 3. MSE of PFP-WGAN and PFP-S against the ground truth.

BP MF CC
PFP-S 0.46 0.41 0.51
PFP-WGAN 0.38 0.39 0.46
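The heatmap comparison can be reproduced with a few lines, as in the sketch below (ours, for illustration); predicted annotation matrices are assumed to be binarized before the co-occurrence counts are taken.

```python
import numpy as np

def cooccurrence_heatmap(Y):
    # Y: (n_proteins, n_terms) binary annotation matrix (predictions are
    # assumed to be thresholded beforehand). Entry (i, j) is 1 iff terms
    # i and j co-occur in at least one protein.
    Y = np.asarray(Y, dtype=int)
    return ((Y.T @ Y) > 0).astype(float)

def heatmap_mse(Y_pred, Y_train):
    # Compare the predicted co-occurrence structure against the heatmap of
    # the training data (the ground truth), as in Table 3.
    diff = cooccurrence_heatmap(Y_pred) - cooccurrence_heatmap(Y_train)
    return float(np.mean(diff ** 2))
```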

Fig 6 shows the sensitivity of PFP-WGAN to the parameter λ1. This parameter is chosen according to three measures on validation data: Fmax, micro-F1, and the average of F1 over all terms. We sum these measures to obtain F and select the value of λ1 that maximizes it.

Fig 6. Sensitivity of PFP-WGAN to the parameter λ1.

Fig 6

Finally, the TPR score of PFP-WGAN on this dataset is 0.55, 0.11, and 0.13 for the BP, MF, and CC branches, respectively. In addition, the total number of grandchild-ancestor pairs in the tree of each branch is 292, 84, and 44, respectively.

Conclusion

As a consequence of improvements in high-throughput sequencing technologies, we are faced with a large number of protein sequences in nature about which there is no other available knowledge. This fact increases the importance of developing techniques to determine these proteins' functionalities from their sequences alone. An early track of such methods is based on finding, for a query sequence, the most similar protein in a database of known annotations [11] and assigning the functions of the retrieved protein to the query. A main limitation of such algorithms arises for sequences that have no adequately similar protein in the database. Thus, another trend is based on proposing algorithms that are able to extract biologically meaningful features from a sequence. Deep networks are the most powerful of the currently known models for feature extraction. Here, we propose a new deep architecture for protein function prediction which uses protein sequences only and is able to increase annotation accuracy. The main strength of our model is its ability to extract function correlations and impose them on the annotating process. To the best of our knowledge, this is the first time that a conditional GAN architecture has been used to improve the accuracy of a multi-label classification problem. Another advantage of our model concerns deeper terms in the GO graph: by extracting term correlations, we are able to decrease the sample complexity of deep terms and obtain higher accuracy. Therefore, we can find more detailed and specific annotations for proteins. The main drawback of the proposed model is that it requires relatively more computational resources, similar to other deep networks.

Our future work includes interpreting the features extracted by our deep network, which would help explain how it annotates proteins. Considering the ability of deep models in feature extraction, we hope to find important biochemically and biophysically meaningful features.

Acknowledgments

The authors would like to thank Amirali Moeinfar for his help in implementing the Wasserstein GAN architecture.

Data Availability

All relevant data are publicly accessible via the following URL: http://git.dml.ir/seyyedsalehi/PFP-WGAN.

Funding Statement

This work was supported by Iran National Science Foundation (INSF) [Grant No. 96006077]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Roy A, Yang J, Zhang Y. COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Res. 2012; 40: 938–950. doi: 10.1093/nar/gks372
  • 2. Gligorijević V, Barot M, Bonneau R. deepNF: deep network fusion for protein function prediction. Bioinformatics. 2018; 34(22): 3873–3881. doi: 10.1093/bioinformatics/bty440
  • 3. Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2014; 43(D1): D447–D452. doi: 10.1093/nar/gku1003
  • 4. Alshahrani M, Khan MA, Maddouri O, Kinjo AR, Queralt-Rosinach N, Hoehndorf R. Neuro-symbolic representation learning on biological knowledge graphs. Bioinformatics. 2017; 33(17): 2723–2730. doi: 10.1093/bioinformatics/btx275
  • 5. You R, Zhang Z, Xiong Y, Sun F, Mamitsuka H, Zhu S. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics. 2018; 34(14): 2465–2473. doi: 10.1093/bioinformatics/bty130
  • 6. Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. A statistical framework for genomic data fusion. Bioinformatics. 2004; 20(16): 2626–2635. doi: 10.1093/bioinformatics/bth294
  • 7. Cozzetto D, Buchan DW, Bryson K, Jones DT. Protein function prediction by massive integration of evolutionary analyses and multiple data sources. BMC Bioinformatics. 2013; 14(Suppl. 3): S1. doi: 10.1186/1471-2105-14-S3-S1
  • 8. Zhang C, Zheng W, Freddolino PL, Zhang Y. MetaGO: predicting Gene Ontology of non-homologous proteins through low-resolution protein structure prediction and protein–protein network mapping. Journal of Molecular Biology. 2018; 430(15): 2256–2265. doi: 10.1016/j.jmb.2018.03.004
  • 9. Zhang F, Song H, Zeng M, Li Y, Kurgan L, Li M. DeepFunc: a deep learning framework for accurate prediction of protein functions from protein sequences and interactions. Proteomics. 2019; 1900019. doi: 10.1002/pmic.201900019
  • 10. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2018; 46(5): 2699. doi: 10.1093/nar/gky092
  • 11. Shehu A, Barbará D, Molloy K. A survey of computational methods for protein function prediction. Big Data Analytics in Genomics. 2016; 11(1): 225–298. doi: 10.1007/978-3-319-41279-5_7
  • 12. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25(17): 3389–3402. doi: 10.1093/nar/25.17.3389
  • 13. Makrodimitris S, van Ham RC, Reinders MJ. Improving protein function prediction using protein sequence and GO-term similarities. Bioinformatics. 2018; 35(7): 1116–1124.
  • 14. Gong Q, Ning W, Tian W. GoFDR: a sequence alignment based method for predicting protein functions. Methods. 2016; 93: 3–14. doi: 10.1016/j.ymeth.2015.08.009
  • 15. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, et al. A large-scale evaluation of computational protein function prediction. Nature Methods. 2013; 10(3): 221. doi: 10.1038/nmeth.2340
  • 16. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nature Genetics. 2000; 25(1): 25. doi: 10.1038/75556
  • 17. Boutet E, Lieberherr D, Tognolli M, Schneider M, Bansal P, Bridge AJ, et al. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. In Plant Bioinformatics. 2016. pp. 23–54.
  • 18. Frasca M, Cesa-Bianchi N. Multitask protein function prediction through task dissimilarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2017.
  • 19. Khatri P, Done B, Rao A, Done A, Draghici S. A semantic analysis of the annotations of the human genome. Bioinformatics. 2005; 21(16): 3416–3421. doi: 10.1093/bioinformatics/bti538
  • 20. Bi W, Kwok JT. Multi-label classification on tree- and DAG-structured hierarchies. In Proceedings of the 28th International Conference on Machine Learning (ICML). 2011. pp. 17–24.
  • 21. Masseroli M, Chicco D, Pinoli P. Probabilistic latent semantic analysis for prediction of gene ontology annotations. In International Joint Conference on Neural Networks (IJCNN). 2012. pp. 1–8.
  • 22. Zhong X, Kaalia R, Rajapakse JC. GO2Vec: transforming GO terms and proteins to vector representations via graph embeddings. BMC Genomics. 2019. pp. 1–10.
  • 23. Grover A, Leskovec J. node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. pp. 855–864.
  • 24. Smaili FZ, Gao X, Hoehndorf R. Onto2Vec: joint vector-based representation of biological entities and their ontology-based annotations. Bioinformatics. 2018. pp. i52–i60. doi: 10.1093/bioinformatics/bty259
  • 25. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS). 2013. pp. 3111–3119.
  • 26. Wang H, Yan L, Huang H, Ding C. From protein sequence to protein function via multi-label linear discriminant analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2017; 14(3): 503–513. doi: 10.1109/TCBB.2016.2591529
  • 27. Min S, Lee B, Yoon S. Deep learning in bioinformatics. Briefings in Bioinformatics. 2017; 18(5): 851–869.
  • 28. Asgari E, Mofrad MR. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One. 2015; 10(11): e0141287. doi: 10.1371/journal.pone.0141287
  • 29. Liu X. Deep recurrent neural network for protein function prediction from sequence. arXiv:1701.08318 [Preprint]. 2017. Available from: https://arxiv.org/abs/1701.08318.
  • 30. Kulmanov M, Khan MA, Hoehndorf R. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics. 2017; 34(4): 660–668. doi: 10.1093/bioinformatics/btx624
  • 31. Fa R, Cozzetto D, Wan C, Jones DT. Predicting human protein function with multi-task deep neural networks. PLoS One. 2018; 13(6): e0198216. doi: 10.1371/journal.pone.0198216
  • 32. Duong DB, Gai L, Uppunda A, Le D, Eskin E, Li JJ, et al. Annotating Gene Ontology terms for protein sequences with the Transformer model. bioRxiv [Preprint]. 2020. Available from: https://www.biorxiv.org/content/10.1101/2020.01.31.929604v1.abstract.
  • 33. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS). 2017. pp. 5998–6008.
  • 34. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS). 2014. pp. 2672–2680.
  • 35. Choi Y, Choi M, Kim M, Ha JW, Kim S, Choo J. StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. pp. 8789–8797.
  • 36. Zhang Z, Yang L, Zheng Y. Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. pp. 9242–9251.
  • 37. Ghasedi Dizaji K, Wang X, Huang H. Semi-supervised generative adversarial network for gene expression inference. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2018. pp. 1435–1444.
  • 38. Ghahramani A, Watt FM, Luscombe NM. Generative adversarial networks simulate gene expression and predict perturbations in single cells. bioRxiv [Preprint]. 2018. Available from: https://www.biorxiv.org/content/10.1101/262501v2.full.
  • 39. Gupta A, Zou J. Feedback GAN for DNA optimizes protein functions. Nature Machine Intelligence. 2019; 1: 105–111. doi: 10.1038/s42256-019-0017-4
  • 40. Wang Y, Wang H, Liu L, Wang X. Synthetic promoter design in Escherichia coli based on generative adversarial network. bioRxiv [Preprint]. 2019. Available from: https://www.biorxiv.org/content/10.1101/563775v1.abstract.
  • 41. Wan C, Jones DT. Improving protein function prediction with synthetic feature samples created by generative adversarial networks. bioRxiv [Preprint]. 2019. Available from: https://www.biorxiv.org/content/10.1101/730143v1.abstract.
  • 42. Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML). 2017. pp. 214–223.
  • 43. Mirza M, Osindero S. Conditional generative adversarial nets. arXiv:1411.1784 [Preprint]. 2014. Available from: https://arxiv.org/abs/1411.1784.
  • 44. Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NIPS). 2017. pp. 5767–5777.
  • 45. Cozzetto D, Minneci F, Currant H, Jones DT. FFPred 3: feature-based function prediction for all Gene Ontology domains. Scientific Reports. 2016; 6: 31865. doi: 10.1038/srep31865

Decision Letter 0

Vasilis J Promponas

23 Jun 2020

PONE-D-20-05418

PFP-WGAN: Protein Function Prediction by Discovering Gene Ontology Term Correlations with Generative Adversarial Networks

PLOS ONE

Dear Dr. Rabiee,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Two expert reviewers have seen your manuscript and I trust that you will find their comments (see at the bottom of this email) invaluable for preparing a revised version of your work. They highlight several important points - both in terms of presentation as well as technical issues - that should be carefully addressed in a revised manuscript.

Please submit your revised manuscript by Aug 07 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Vasilis J Promponas

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript entitled “PFP-WGAN: Protein Function Prediction by Discovering Gene Ontology Term Correlations with Generative Adversarial Networks” by S. Seyyedsalehi, M. Soleymani, H. Rabiee and M. Mofrad describes PFP-WGAN, a sequence-based tool for protein function prediction. The Authors constructed PFP-WGAN, a conditional generative adversarial network which helps to effectively discover and incorporate term correlations in the gene function annotation process. PFP-WGAN is evaluated and compared to similar methods reported previously. According to the authors, one of the key advantages of the method is the increased accuracy in predicting more specific function terms associated with the gene.

The manuscript deals with an important problem and tackles it with cutting-edge methodology. However, more detailed explanations are necessary to bring the methodology closer to the PLOS ONE readership.

Major concerns:

Generative Adversarial Networks should be mentioned in the Introduction, together with a few lines describing the advantages of the approach. The section "Generator Structure and Loss Function" needs to be described in more detail.

The authors stated that convolution filters allow for filtering meaningful patterns. Please explain in more detail the filtering procedure and what a meaningful pattern is.

Figure 1 needs to be described in much more detail. If possible, stages of biological data flow should also be presented. This will facilitate method comprehension for life scientists.

The Authors should provide information on processing time. How fast is PFP-WGAN in predicting functions for 1000 proteins? How does PFP-WGAN compare to similar methods such as DeepGO?

The Authors should provide an example of both input and output files.

Contrary to what is claimed in the Fig. 4 description, FFPred is superior in predicting CC terms. This should be corrected and discussed.

Case studies on how PFP-WGAN outperforms other methods in predicting deep terms in various subontologies would be a valuable addition to the manuscript.

The complete training and test sets need to be submitted.

Minor concerns:

Please use the term “child terms” instead of “children”.

Reviewer #2: General Summary: The manuscript by Seyyedsalehi et al. proposes a novel way to train neural protein function predictors. Instead of a standard classification loss, the authors attempt to capture GO term correlations by training an adversarial network. They compare to a multi-label CNN and a multi-task multi-layer perceptron and show that their approach achieves higher Fmax.

The GO is a complicated structure with several constraints such as the true-path rule, term co-occurrences and mutual exclusivities, making it difficult to come up with a good “hand-crafted” loss function that also reflects the “realism” of a predicted GO annotation. Therefore, the concept of this study, i.e. learning if a prediction is realistic or not from data is innovative and very interesting. However, I have some concerns about the experiments, mainly the use of appropriate baselines, and the lack of interpretation of the results.

Major comments

1) One thing that troubled me while reading the manuscript is whether this model should be called a GAN. GANs are Generative models whose input is a noise vector (and an extra feature vector in the case of the conditional GANs) and their goal is typically to generate realistic-looking data, such as images, from scratch. Here, there is no noise input (Fig. 1), only a feature vector, and we are dealing with a classic classification task, where an output y is deterministically assigned to an input x. It is trained in an adversarial way, which is the novelty here, but I find that calling it a GAN is a little misleading.

2) In lines 40-51, three previously published methods of exploiting label correlations for function prediction are mentioned (refs 31,32, 28), but the authors do not compare to any of them, because they are “shallow”. I find that this is not a convincing argument not to compare to at least one of them. Comparing to a linear model would give a good baseline for label-correlation-based methods and provide further insight on the superiority of the proposed model. Another relevant linear label dimension reduction model that the authors could compare to is the following:

Bi W., Kwok J. (2011) Multi-label classification on tree- and DAG-structured hierarchies. In: International Conference on Machine Learning

3) Related to that, in lines 88-90 the authors claim that “this is the first time that a deep model is used to explore complex relations and semantic similarities between the GO terms”. This statement is incorrect. The authors should consider the following works: a) GO2vec, Zhong et al., BMC Genomics, 2020, b) Onto2vec, Smaili et al., Bioinformatics, 2018, and c) GOAT, Duong et al., biorxiv, 2020. The last one was specifically modelling GO terms with a deep net for protein function prediction. These are very relevant works and the authors need to benchmark their method against (some of) them.

4) Line 205: The authors mention using a validation set to tune hyperparameters, but in lines 178 and 198 they report “manually” setting the hyperparameters. It has to be clarified what other values were considered for these parameters and how the “manual” decision was made. If the test set is used to decide on these parameters, then the results cannot be trusted.

5) I completely missed the interpretation of the results. Yes, the proposed model works clearly better, but there is no evidence provided that the improvement is indeed due to exploiting label correlations as the authors claim. The first thing that I would like to see is whether the label vectors that are the output of the generator are consistent with the “true path rule”.

6) Again on the interpretation of label correlations: could the authors provide some examples of relationships between labels that their model manages to capture that are not captured by a traditional neural network? For example, there is already evidence that linear GO term correlation models can capture co-occurrence and mutual exclusivity relations between pairs of terms (ref 28). Can the proposed method go beyond this and find more complex relationships?

7) The generator is trained using a weighted sum of a standard cross-entropy loss and the novel adversarial loss proposed in this work. What is the effect of changing the weight parameter lambda_1? What if one makes it really small to only use the cross-entropy? Is then the performance gain lost? And if it is made much larger? Can the model learn to predict functions without the cross-entropy component?

Minor comments

8) The figures are barely readable in the pdf version. The authors should provide higher resolution versions.

9) Broken link that should contain the data (error 404:not found) https://github.com/ictic-bioinformatics/

10) GANs have been previously used in protein function prediction to generate negative examples (Wan and Jones, 2019, biorxiv)

11) Shouldn’t p_m and p_r be flipped in equation 1? Typically in GANs the discriminator has high output for real examples.

12) The DeepGO method is not really modelling label correlations, it is simply enforcing the “true path rule” of the GO graph.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Feb 25;16(2):e0244430. doi: 10.1371/journal.pone.0244430.r002

Author response to Decision Letter 0


30 Aug 2020

Dear Editor,

The authors would like to thank you for providing the opportunity to respond to the comments. The manuscript has been completely revised based on the reviewers’ comments. We also adjusted the images carefully according to the rules of PLOS ONE. In the revised manuscript, we have addressed all of the editor’s and reviewers’ concerns. A detailed response and discussion are provided in the following pages.

All the changes to the original manuscript, including new references, are highlighted in yellow in the marked-up copy of the manuscript. The authors’ responses appear in blue color below.

Sincerely Yours,

Hamid R. Rabiee

The corresponding author

August 30, 2020

Reviewer 1

Comments to the Author

The manuscript entitled “PFP-WGAN: Protein Function Prediction by Discovering Gene Ontology Term Correlations with Generative Adversarial Networks” by S. Seyyedsalehi, M. Soleymani, H. Rabiee and M. Mofrad describes PFP-WGAN, a sequence-based tool for protein function prediction. The Authors constructed PFP-WGAN, a conditional generative adversarial network which helps to effectively discover and incorporate term correlations in the gene function annotation process. PFP-WGAN is evaluated and compared to similar methods reported previously. According to the authors, one of the key advantages of the method is the increased accuracy in predicting more specific function terms associated with the gene.

The manuscript deals with an important problem and tackles it with cutting-edge methodology. However, more detailed explanations are necessary to bring the methodology closer to the PLOS ONE readership.

[Authors’ response:] The authors would like to thank the reviewer for his/her in-depth analysis and useful comments. Below we have listed the issues you raised and addressed them as best we could, revising the manuscript in accordance with the reviewer’s comments as needed.

Major concerns:

1. Generative Adversarial Networks should be mentioned in the Introduction, together with a few lines describing the advantages of the approach. The section "Generator Structure and Loss Function" needs to be described in more detail.

The authors stated that convolution filters allow for filtering meaningful patterns. Please explain in more detail the filtering procedure and what a meaningful pattern is.

[Authors’ response:] We added a description of GANs and their successes and applications to the introduction between Lines 99-113.

According to the reviewer’s comment, we explained "Generator Structure and Loss Function" in more detail and described the filtering procedure and meaningful patterns in Lines 185-191 as follows:

“For the next layer, we use 32 one-dimensional convolution filters which extract meaningful patterns from the sequence of amino acids. After the training process, each filter is responsible for detecting a specific pattern. By patterns we mean the existence of special sequences of amino acids at specific positions. Meaningful patterns are those which are correlated with different GO terms. When a sequence is passed through these filters, an activation map, which shows the matching score between the patterns and the input sequence, is obtained.”
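For readers less familiar with convolutional sequence models, this filtering step can be sketched as follows (a minimal illustration for this letter only; the framework, kernel size and padding are our assumptions, not the exact values used in PFP-WGAN):

    import torch
    import torch.nn as nn

    # 32 one-dimensional convolution filters scanning amino acid sequences
    # encoded as 20 channels (one channel per amino acid type).
    conv = nn.Conv1d(in_channels=20, out_channels=32, kernel_size=8, padding=4)

    # Placeholder batch of 4 sequences of length 1000 (in practice, one-hot encoded).
    sequences = torch.zeros(4, 20, 1000)
    activation_maps = conv(sequences)  # shape: (4, 32, 1001)
    # Each of the 32 maps scores how well its learned pattern matches
    # every position of the input sequence.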

2. Figure 1 needs to be described in much more detail. If possible, the stages of biological data flow should also be presented. This will facilitate method comprehension for the life scientist.

[Authors’ response:] Thank you for this valuable suggestion. We added more details about the proposed method to Fig. 1. We also described it in more detail in its caption and in the section "Generator Structure and Loss Function" in Lines 185-200.

3. The Authors should provide information on processing time. How fast is PFP-WGAN in predicting functions for 1000 proteins? How does PFP-WGAN compare to similar methods such as DeepGO?

[Authors’ response:] Thank you for raising this point. We added a comparison between the prediction times of PFP-WGAN and DeepGO in Table 2 of the revised manuscript. There is no considerable difference between the prediction times. However, as the number of terms increases, the average prediction time grows faster for DeepGO than for PFP-WGAN. Therefore, PFP-WGAN is more scalable for predicting a large number of GO terms simultaneously.

4. The Authors should provide an example of both input and output files.

[Authors’ response:] Samples of the input and output files are provided at: http://git.dml.ir/seyyedsalehi/PFP-WGAN

5. Contrary to what is claimed in the Fig. 4 description, FFPred is superior in predicting CC terms. This should be corrected and discussed.

[Authors’ response:] Thank you for pointing out this mistake in reporting the results. Although FFPred and our method have comparable performance on CC terms, after addressing the reviewers’ comments we corrected the reported results, and the previous claim now holds, as can be seen in Fig. 5.

6. Case studies on how PFP-WGAN outperforms other methods in predicting deep terms in various subontologies would be a valuable addition to the manuscript.

[Authors’ response:] Thank you for the valuable suggestion. In Fig. 4, we show the F1 measure obtained for GO terms by PFP-WGAN and DeepGO as a function of the number of available positive training samples. As mentioned in the manuscript in Lines 344-349, the improvement obtained for rare terms is more considerable. Since deeper terms have fewer positive samples, this confirms the ability of PFP-WGAN to improve the F1 measure for them.

7. The complete training and test sets need to be submitted.

[Authors’ response:] All data and code are now available at:

http://git.dml.ir/seyyedsalehi/PFP-WGAN.

In addition, as we have mentioned in the manuscript, the data for the first experiment is the original data of DeepGO [11], which is available at:

http://deepgo.bio2vec.net/data/deepgo/data.tar.gz, and the data which is used in the second experiment is the same as the one used in MTDNN [5], available from:

http://bioinf.cs.ucl.ac.uk/downloads/mtdnn.

Minor concerns:

8. Please, use term “child terms” instead of “children”.

[Authors’ response:] Thanks for your suggestion. We replaced all “children” with “child terms” as requested.

Reviewer 2

Comments to the Author

General Summary: The manuscript by Seyyedsalehi et al. proposes a novel way to train neural protein function predictors. Instead of a standard classification loss, the authors attempt to capture GO term correlations by training an adversarial network. They compare to a multi-label CNN and a multi-task multi-layer perceptron and show that their approach achieves higher Fmax.

The GO is a complicated structure with several constraints such as the true-path rule, term co-occurrences and mutual exclusivities, making it difficult to come up with a good “hand-crafted” loss function that also reflects the “realism” of a predicted GO annotation. Therefore, the concept of this study, i.e. learning if a prediction is realistic or not from data is innovative and very interesting. However, I have some concerns about the experiments, mainly the use of appropriate baselines, and the lack of interpretation of the results.

[Authors’ response:] The authors would like to thank the reviewer for his/her in-depth analysis and useful comments. The comments are answered accordingly as follows.

Major concerns:

1. One thing that troubled me while reading the manuscript is whether this model should be called a GAN. GANs are Generative models whose input is a noise vector (and an extra feature vector in the case of the conditional GANs) and their goal is typically to generate realistic-looking data, such as images, from scratch. Here, there is no noise input (Fig. 1), only a feature vector, and we are dealing with a classic classification task, where an output y is deterministically assigned to an input x. It is trained in an adversarial way, which is the novelty here, but I find that calling it a GAN is a little misleading.

[Authors’ response:] We agree with the respected reviewer that the generator of the original GAN generally needs a noise signal as an input. However, it is common in the literature to also call conditional models similar to our proposed model GANs. For example, the following works refer to their methods as GANs although their generators’ inputs do not include noise:

R1. Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

R2. Zhu, Jun-Yan, Taesung Park, Phillip Isola, and Alexei A. Efros. "Unpaired image-to-image translation using cycle-consistent adversarial networks." In Proceedings of the IEEE international conference on computer vision, pp. 2223-2232. 2017.

Without the noise signal, the conditional generator learns a deterministic mapping between the input and the output space (i.e. it generates a sample from the output space given the input). As mentioned in (R1), past conditional GANs have acknowledged using Gaussian noise, but in some situations the generator learns to simply ignore it. Therefore, similar to the work in (R1), we provide the noise in the form of a dropout layer (as shown in Fig 1).
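As a minimal sketch of this design choice (the framework and layer placement are assumptions for illustration, not the exact PFP-WGAN architecture), the dropout layer can be kept active even at prediction time so that the conditional generator remains stochastic:

    import torch
    import torch.nn as nn

    class StochasticGenerator(nn.Module):
        def __init__(self, in_dim, n_terms):
            super().__init__()
            self.fc = nn.Linear(in_dim, n_terms)
            self.drop = nn.Dropout(p=0.5)  # serves as the noise source

        def forward(self, features):
            # Leaving dropout active (module kept in train mode) even at
            # inference injects stochasticity, following the pix2pix (R1)
            # convention described above.
            return torch.sigmoid(self.fc(self.drop(features)))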

2. In lines 40-51, three previously published methods of exploiting label correlations for function prediction are mentioned (refs 31, 32, 28), but the authors do not compare to any of them, because they are “shallow”. I find that this is not a convincing argument not to compare to at least one of them. Comparing to a linear model would give a good baseline for label-correlation-based methods and provide further insight on the superiority of the proposed model. Another relevant linear label dimension reduction model that the authors could compare to is the following:

Bi W., Kwok J. (2011) Multi-label classification on tree- and DAG-structured hierarchies. In: International Conference on Machine Learning.

[Authors’ response:] Thanks for this valuable comment. We introduced “Multi-label classification on tree- and DAG-structured hierarchies” in the introduction in Lines 44-47 and reported its result in the second experiment. Because this method does not work on sequences directly and requires a feature vector as input, we included it only in the second experiment. The result is shown in Fig. 5 of the revised manuscript.

3. Related to that, in lines 88-90 the authors claim that “this is the first time that a deep model is used to explore complex relations and semantic similarities between the GO terms”. This statement is incorrect. The authors should consider the following works: a) GO2vec, Zhong et al., BMC Genomics, 2020, b) Onto2vec, Smaili et al., Bioinformatics, 2018, and c) GOAT, Duong et al., biorxiv, 2020. The last one was specifically modelling GO terms with a deep net for protein function prediction. These are very relevant works and the authors need to benchmark their method against (some of) them.

[Authors’ response:] This comment is highly appreciated. We omitted the sentence “this is the first time that a deep model is used to explore complex relations and semantic similarities between the GO terms” and added those recent works to the paper (references [34], [36] and [37]). We review GO2vec in Lines 53-58 and Onto2vec in Lines 58-63. However, it is worth mentioning that methods like Onto2vec [34] are not deep, although they are described as deep methods in some studies. Moreover, the goals of GO2vec and Onto2vec differ from ours: they use the representations of GO terms to compute semantic similarity between GO terms and, consequently, the functional similarity between proteins. However, since they derive a feature vector for a protein from its existing annotations, their methods cannot be applied to new proteins without GO annotations and therefore cannot directly be used for protein function prediction.

GOAT proposes a model to annotate proteins with GO terms and, as you have mentioned in your comment, it is completely relevant to our work. We review it in the introduction of the revised manuscript in Lines 93-98. However, as this work was proposed only recently, we found some ambiguities and inconsistencies in its code and paper, and we could not compare our method with it within the revision time. In Lines 217-220 of their bioRxiv manuscript, the authors explain that they use the dataset of DeepGO and omit proteins without annotations. We omitted such proteins from the original DeepGO dataset; nevertheless, we could not reproduce the dataset they provide at:

https://drive.google.com/drive/folders/1cuO2WtfZX2_vyk0Z8S7suYlYzpbwPl-j

In addition, the size of the dataset described in Lines 217-220 of their paper is completely different from the size of the dataset provided at the above link, from the original DeepGO dataset, and from the original DeepGO dataset filtered for un-annotated proteins (as described in their paper). Moreover, their code is not easy to use and we could not obtain its results on our datasets.

4. Line 205: The authors mention using a validation set to tune hyperparameters, but in lines 178 and 198 they report “manually” setting the hyperparameters. It has to be clarified what other values were considered for these parameters and how the “manual” decision was made. If the test set is used to decide on these parameters, then the results cannot be trusted.

[Authors’ response:] Thank you for pointing out this inconsistency. As mentioned in section “Training and Optimization”, we use 20% of the training data for validation to tune hyper-parameters. We added details in Lines 212-214 and 233-235, and emphasized using validation set to tune hyper-parameters.

5. I completely missed the interpretation of the results. Yes, the proposed model works clearly better, but there is no evidence provided that the improvement is indeed due to exploiting label correlations as the authors claim. The first thing that I would like to see is whether the label vectors that are the output of the generator are consistent with the “true path rule”.

[Authors’ response:] The main idea of our work is adding a discriminator to a deep annotation model to judge it by observing valid GO annotations in SwissProt. As explained in the manuscript, this discriminator learns a distribution over the GO annotation space. If it is trained successfully, it can reflect the correlations between GO terms by giving high scores to annotations that respect the co-occurrence and mutual exclusivity relations between GO terms. The discriminator captures these relations automatically by observing valid GO annotations. Therefore, the improvement obtained by adding this block is the result of learning GO correlations.

To evaluate the consistency of our outputs with the true-path rule, we define (Lines 288-291) and report the TPR score in Lines 349-351 and 377-379. TPR shows the expected number of conflicts in the annotation of a protein, where a conflict is a pair of terms such that the protein is annotated with only the grandchild term and not its ancestor. The TPR score of PFP-WGAN on the first dataset is 0.78, 0.01 and 0.11 for the BP, MF and CC branches, respectively, where the total numbers of grandchild-ancestor pairs in the trees of these branches are 8323, 3266 and 3106. On the second dataset, this score is 0.56, 0.11 and 0.12 for the BP, MF and CC branches, respectively, where the total numbers of grandchild-ancestor pairs are 292, 84 and 44.
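Since Eq. 10 of the manuscript is not reproduced in this letter, a sketch of how such a conflict count could be computed is given below (the data structures and function name are our assumptions for illustration):

    def tpr_conflicts(predicted_terms, grandchild_ancestor_pairs):
        """Count true-path-rule conflicts for a single protein.

        A conflict is a (grandchild, ancestor) pair where the protein is
        annotated with the grandchild term but not with its ancestor.
        """
        return sum(
            1 for child, ancestor in grandchild_ancestor_pairs
            if child in predicted_terms and ancestor not in predicted_terms
        )

    # Averaging this count over all test proteins of a branch gives the
    # expected number of conflicts reported above (e.g. 0.78 for BP).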

6. Again on the interpretation of label correlations: could the authors provide some examples of relationships between labels that their model manages to capture that are not captured by a traditional neural network? For example, there is already evidence that linear GO term correlation models can capture co-occurrence and mutual exclusivity relations between pairs of terms (ref. 28). Can the proposed method go beyond this and find more complex relationships?

[Authors’ response:] In the proposed model, we exploit a conditional GAN for protein annotation and for learning GO term relations simultaneously. This means the discriminator learns a distribution over GO annotations conditioned on the protein sequence. Therefore, it is also able to extract correlations that are valid only for specific sequence patterns.

For example, two GO terms may show no co-occurrence relation in general, yet occur concurrently for protein sequences with special amino acids at special positions.

Moreover, the proposed method is not limited to checking binary relations at all. It can discover complex relations between multiple GO terms, thanks to its discriminator which processes whole functions together using a deep network.

To evaluate the ability of the method to capture GO correlations present in the training dataset, we added a new analysis to the revised version of the manuscript in Lines 364-373. We calculate a binary heatmap which shows, for each pair of GO terms, whether the two terms appear together in at least one protein. Such GO terms are consistent, in the sense that a protein can be annotated with both of them simultaneously. We obtain this heatmap for the training dataset, for the predictions of PFP-WGAN, and for the predictions obtained by omitting the adversarial loss from PFP-WGAN. The results of this analysis are provided in the revised manuscript. As shown, the MSE between the heatmaps of PFP-WGAN and the training dataset is smaller than the MSE between the heatmaps of the simple deep network and the training dataset. This confirms that PFP-WGAN is better able to explore and utilize these relations from the training data.
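The co-occurrence heatmap and the MSE comparison described above can be sketched as follows (the variable names are assumptions for illustration, not the names used in our code):

    import numpy as np

    def cooccurrence_heatmap(labels):
        """labels: N x T binary matrix (proteins x GO terms). Returns a
        T x T binary map whose (i, j) entry is 1 iff terms i and j appear
        together in at least one protein."""
        return ((labels.T @ labels) > 0).astype(float)

    # h_train = cooccurrence_heatmap(Y_train)          # ground-truth annotations
    # h_gan   = cooccurrence_heatmap(Y_pred_pfpwgan)   # PFP-WGAN predictions
    # h_plain = cooccurrence_heatmap(Y_pred_no_adv)    # without adversarial loss
    # mse_gan   = np.mean((h_train - h_gan) ** 2)
    # mse_plain = np.mean((h_train - h_plain) ** 2)    # expected to be larger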

7. The generator is trained using a weighted sum of a standard cross-entropy loss and the novel adversarial loss proposed in this work. What is the effect of changing the weight parameter lambda_1? What if one makes it really small to only use the cross-entropy? Is then the performance gain lost? And if it is made much larger? Can the model learn to predict functions without the cross-entropy component?

[Authors’ response:] Both of these losses are necessary for our training process and carry different information. For each input sample, the binary cross-entropy compares the obtained output with the ground truth for that specific sample. The adversarial loss, in contrast, judges the pair (input sequence, obtained output) considering all inputs and their ground truths. In fact, the discriminator checks whether the obtained result has the features of a valid annotation. We added a sensitivity analysis of PFP-WGAN on this parameter in Lines 374-376 and Fig. 6. As shown in this figure, omitting either of these losses degrades performance.
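A minimal sketch of this weighted objective follows (the placement of lambda_1 and the WGAN-style adversarial term are assumptions based on the description above, not a verbatim copy of the paper's equations):

    import torch.nn.functional as F

    def generator_loss(pred, target, critic_score, lambda_1):
        # Per-sample supervision: compare the output with its own ground truth.
        bce = F.binary_cross_entropy(pred, target)
        # Adversarial term: push the critic to rate the (sequence, output)
        # pair as a realistic annotation; WGAN-style generator objective.
        adversarial = -critic_score.mean()
        return bce + lambda_1 * adversarial

Setting lambda_1 to zero recovers the plain cross-entropy baseline, while very large values leave the generator guided only by annotation realism; the sensitivity analysis in Fig. 6 shows that performance degrades at both extremes.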

Minor concerns:

8. The figures are barely readable in the pdf version. The authors should provide higher resolution versions.

[Authors’ response:] Thank you for indicating this issue. We have provided higher resolution versions of figures in the revised version.

9. Broken link that should contain the data (error 404:not found) https://github.com/ictic-bioinformatics/

[Authors’ response:] All data and code are now available at:

http://git.dml.ir/seyyedsalehi/PFP-WGAN.

In addition, as we have mentioned in the manuscript, the data for the first experiment is the original data of DeepGO [11] which is available from:

http://deepgo.bio2vec.net/data/deepgo/data.tar.gz, and the data used in the second experiment is the same as that used in MTDNN [5], available from:

http://bioinf.cs.ucl.ac.uk/downloads/mtdnn.

10. GANs have been previously used in protein function prediction to generate negative examples (Wan and Jones, 2019, biorxiv).

[Authors’ response:] The comment is highly appreciated. The mentioned reference has been carefully reviewed in Lines 111-113 and added to the references (reference [45]). However, this work is completely different from ours. It utilizes a GAN to generate samples (i.e. protein feature vectors) and perform data augmentation; these samples are then used as training data to increase the accuracy of a classifier that predicts protein functions. In contrast, we exploit an adversarial approach (a GAN) to extract GO term correlations. We do not generate any proteins; instead, we learn the mapping between proteins and functions in an adversarial manner (through the adversarial loss).

11. Shouldn’t p_m and p_r be flipped in equation 1? Typically in GANs the discriminator has high output for real examples.

[Authors’ response:] Thanks for pointing out this mistake in notation. Symbols p_r and p_m are now flipped in Eq. 1 and in all other affected equations.
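For reference, under the standard WGAN convention the corrected critic objective would read as follows (this reconstruction assumes the usual Wasserstein formulation; the exact conditioning in Eq. 1 of the paper may differ):

    \max_{D} \; \mathbb{E}_{y \sim p_r}\left[D(y)\right] - \mathbb{E}_{\hat{y} \sim p_m}\left[D(\hat{y})\right]

where p_r denotes the distribution of real annotations and p_m the distribution of model-generated annotations, so that the critic assigns higher scores to real examples.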

12. The DeepGO method is not really modelling label correlations, it is simply enforcing the “true path rule” of the GO graph.

[Authors’ response:] Thank you for this comment. We highlighted this fact in the introduction part where we have reviewed DeepGO (Lines 80-82).

DeepGO adds a maximization layer as the final step of its network. The authors of DeepGO attempt to impose this structural information during the training phase by calculating the loss function on the output of the maximization layer. However, as we discussed in the introduction, this approach cannot provide a sufficient gradient.
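As a rough sketch of what such a maximization layer computes (the data structures are assumptions for illustration; see the DeepGO paper for the exact formulation):

    def hierarchical_max(scores, children):
        """Propagate scores up the GO DAG so each term's score is at least
        the maximum of its descendants' scores (true path rule).

        scores:   dict mapping GO term -> raw predicted score
        children: dict mapping GO term -> list of direct child terms
                  (all listed children are assumed to appear in scores)
        """
        memo = {}
        def propagated(term):
            if term not in memo:
                kids = children.get(term, [])
                memo[term] = max([scores[term]] + [propagated(k) for k in kids])
            return memo[term]
        return {term: propagated(term) for term in scores}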

Decision Letter 1

Alexandros Iosifidis

22 Oct 2020

PONE-D-20-05418R1

PFP-WGAN: Protein Function Prediction by Discovering Gene Ontology Term Correlations with Generative Adversarial Networks

PLOS ONE

Dear Dr. Rabiee,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 06 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Alexandros Iosifidis

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):

Both reviewers agree that the paper has merits. Please address the comments provided by Reviewer 2 and provide a point-to-point response letter in your revision.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have satisfactorily addressed my comments. However, the data are still not accessible, as the link http://git.dml.ir/seyyedsalehi/PFP-WGAN is not active.

This should be fixed before manuscript is accepted for publishing.

Reviewer #2: The authors have addressed most of my comments, but some minor points remain:

1. Table 3 shows that the WGAN is better at capturing (linear) co-occurrence relations between terms. In lines 124-125 the authors state that this model ‘is able to model more complicated and higher level correlations that are not necessarily available in the current DAG model’. The results do not show any evidence of ability to model higher-order relations, so this statement should be removed or changed to something like ‘is able to model co-occurrence relations that are not necessarily available in the current DAG model’.

2. In the text the authors mention the use of dropout to avoid overfitting, but from their answers to comment 1 it seems they also use dropout to provide stochasticity for the generator. If this is the case, it should be mentioned in the manuscript.

3. The authors should explain the TPR score better: preferably provide a formula and explain what it means.

4. Typo in line 56

5. The figure definition is better, but still not publication-quality in my opinion. The editorial staff can perhaps provide information on how to generate high-quality figures.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Feb 25;16(2):e0244430. doi: 10.1371/journal.pone.0244430.r004

Author response to Decision Letter 1


30 Oct 2020

Reviewer 1

Comments to the Author:

The authors have satisfactorily addressed my comments. However, the data are still not accessible, as the link http://git.dml.ir/seyyedsalehi/PFP-WGAN is not active.

This should be fixed before manuscript is accepted for publishing.

[Authors’ response:] Thank you. Possibly the link was unreachable at the moment you checked it because of temporary technical issues. The problem is now solved and the link works properly.

In addition to this link, the original data is alternatively available from the previous works referenced in the paper, i.e. Ref [5] and Ref [11].

Reviewer 2

Comments to the Author:

The authors have addressed most of my comments, but some minor points remain:

1. Table 3 shows that the WGAN is better at capturing (linear) co-occurrence relations between terms. In lines 124-125 the authors state that this model ‘is able to model more complicated and higher level correlations that are not necessarily available in the current DAG model’. The results do not show any evidence of ability to model higher-order relations, so this statement should be removed or changed to something like ‘is able to model co-occurrence relations that are not necessarily available in the current DAG model’.

[Authors’ response:] Thank you for indicating this point. We changed this sentence based on your suggestion in lines 123-125.

2. In the text authors mention the use of dropout to avoid overfitting, but from their answers to comment 1 it seems they also use dropout to provide stochasticity for the generator. If this is the case, it should be mentioned in the manuscript

[Authors’ response:] Thank you for this valuable comment. In line 183 we mention that we add the dropout layer to obtain a stochastic generator, in addition to avoiding overfitting.

3. The authors should explain the TPR score better: preferably provide a formula and explain what it means

[Authors’ response:] Thank you for your suggestion. We provide a formula for the TPR in equation (10) and describe it in lines 288-295.

4. Typo in line 56

[Authors’ response:] The typo “grpah” has been changed to “graph”.

5. The figure definition is better, but still not publication-quality in my opinion. The editorial staff can perhaps provide information on how to generate high-quality figures.

[Authors’ response:] We tried to follow the rules provided on the following page:

https://journals.plos.org/plosone/s/figures

All of the original figures we submitted to PLOS ONE are in TIF format with 300 dpi resolution. All dimensions are also consistent with the mentioned rules. However, the draft file you may have received contains lower-quality images; one reason could be that this draft file is auto-generated. We will follow the editorial staff's instructions to make sure the quality of the figures is as expected in the final version of the paper.

Decision Letter 2

Alexandros Iosifidis

10 Dec 2020

PFP-WGAN: Protein Function Prediction by Discovering Gene Ontology Term Correlations with Generative Adversarial Networks

PONE-D-20-05418R2

Dear Dr. Rabiee,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Alexandros Iosifidis

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

The Reviewers are satisfied with the current version of the paper. Congratulations on the acceptance of your paper.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Alexandros Iosifidis

18 Dec 2020

PONE-D-20-05418R2

PFP-WGAN: Protein Function Prediction by Discovering Gene Ontology Term Correlations with Generative Adversarial Networks 

Dear Dr. Rabiee:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Alexandros Iosifidis

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Availability Statement

    All relevant data are publicly accessible via the following URL: http://git.dml.ir/seyyedsalehi/PFP-WGAN.

