BioSeq-Diabolo: Biological sequence similarity analysis using Diabolo

Hongliang Li; Bin Liu

doi:10.1371/journal.pcbi.1011214

. 2023 Jun 20;19(6):e1011214. doi: 10.1371/journal.pcbi.1011214

BioSeq-Diabolo: Biological sequence similarity analysis using Diabolo

Hongliang Li ¹, Bin Liu ^1,^2,^*

Editor: Maxwell Wing Libbrecht³

PMCID: PMC10313010 PMID: 37339155

Abstract

As the key for biological sequence structure and function prediction, disease diagnosis and treatment, biological sequence similarity analysis has attracted more and more attentions. However, the exiting computational methods failed to accurately analyse the biological sequence similarities because of the various data types (DNA, RNA, protein, disease, etc) and their low sequence similarities (remote homology). Therefore, new concepts and techniques are desired to solve this challenging problem. Biological sequences (DNA, RNA and protein sequences) can be considered as the sentences of “the book of life”, and their similarities can be considered as the biological language semantics (BLS). In this study, we are seeking the semantics analysis techniques derived from the natural language processing (NLP) to comprehensively and accurately analyse the biological sequence similarities. 27 semantics analysis methods derived from NLP were introduced to analyse biological sequence similarities, bringing new concepts and techniques to biological sequence similarity analysis. Experimental results show that these semantics analysis methods are able to facilitate the development of protein remote homology detection, circRNA-disease associations identification and protein function annotation, achieving better performance than the other state-of-the-art predictors in the related fields. Based on these semantics analysis methods, a platform called BioSeq-Diabolo has been constructed, which is named after a popular traditional sport in China. The users only need to input the embeddings of the biological sequence data. BioSeq-Diabolo will intelligently identify the task, and then accurately analyse the biological sequence similarities based on biological language semantics. BioSeq-Diabolo will integrate different biological sequence similarities in a supervised manner by using Learning to Rank (LTR), and the performance of the constructed methods will be evaluated and analysed so as to recommend the best methods for the users. The web server and stand-alone package of BioSeq-Diabolo can be accessed at http://bliulab.net/BioSeq-Diabolo/server/.

Author summary

Inspired by the similarities between the biological sequences and human languages, we apply the semantics analysis techniques derived from the natural language processing (NLP) to comprehensively and accurately analyse the biological sequence similarities, and propose a platform called BioSeq-Diabolo for intelligently and automatically analysing biological sequence similarities. BioSeq-Diabolo is an important updated version of BioSeq-BLM focusing on the homogeneous and heterogeneous biological sequence similarity analysing, which is beyond the reach of any exiting software tool or platform. BioSeq-Diabolo is named after a popular traditional sport in China whose components reflect its analysing procedures. When playing diabolo, the diabolo stably spins reflecting that the BioSeq-Diabolo is able to automatically construct the optimized predictors, and analyse the corresponding performance for a specific biological sequence similarity analysis task.

1. Introduction

All the information determining the structures and functions of DNA, RNA and protein sequences is in their sequences. As one of the fundamental steps in the biological structure and function studies, analysing biological sequence similarities is the foundation of many tasks in bioinformatics, such as protein remote homology detection [1], protein fold recognition [2], protein structure and function prediction [3–5], non-coding RNA and disease association identification [6,7], etc. Therefore, biological sequence similarity analysis has been attracting more and more attentions.

The biological sequence similarities can be further divided into homogeneous biological sequence similarities and heterogeneous biological sequence similarities according to different data types. For the homogeneous biological sequence similarities, the queries and the retrieved samples are homogeneous. For examples, for protein remote homology detection, both the queries and the retrieved samples are protein sequences. Therefore, the homogeneous biological sequence similarities were applied to detect the remote homology relationships [8]. For the heterogeneous biological sequence similarities, the queries and the retrieved samples are heterogeneous. For examples, for circRNA-disease association identification, the queries are circRNAs, while the retrieved samples are diseases. Therefore, the heterogeneous biological sequence similarities were applied to identify the circRNA-disease associations [9].

Because of its importance, some computational methods have been proposed to calculate the biological sequence similarities. For examples, the methods for calculating homogeneous biological sequence similarities have been proposed based on the alignments in an unsupervised manner, such as PSI-BLAST [10], HHblits [11], HMMER [12], DIAMONDScore [13], etc. Later, the supervised methods were proposed, such as HITS-PR-HHblits [14], SMI-BLAST [15]. These methods have been successfully applied to protein remote homology detection, metagenome analysis, prediction of mammalian N6-methyladenosine sites from mRNAs, etc. Because the queries and retrieved samples are belong to different data types, the alignment methods fail to analyze the heterogeneous biological sequence similarities. In this regard, the supervised methods have been proposed. These methods are based on different features, machine learning techniques, biological sequence association networks, etc. Among these methods, the approaches based on deep learning techniques have achieved the state-of-the-art performance, such as HGATLDA [16], GMNN2CD [17], DeepFRI [4], etc. These methods have been successfully applied to lncRNA–disease association prediction, circRNA–disease associations identification, protein function prediction, etc.

All these aforementioned computational methods for analysing the biological sequence similarities have greatly facilitated the developments of many important tasks in bioinformatics. However, they are still suffering from the following disadvantages: i) It is difficult for the existing methods to detect the low homogeneous biological sequence similarities among biological sequences sharing remote homology relationships; ii) The accuracy for calculating the heterogeneous biological sequence similarities is relatively low because of the various data types, whose characteristics are hard to be captured and formulated; iii) All these methods are designed for specific tasks, and their performance evaluation for the other related tasks is unexplored. Therefore, their contributions to the related fields are limited.

In this study, we are to propose a platform called BioSeq-Diabolo for analysing biological sequence similarities based on biological language semantics, which is named after a popular traditional sport in China whose components reflect the analysing procedures of BioSeq-Diabolo. When playing diabolo, the diabolo stably spins reflecting that the BioSeq-Diabolo is able to automatically construct the optimized predictors and analyse the corresponding performance for a specific biological sequence similarity analysis task. To the best knowledge of ours, it is the first platform for systematically analysing both the homogeneous and heterogeneous biological sequence similarities. Inspired by the similarities between the biological sequences and the natural languages, we are seeking the semantics analysis techniques derived from the natural language processing (NLP) to comprehensively and accurately calculate the biological sequence similarities. Previous studies have proved that the methods developed for analysing natural languages can be applied to the field of molecular biology based on Noam Chomsky’s formal language theory [18]. Hierarchical analogy between natural languages and biological sequences has been carried out by [19], where protein sequences were regarded as the “raw text” carrying high-level “meanings” of the structures and functions of proteins. Following the analogy between biological sequences and natural languages, the techniques derived from NLP have greatly contributed to the development of biological sequence analysis [20]. Furthermore, motivated by language models, the biological language models have been proposed [21], which can be applied to residue-level and sequence-level biological sequence analysis tasks. The protein language model ProtTrans [22] is important for understanding the language of life through self-supervised learning. As shown in Fig 1, the biological sequence similarity analysis is particularly similar with the semantics similarity analysis in NLP. Sentences determine their semantics of the corresponding natural language, which are the keys for measuring and judging the similarity between sentences. As the sentences of “the books of life” [18], biological sequences determine their structures and functions, which can be considered as the “semantics” of biological sequences. The computational methods for analysing the semantics of natural languages are mature after being studied for decades, giving us an opportunity to apply these methods for improving the performance of biological sequence similarity analysis. In this regard, a total of 27 methods derived from semantics analysis in NLP have been incorporated into BioSeq-Diabolo to analyse the biological sequence similarities, which capture the relevance of biological sequences from distribution, representation and interaction perspectives. Based on these methods, we can uncover the hidden biological language semantics so as to more accurately analyze the biological sequence similarities. Furthermore, the Learning to Rank (LTR) [8,9,23] was employed to integrate the results of different methods in a supervised manner.

Fig 1 — The similarities between biological sequence similarity analysis (a) and natural language semantics analysis (b). Sentences determine their semantics of the corresponding natural language, while biological sequences determine their structures and functions. The biological sequence similarity analysis is particularly similar with the semantics similarity analysis in natural language processing.

In order to help the researchers to study the computational methods for different aims and tasks, we followed the pipelines of intelligent software tools (BioSeq-Analysis2.0 [24], GAIN [25], BioSeq-BLM [21], iLearnPlus [26], etc) to construct the BioSeq-Diabolo platform. The users only need to input the homogeneous or heterogeneous biological sequence data, BioSeq-Diabolo will automatically construct the computational predictors for biological sequence similarity analysis, evaluate the performance, and analyze the results in different views. BioSeq-BLM [21] is a platform for analyzing biological sequences based on biological language models, which is able to automatically construct computational predictors for classification and sequence labelling tasks. BioSeq-Diabolo is an important updated version of BioSeq-BLM focusing on the homogeneous and heterogeneous biological sequence similarity analysing, which is beyond the reach of any exiting software tool or platform. The comparisons between BioSeq-Diabolo and BioSeq-BLM are listed in Table 1. The biological sequence features extracted by BioSeq-BLM can be fed into BioSeq-Diabolo, and the biological sequence similarity scores calculated by BioSeq-Diabolo can also be used as the input features of BioSeq-BLM.

Table 1. The comparisons between BioSeq-Diabolo and BioSeq-BLM.

Descriptions	BioSeq-BLM	BioSeq-Diabolo
Biological sequence analysis tasks	Classification and sequence labelling	Sequence similarity analysis
Number of classification algorithms	8	0
Number of sequence labelling algorithms	9	0
Number of sequence similarities analysing algorithms	0	27
Categories of algorithms used for result analysis	4	6
Support integration or not	No	Yes
Support GPU-accelerate or not	Yes	Yes

Open in a new tab

2. Materials and methods

2.1. Biological sequence similarity analysis tasks

Biological sequence similarities can be further divided into homogeneous biological sequence similarities and heterogeneous biological sequence similarities according to the data types of the target biological sequences. The aim of homogeneous biological sequence similarity analysis is to detect the similarities between two biological sequences with the same data type. For examples, Fig 2A shows the process of analysing the RNA sequence similarity, which is a homogeneous biological sequence similarity analysis task. Fig 2B shows the process of identifying the associations between non-coding RNAs and diseases, which is a heterogeneous biological sequence similarity analysis task.

Fig 2 — (a) Non-coding RNA similarity analysis, which is a homogeneous biological sequence analysis task. (b) Non-coding RNA and disease association identification, which is a heterogeneous biological sequence analysis task. (c) Text matching task, which is a homogeneous language analysis task. (d) Machine translation task, which is a heterogeneous language analysis task. Thick dotted lines indicate high similarity, and thin dotted lines indicate low similarity. The same shape represents homogeneous association, while different shapes represent heterogeneous association.

2.2. Biological sequence similarity analysis based on biological language semantics

Semantics represent the meanings of sentences in a particular context. Even the same sentence in different contexts would have different semantics. In order to understand the semantics, the sentences are represented as abstract representation considering both local and global information of the context, and then the advanced natural language processing techniques are performed on the abstract representation to automatically understand the semantics. The semantics analysis plays a key role in NLP, which is important for machine translation, information retrieval, text generation, etc. As the language of “the books of life”, biological sequences can be considered as the sentences containing all the information for determining their semantics (the structures and functions of the biological sequences). Therefore, the ideas and techniques derived from semantics analysis can be applied to analyse the “semantics” of biological sequences (see Figs 1, 2C and 2D). 27 different biological sequence similarity analysis methods were incorporated into BioSeq-Diabolo to analyse the biological sequence similarities. These methods can be divided into 3 categories, including distribution methods, representation methods and interaction methods. The following sections will introduce these 27 methods and their embeddings.

Embeddings

Embeddings transfer biological sequences to dense vectors via incorporating local and global information (Fig 3A). Based on the effective embeddings, the biological sequence similarity analysis methods can capture hidden patterns from biological sequences, which is fundamental for learning biological language semantics. Intelligent platforms are able to generate different embeddings, such as BioSeq-Analysis2.0 [24], BioSeq-BLM [21], iLearnPlus [26], etc. These platforms and tools represent the biological sequences based on different techniques and theories.

Distribution methods

Distribution methods fully consider the spatial correlation of input pairs, and show good generalization ability for modelling different types of data [27]. After encoded by the embedding layers (Fig 3A), the biological sequences are projected into the high-dimensional space by the mapping layer, and different algorithms are performed on this high dimensional space to calculate the probabilities, treated as the similarity scores (Fig 3B). There are 9 models in distribution methods. For more detailed information of the distribution methods, please refer to S1 Table.

Representation methods

The representation methods employ the Siamese architecture to encode the sentences [27], which can be applied to analyse the biological sequence similarities. Based on Siamese architecture in Contrastive Learning, representation methods efficiently learn the hidden difference and connection of homogeneous biological sequences. Representation methods apply the symmetrical representation layers to extract biological language semantics and matching layers to conduct semantic matching operations based on metric learning (Fig 3C). After semantic matching, the similarity scores are generated by calculating the posteriori probabilities. There are 4 models in representation methods. For more information of the representation methods, please refer to S2 Table.

Interaction methods

Interaction methods employ the hierarchical deep architecture to learn the semantics from the local interaction matrix of query and retrieved documents [27], which is suitable for comprehensively learning the associations between biological sequences. Based on biological sequences and their embeddings, the relevance of two biological sequences is captured by the interaction layer in interaction methods, and the following representation layers learn biological language semantics from interaction matrix (Fig 3D). The similarity scores are generated by calculating the posteriori probabilities. There are 14 models in interaction methods. For more information of interaction methods, please refer to S3 Table.

2.3. BioSeq-Diabolo

Because the biological sequence similarity analysis tasks are diverse with different features and input data, it is often very difficult for the researchers to select the suitable computational methods and techniques for their own tasks and aims. In this regard, an intelligent platform which can automatically construct computational methods for biological sequence similarity analysis, and select the optimized methods for specific tasks is highly required. In this study, we construct a powerful platform called BioSeq-Diabolo for automatically analysing biological sequence similarities based on biological language semantic (Fig 4). The users only need to input the biological sequence data and the parameters, BioSeq-Diabolo will intelligently identify the specific tasks (homogeneous or heterogeneous biological sequence similarity analysis tasks) based on the input embeddings of the biological sequences and the parameters provided by the users, and then will accurately analyse the biological sequence similarities based on biological language semantics. BioSeq-Diabolo will integrate the biological sequence similarities in a supervised fashion by using Learning to Rank (LTR) [28]. Finally, the performance of different constructed methods will be evaluated and analysed.

Fig 4 — BioSeq-Diabolo is named after a popular traditional sport in China whose components reflect its analysing procedures. When playing diabolo, the diabolo stably spins reflecting that the BioSeq-Diabolo is able to automatically construct the optimized predictors, and analyse the corresponding performance for a specific biological sequence similarity analysis task.

Learning to rank

There are 27 biological sequence similarity analysis methods in BioSeq-Diabolo, leading to 1.09◊10²⁸ (27!) different combinations. Furthermore, many embeddings are proposed. For example, BioSeq-BLM [21] can generate 155 different embeddings of biological sequences based on different biological language models. Therefore, the number of base models generated by BioSeq-Diabolo via different biological sequence similarity analysis methods and embeddings is even unlimited. To fully consider the advantages of these base models, we employed Learning to Rank (LTR) [29] to integrate these methods following ProtDec-LTR [8], GOLabeler [23], iCircDA-LTR [9] and DrugE-Rank [30]. Learning to Rank ranks the relevance and importance between queries and documents in a supervised manner. With its help, the biological language semantics will be efficiently explored and uncovered. For more information of LTR, please refer to S1 Text.

Performance evaluation

Four performance measures were used to evaluate the performance of different methods generated by BioSeq-Diabolo, including area under the Receiver Operating Characteristic Curve (AUC) [31], area under the Precision-Recall Curve (AUPR) [32], Normalized Discounted Cumulative Gain (NDCG) [33], ROC [34], F_max [35] and S_min [35]. Please refer to S2 Text for details.

Result analysis

The result analysis module in BioSeq-Diabolo provides multiple visualization functions for output results so as to intuitively show the performance of the predictors. The result analysis module mainly includes the following visualization functions: 1) Performance visualization. BioSeq-Diabolo provides Receiver Operating Characteristic (ROC) curve, Precision-Recall (PR) curve, Radar map and histogram to visualize the performance of the constructed predictors. 2) Similarity network. BioSeq-Diabolo uses NetworkX [36] library to draw the similarity network according to the predictions of the best generated predictors. 3) Complementarity and relevance map. The Heatmap is automatically generated to show the Pearson Correlation [37] of similarity scores predicted by generated predictors. 4) Score distribution. The distribution histogram for similarity scores of constructed predictors is drawn to compare their interval differences. 5) Contribution of integrated predictors. The Learning to Rank calculates the contribution weights of integrated predictors, and the pie chart is drawn to show the contribution of each predictor. 6) Comparison between embeddings and similarity scores. T-distributed Stochastic Neighbor Embedding (TSNE) [38] is applied to show the difference between input embeddings and similarity scores based on dimension reduction.

To summarize, BioSeq-Diabolo constructs various predictors based on biological language semantics, and integrates them to generate the best predictor for specific biological sequence similarity analysis task. The performance of the predictor is then evaluated, and the visualization of the prediction results improves the interpretability of the constructed predictors. All these complicated processes will be automatically conducted by using BioSeq-Diabolo by using only one command line.

2.4. Web server and stand-alone package of BioSeq-Diabolo

Based on the flowchart of BioSeq-Diabolo (S1 Fig), the web server and stand-alone package of BioSeq-Diabolo were developed to facilitate the researchers for analyzing the biological sequence similarities based on biological language semantics.

Web server

We provide detailed tutorial and documents to explain each procedure and option of the web server, which can be accessed at http://bliulab.net/BioSeq-Diabolo/. More specifically, after selecting biological sequence similarity analysis task, choosing biological sequence similarity calculation methods and setting using Learning to Rank or not, the users will see the input page (see http://bliulab.net/BioSeq-Diabolo/server/). The users need to select the parameters for each module in this page, and input biological sequence data in the BLS format. Please refer to the document of BioSeq-Diabolo for detailed descriptions of BLS format. After submitting the form of the input page, the web server will conduct the calculation according to the pipelines. When the calculation is complete, the results will be displayed in the result page (see http://bliulab.net/BioSeq-Diabolo/server/graph_ssc/rank/submit/result/user). The result page includes visualization results, downloadable files and the similarity scores predicted by BioSeq-Diabolo. The command lines of stand-alone package are also provided in the result page so as to help the users to perform the analysis by using their own computing resources via stand-alone package of BioSeq-Diabolo.

Stand-alone package

Stand-alone package ensures that users can make full use of their own computing resources. There are four modules in the BioSeq-Diabolo stand-alone package: 1) biological sequence similarity analysis: ‘sesica_arc.py’, ‘sesica_clf.py’, scripts in ‘arc’ folder and scripts in ‘clf’ folder; 2) Learning to Rank: ‘sesica_rank.py’ and scripts in ‘rank’ folder; 3) Performance evaluation: scripts in ‘utils’ folder; 4) Result analysis: ‘sesica_plot.py’ and scripts in ‘plot’ folder. Additionally, the multiprocessing technique and GPU acceleration are employed to improve its execution efficiency. The Scikit-learn [39] and MatchZoo [40] python libraries are used to implement the stand-alone package. As discussed in results and discussion section, compilated biological sequence similarity analysis tasks can be easily solved by only using one command line with the help of the stand-alone package of BioSeq-Diabolo. Please refer to the README file in the stand-alone package for more details of the three examples shown in results and discussion section.

3. Results and discussion

BioSeq-Diabolo is able to facilitate the development of the computational methods for biological sequence similarity analysis. In this section, we will show the effects of BioSeq-Diabolo for automatically developing computational predictors for solving three important biological sequence similarity analysis tasks, including one homogeneous biological sequence similarity analysis task (protein remote homology detection) and two heterogeneous biological sequence similarity analysis tasks (circRNA-disease associations and protein function annotation). For each task, BioSeq-Diabolo automatically constructs various computational predictors, and selects the best one for the following analysis (see the README file in the stand-alone package of BioSeq-Diabolo).

3.1. BioSeq-Diabolo facilitates the protein remote homology detection

Protein remote homology detection is one of fundamental research tasks in protein sequence analysis, which has been extensively studied for decades [1]. However, because proteins with remote homology relationship share very low sequence similarities (<30%), the existing computational predictors fail to accurately detect the protein remote homologous. Therefore, new concepts and techniques are desired to solve this challenging problem. Here, we investigate if BioSeq-Diabolo can construct powerful computational predictors for protein remote homology detection or not. Trained and evaluated on a benchmark dataset constructed based on SCOP1.75 database [41] (http://bliulab.net/BioSeq-Diabolo/download/), BioSeq-Diabolo constructs various predictors, and analyses their performance by using the following command line:

python sesica_arc.py -data_type homo -bmk_vec bmk_vec.txt -bmk_label pos_label.txt neg_label.txt -arc dssm cdssm drmm drmmtks match_lstm duet knrm -metric roc@50

The performance analysis results of the top 5 best predictors constructed by BioSeq-Diabolo are shown in Fig 5, from which we can see that these predictors perform well for protein remote homology detection, and they are complementary. Is it possible to combine these complementary methods to further improve the predictive performance? In order to answer this question, these methods are combined by a supervised framework Learning to Rank (LTR) by using BioSeq-Diabolo, and the corresponding results are shown in Table 2 along with the performance of the other state-of-the-art predictors in this field, including PSI-BLAST [10], DELTA-BLAST [42], HHblits [11] and HHsearch [43]. From this table we can see that the predictor automatically constructed by BioSeq-Diabolo outperforms all the other approaches in terms of ROC50. These results are not surprising because the semantics analysis techniques derived from natural language processing incorporated in BioSeq-Diabolo are able to efficiently capture the insightful semantics of protein sequences, which are critical for protein remote homology detection [44].

Table 2. The performance of BioSeq-Diabolo compared with competing methods for protein remote homology detection.

Methods	ROC50^c
PSI-BLAST^a	87.40%
DELTA-BLAST^a	88.49%
HHblits^a	90.52%
HHsearch^a	90.45%
BioSeq-Diabolo^b	92.00%

Open in a new tab

^a The results of the competing methods were obtained from [45]. These competing methods and BioSeq-Diabolo were evaluated on the same test dataset. Therefore, these results can be directly compared

^b The results of the best predictor constructed by BioSeq-Diabolo (integrating the top 5 best predictors by using Learning to Rank). The input protein sequence embeddings are based on 2-Kmer [46] by using BioSeq-BLM [21]

^c The performance evaluation indicators were described in S2 Text and details of the reported experiments were described in S3 and S4Texts.

Estimation of time consumption is important for a user-oriented method. Therefore, we evaluated the running time of BioSeq-Diabolo for protein remote homology detection on the proteins randomly extracted from SCOP1.75 database [41] (http://bliulab.net/BioSeq-Diabolo/download/) with the following command line:

python sesica_clf.py -data_type homo -bmk_vec bmk_vec.txt -bmk_label pos_label.txt neg_label.txt -clf svm rf ert knn mlp -metric roc@1

With the increase of the number of samples, the training time of BioSeq-Diabolo increases obviously, while its test time is roughly linearly with the number of test samples (see Fig 6). BioSeq-Diabolo achieved an ROC1 of 0.927 trained with 50000 samples for predicting 50000 test samples. The corresponding training time and test time are 3580 seconds and 243 seconds, respectively. For large scale analysis, we suggest the users to use the stand-alone package. And when the number of samples is the same, the time required for different methods to get the result from scratch are show in S2 Fig. Interaction methods and representation methods (both are based on deep-learning) consume more time than distribution methods (based on traditional machine learning methods).

Fig 6 — The test time was evaluated with the corresponding BioSeq-Diabolo trained with the same number of samples. These experiments were performed on Intel(R) Xeon(R) CPU E5-2660 v3 (2.60 GHz with 10 cores) and memory of 64 G.

3.2. BioSeq-Diabolo facilitates the circRNA-disease association identification

CircRNAs are major regulators in various cellular processes, and associated with the pathogenesis of human diseases. Exactly identifying circRNA-disease associations is critical for researching disease mechanism, and developing corresponding drug targets. Because the existing computational methods ignore the fact that there is a high false positive rate in the negative set, most of existing computational predictors fail to detect potential missing relationship between circRNAs and diseases [9]. Therefore, new approaches for detecting the diseases associated with circRNAs are urgently needed. Here, we applied BioSeq-Diabolo to improve the performance of circRNA-disease association identification. Trained and evaluated on the benchmark dataset [9], BioSeq-Diabolo constructs numerous predictors, and selects the best predictor by using the following command line:

python sesica_clf.py -data_type hetero -bmk_vec_a bmk_circRNA.txt -bmk_vec_b bmk_disease.txt -bmk_label pos_label.txt neg_label.txt -clf svm rf ert knn mnb gbdt goss dart mlp -metric auc -gs_mode 2

The performance analysis results of the top 5 best predictors constructed by BioSeq-Diabolo are shown in Fig 7, indicating that these predictors can accurately identify circRNA-disease associations, and they are complementary. In this regard, we employ Learning to Rank to integrate these 5 complementary predictors, and construct the best predictor by using BioSeq-Diabolo. The performance of the best predictor constructed by BioSeq-Diabolo is shown in Table 3 along with the performance of the other state-of-the-art predictors in this field, including gcForest [47], DWNN-RLS [48], GBDT [49] and iCircDA-LTR [9]. From this table we can see that the predictor automatically constructed by BioSeq-Diabolo outperforms all the other approaches in terms of NDCG and NDCG@10. Based on biological language semantics and Learning to Rank, BioSeq-Diabolo takes the global relationships among circRNAs and all candidate diseases into consideration, which is the main reason for its better performance.

Table 3. The performance of BioSeq-Diabolo compared with competing methods for circRNA-disease association identification.

Methods	NDCG^c	NDCG@10^c
gcForest^a	0.4406	0.3767
DWNN-RLS^a	0.5466	0.4911
GBDT^a	0.5736	0.5280
iCircDA-LTR^a	0.5879	0.5426
BioSeq-Diabolo^b	0.6009	0.5492

Open in a new tab

^a The results of the competing methods were obtained from [9]. These competing methods and BioSeq-Diabolo were evaluated on the same test dataset. Therefore, these results can be directly compared

^b The results of the best predictor constructed by BioSeq-Diabolo (integrating the top 5 best predictors by using Learning to Rank). The input CircRNA embeddings are extracted by BioSeq-Analysis2.0 [24] with PseKNC [50] (parameter φ is set as 0.5). The input disease embeddings are represented by semantic similarity score matrix reported in [9]. We concatenated circRNA features and disease features, fed into BioSeq-Diabolo for further analysis.

^c The performance evaluation indicators were described in S2 Text and details of the reported experiments were described in S3 and S4 Texts.

3.3. BioSeq-Diabolo facilitates the protein function annotation

Because only a few proteins have experimentally validated functional annotations, the computational annotations of protein functions have become a crucial step for understanding of the complex mechanisms of living cells [51]. Although some computational predictors have been proposed to predict protein functions, their performance is still limited prevented by the undefined relationships between protein sequences and their multiple hierarchically organized labels [52]. BioSeq-Diabolo integrates biological sequence semantics from different perspectives, facilitating protein function prediction. Trained and evaluated on a benchmark dataset constructed based on CAFA3 database [52] (http://bliulab.net/BioSeq-Diabolo/download/), BioSeq-Diabolo constructs various predictors, and analyses their performance by using the following command line:

python sesica_clf.py -data_type hetero -bmk_vec_a cc_bmk_vec_a.txt -bmk_vec_b cc_bmk_vec_b.txt -bmk_label pos_label.txt neg_label.txt -clf svm rf ert knn mnb gbdt goss dart mlp -metric aupr -gs_mode 2

The performance analysis results of the top 5 best predictors constructed by BioSeq-Diabolo are shown in Fig 8. In the same way, BioSeq-Diabolo integrates the top 5 best predictors by using Learning to Rank. The integrated predictor is highly comparable with the other state-of-the-art predictors, including DIAMONDScore [13], DeepGO [35] and DeepGOCNN [53] (see Tables 4, S4 and S5). These experimental results show that BioSeq-Diabolo is also able to facilitate the development of the computational predictors for protein function prediction.

Table 4. The performance of BioSeq-Diabolo compared with competing methods in Cellular Component Ontology (CCO) for protein function annotation.

Methods	AUPR^c	Fmax^c	Smin^c
Naive¹	0.483	0.541	8.466
DIAMONDScore¹	0.500	0.523	8.347
DeepGO¹	0.446	0.503	5.791
DeepGOCNN¹	0.523	0.582	5.234
BioSeq-Diabolo^b	0.514	0.577	5.569

Open in a new tab

^a The results of the competing methods were obtained from [54]. These competing methods and BioSeq-Diabolo were evaluated on the same test dataset. Therefore, these results can be directly compared

^b The result of the best predictor constructed by BioSeq-Diabolo (integrating top 5 best predictor by using Learning to Rank). The input protein sequence embeddings are extracted by BioSeq-BLM [21] with Position-Specific method [55]. The input GO term embeddings are represented by label embedding matrix reported in [54]

^c The performance evaluation indicators were described in S2 Text and details of the reported experiments were described in S3 and S4 Texts.

4. Conclusion

Biological sequence similarity analysis plays a critical role in the biological structure and function studies. As the language of “the books of life”, like natural language, biological sequence is information-complete, and has its own semantics. Based on biological language semantics, we can better understand the semantics of “the books of life”. The platform BioSeq-Diabolo automatically analyses biological sequence similarities based on biological language semantic for different tasks in bioinformatics. Experimental results show that the predictors automatically generated by BioSeq-Diabolo even outperform the state-of-the-art predictors in the fields of protein remote homology detection, circRNA-disease association identification and protein function annotation, indicating that BioSeq-Diabolo will provide new concepts and techniques for biological sequence analysis, and facilitate the development of new computational predictors for biological sequence analysis. Although BioSeq-Diabolo incorporates some state-of-the-art biological sequence similarity analysis methods, we will focus on integrating more powerful algorithms into BioSeq-Diabolo in our future studies. We believe that BioSeq-Diabolo will play important roles in biological sequence analysis.

Supporting information

S1 Table. The distribution methods and their descriptions.

(DOCX)

Click here for additional data file.^{(16.8KB, docx)}

S2 Table. Representation methods and their descriptions.

(DOCX)

Click here for additional data file.^{(15.4KB, docx)}

S3 Table. Representation methods and their descriptions.

(DOCX)

Click here for additional data file.^{(18.5KB, docx)}

S4 Table. The performance of BioSeq-Diabolo compared with competing methods in Molecular Function Ontology (MFO) for protein function annotation.

(DOCX)

Click here for additional data file.^{(18.2KB, docx)}

S5 Table. The performance of BioSeq-Diabolo compared with competing methods in Biological Process Ontology (BPO) for protein function annotation.

(DOCX)

Click here for additional data file.^{(18.2KB, docx)}

S1 Fig. The flowchart of BioSeq-Diabolo.

The web server and stand-alone package of BioSeq-Diabolo were developed based on this procedure.

(TIF)

Click here for additional data file.^{(155.3KB, tif)}

S2 Fig. The running time required for 9 different methods to get the result from scratch.

The running time contained training stage and test stage (50000 training examples and 50000 test samples). These experiments were performed on Intel(R) Xeon(R) CPU E5-2660 v3 (2.60 GHz with 10 cores) and memory of 64 G.

(TIF)

Click here for additional data file.^{(180.6KB, tif)}

S1 Text. Learning to Rank.

(DOCX)

Click here for additional data file.^{(16.1KB, docx)}

S2 Text. Performance evaluation indicators.

(DOCX)

Click here for additional data file.^{(22.3KB, docx)}

S3 Text. Method optimization and Result visualization.

(DOCX)

Click here for additional data file.^{(13.5KB, docx)}

S4 Text. The details of the reported experiments.

(DOCX)

Click here for additional data file.^{(23.7KB, docx)}

Data Availability

The datasets used to demonstrate the effectiveness of BioSeq-Diabolo are publicly accessed at http://bliulab.net/BioSeq-Diabolo/download/. The BioSeq-Diabolo web server is freely accessible via http://bliulab.net/BioSeq-Diabolo/, and the source code of stand-alone package of BioSeq-Diabolo can be available at https://github.com/Zimiao1025/Sesica under the BSD-2-Clause license.

Funding Statement

This work was supported by the National Natural Science Foundation of China (No. U22A2039, 62271049 and 62250028) to (BL). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1.Bepler T, Berger B. Learning the protein language: Evolution, structure, and function. Cell Systems. 2021;12(6):654–69.e3. doi: 10.1016/j.cels.2021.05.017 [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Jo T, Hou J, Eickholt J, Cheng J. Improving Protein Fold Recognition by Deep Learning Networks. Scientific Reports. 2015;5(1):17573. doi: 10.1038/srep17573 [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Tunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Žídek A, et al. Highly accurate protein structure prediction for the human proteome. Nature. 2021;596(7873):590–6. doi: 10.1038/s41586-021-03828-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Gligorijević V, Renfrew PD, Kosciolek T, Leman JK, Berenberg D, Vatanen T, et al. Structure-based protein function prediction using graph convolutional networks. Nature Communications. 2021;12(1):3168. doi: 10.1038/s41467-021-23303-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Callaway E. ’It will change everything’: DeepMind’s AI makes gigantic leap in solving protein structures. Nature. 2020;588(7837):203–4. Epub 2020/12/02. doi: 10.1038/d41586-020-03348-4 . [DOI] [PubMed] [Google Scholar]
6.Li J, Zhang S, Wan Y, Zhao Y, Shi J, Zhou Y, et al. MISIM v2.0: a web server for inferring microRNA functional similarity based on microRNA-disease associations. Nucleic Acids Research. 2019;47(W1):W536–W41. doi: 10.1093/nar/gkz328 [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Hu Y, Chen C-h, Ding Y-y, Wen X, Wang B, Gao L, et al. Optimal control nodes in disease-perturbed networks as targets for combination therapy. Nature Communications. 2019;10(1):2180. doi: 10.1038/s41467-019-10215-y [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Chen J, Guo M, Li S, Liu B. ProtDec-LTR2.0: an improved method for protein remote homology detection by combining pseudo protein and supervised Learning to Rank. Bioinformatics. 2017;33(21):3473–6. doi: 10.1093/bioinformatics/btx429 [DOI] [PubMed] [Google Scholar]
9.Wei H, Xu Y, Liu B. iCircDA-LTR: identification of circRNA–disease associations based on Learning to Rank. Bioinformatics. 2021;37(19):3302–10. doi: 10.1093/bioinformatics/btab334 [DOI] [PubMed] [Google Scholar]
10.Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25(17):3389–402. doi: 10.1093/nar/25.17.3389 [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature Methods. 2012;9(2):173–5. doi: 10.1038/nmeth.1818 [DOI] [PubMed] [Google Scholar]
12.Eddy SR. Accelerated Profile HMM Searches. PLoS computational biology. 2011;7(10):e1002195. Epub 2011/11/01. doi: 10.1371/journal.pcbi.1002195 ; PubMed Central PMCID: PMC3197634. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods. 2021;18(4):366–8. doi: 10.1038/s41592-021-01101-x [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Liu B, Jiang S, Zou Q. HITS-PR-HHblits: protein remote homology detection by combining PageRank and Hyperlink-Induced Topic Search. Briefings in bioinformatics. 2020;21(1):298–308. doi: 10.1093/bib/bby104 [DOI] [PubMed] [Google Scholar]
15.Jin X, Liao Q, Wei H, Zhang J, Liu B. SMI-BLAST: a novel supervised search framework based on PSI-BLAST for protein remote homology detection. Bioinformatics. 2021;37(7):913–20. doi: 10.1093/bioinformatics/btaa772 [DOI] [PubMed] [Google Scholar]
16.Zhao X, Zhao X, Yin M. Heterogeneous graph attention network based on meta-paths for lncRNA–disease association prediction. Briefings in bioinformatics. 2022;23(1):bbab407. doi: 10.1093/bib/bbab407 [DOI] [PubMed] [Google Scholar]
17.Niu M, Zou Q, Wang C. GMNN2CD: identification of circRNA–disease associations based on variational inference and graph Markov neural networks. Bioinformatics. 2022;38(8):2246–53. doi: 10.1093/bioinformatics/btac079 [DOI] [PubMed] [Google Scholar]
18.Searls DB. The language of genes. Nature. 2002;420(6912):211–7. doi: 10.1038/nature01255 [DOI] [PubMed] [Google Scholar]
19.Ganapathiraju M, Balakrishnan N, Reddy R, Klein-Seetharaman J. Computational Biology and Language. In: Cai Y, editor. Ambient Intelligence for Scientific Discovery: Foundations, Theories, and Systems. Berlin, Heidelberg; 2005. p. 25–47. [Google Scholar]
20.Gimona M. Protein linguistics—a grammar for modular protein assembly? Nature Reviews Molecular Cell Biology. 2006;7(1):68–73. doi: 10.1038/nrm1785 [DOI] [PubMed] [Google Scholar]
21.Li H-L, Pang Y-H, Liu B. BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models. Nucleic Acids Research. 2021;49(22):e129-e. doi: 10.1093/nar/gkab829 [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, et al. ProtTrans: Towards Cracking the Language of Lifes Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021:1-. doi: 10.1109/TPAMI.2021.3095381 [DOI] [PubMed] [Google Scholar]
23.You R, Zhang Z, Xiong Y, Sun F, Mamitsuka H, Zhu S. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics. 2018;34(14):2465–73. doi: 10.1093/bioinformatics/bty130 [DOI] [PubMed] [Google Scholar]
24.Liu B, Gao X, Zhang H. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Research. 2019;47(20):e127-e. doi: 10.1093/nar/gkz740 [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Bass JIF, Diallo A, Nelson J, Soto JM, Myers CL, Walhout AJM. Correction: Corrigendum: Using networks to measure similarity between genes: association index selection. Nature Methods. 2014;11(3):349-. doi: 10.1038/nmeth0314-349c [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Chen Z, Zhao P, Li C, Li F, Xiang D, Chen Y-Z, et al. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Research. 2021;49(10):e60-e. doi: 10.1093/nar/gkab122 [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Chandrasekaran D, Mago V. Evolution of Semantic Similarity—A Survey. ACM Comput Surv. 2021;54(2):41. doi: 10.1145/3440755 [DOI] [Google Scholar]
28.Wu Q, Burges CJC, Svore KM, Gao J. Adapting boosting for information retrieval measures. Information Retrieval. 2010;13(3):254–70. doi: 10.1007/s10791-009-9112-1 [DOI] [Google Scholar]
29.Burges C, Shaked T, Renshaw E, Lazier A, Deeds M, Hamilton N, et al., editors. Learning to rank using gradient descent. Proceedings of the 22nd international conference on Machine learning; 2005. [Google Scholar]
30.Yuan Q, Gao J, Wu D, Zhang S, Mamitsuka H, Zhu S. DrugE-Rank: improving drug–target interaction prediction of new candidate drugs or targets by ensemble learning to rank. Bioinformatics. 2016;32(12):i18–i27. doi: 10.1093/bioinformatics/btw244 [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology. 1975;12(4):387–415. doi: 10.1016/0022-2496(75)90001-2 [DOI] [Google Scholar]
32.Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters. 2006;27(8):861–74. doi: 10.1016/j.patrec.2005.10.010 [DOI] [Google Scholar]
33.Järvelin K, Kekäläinen J. Cumulated gain-based evaluation of IR techniques. ACM Trans Inf Syst. 2002;20(4):422–46. doi: 10.1145/582415.582418 [DOI] [Google Scholar]
34.Gribskov M, Robinson NL. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers & Chemistry. 1996;20(1):25–33. doi: 10.1016/s0097-8485(96)80004-0 [DOI] [PubMed] [Google Scholar]
35.Kulmanov M, Khan MA, Hoehndorf R. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics. 2018;34(4):660–8. doi: 10.1093/bioinformatics/btx624 [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Hagberg AA, Schult DA, Swart P, editors. Exploring Network Structure, Dynamics, and Function using NetworkX. Proceedings of the 7th Python in Science conference; 2008. [Google Scholar]
37.Saccenti E, Hendriks MHWB, Smilde AK. Corruption of the Pearson correlation coefficient by measurement error and its estimation, bias, and correction under different error models. Scientific Reports. 2020;10(1):438. doi: 10.1038/s41598-019-57247-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nature Communications. 2019;10(1):5416. doi: 10.1038/s41467-019-13056-x [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825–30. [Google Scholar]
40.Guo J, Fan Y, Ji X, Cheng X. MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching. Proceedings of the 42nd International ACM SIGIR Conference on Research Development in Information Retrieval. 2019. [Google Scholar]
41.Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247(4):536–40. Epub 1995/04/07. doi: 10.1006/jmbi.1995.0159 . [DOI] [PubMed] [Google Scholar]
42.Boratyn GM, Schäffer AA, Agarwala R, Altschul SF, Lipman DJ, Madden TL. Domain enhanced lookup time accelerated BLAST. Biology Direct. 2012;7(1):12. doi: 10.1186/1745-6150-7-12 [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Söding J. Protein homology detection by HMM–HMM comparison. Bioinformatics. 2005;21(7):951–60. doi: 10.1093/bioinformatics/bti125 [DOI] [PubMed] [Google Scholar]
44.Yu L, Tanwar Deepak K, Penha Emanuel Diego S, Wolf Yuri I, Koonin Eugene V, Basu Malay K. Grammar of protein domain architectures. Proceedings of the National Academy of Sciences. 2019;116(9):3636–45. doi: 10.1073/pnas.1814684116 [DOI] [PMC free article] [PubMed] [Google Scholar]
45.Shao J, Chen J, Liu B. ProtRe-CN: Protein Remote Homology Detection by Combining Classification Methods and Network Methods via Learning to Rank. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2021:1-. doi: 10.1109/TCBB.2021.3108168 [DOI] [PubMed] [Google Scholar]
46.Liu B, Liu F, Wang X, Chen J, Fang L, Chou K-C. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research. 2015;43(W1):W65–W71. doi: 10.1093/nar/gkv458 [DOI] [PMC free article] [PubMed] [Google Scholar]
47.Zeng X, Zhong Y, Lin W, Zou Q. Predicting disease-associated circular RNAs using deep forests combined with positive-unlabeled learning methods. Briefings in bioinformatics. 2020;21(4):1425–36. doi: 10.1093/bib/bbz080 [DOI] [PubMed] [Google Scholar]
48.Yan C, Wang J, Wu F-X. DWNN-RLS: regularized least squares method for predicting circRNA-disease associations. BMC Bioinformatics. 2018;19(19):520. doi: 10.1186/s12859-018-2522-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
49.Lei X, Fang Z, Guo L. Predicting circRNA–Disease Associations Based on Improved Collaboration Filtering Recommendation System With Multiple Data. Frontiers in Genetics. 2019;10. [DOI] [PMC free article] [PubMed] [Google Scholar]
50.Chen W, Lin H, Chou K-C. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. Molecular BioSystems. 2015;11(10):2620–34. doi: 10.1039/c5mb00155b [DOI] [PubMed] [Google Scholar]
51.Torres M, Yang H, Romero AE, Paccanaro A. Protein function prediction for newly sequenced organisms. Nature Machine Intelligence. 2021;3(12):1050–60. doi: 10.1038/s42256-021-00419-7 [DOI] [Google Scholar]
52.Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biology. 2019;20(1):244. doi: 10.1186/s13059-019-1835-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
53.Kulmanov M, Hoehndorf R. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics. 2020;36(2):422–9. doi: 10.1093/bioinformatics/btz595 [DOI] [PMC free article] [PubMed] [Google Scholar]
54.Cao Y, Shen Y. TALE: Transformer-based protein function Annotation with joint sequence–Label Embedding. Bioinformatics. 2021;37(18):2825–33. doi: 10.1093/bioinformatics/btab198 [DOI] [PMC free article] [PubMed] [Google Scholar]
55.Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nature Biotechnology. 2016;34(2):184–91. doi: 10.1038/nbt.3437 [DOI] [PMC free article] [PubMed] [Google Scholar]

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011214.r001

Decision Letter 0

William Stafford Noble, Maxwell Wing Libbrecht

16 Dec 2022

Dear Prof. Liu,

Thank you very much for submitting your manuscript "BioSeq-Diabolo : biological sequence similarity analysis using Diabolo" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

As you will see from the reports below, the reviewers note the good results and usability of the code web server. However, they identify many significant issues. In particular, the manuscript lacks description of the methods (particularly of the use of LTR, the main contribution) and the evaluation experiments. Please ensure that the reader can evaluate from the text whether evaluation experiments support the desired conclusions, including a description of what information is held out from which parts of the approach to avoid data leakage and, more generally, what biological problem these experiments are simulating. Also, it seems that some performance numbers are copied from other papers; please reproduce the results or describe how you ensured that evaluation regime is exactly reproduced. Without substantial revisions such that the manuscript precisely describes its methods and accurate evaluation methodology, we will be unlikely to send the paper back to review.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Maxwell Wing Libbrecht, Ph.D.

Academic Editor

PLOS Computational Biology

Lucy Houghton

Staff

PLOS Computational Biology

***********************

As you will see from the reports below, the reviewers note the good results and usability of the code web server. However, they identify many significant issues. In particular, they manuscript lacks description of the methods (particularly of the use LTR, the main contribution) and the evaluation experiments. Please ensure that the reader can evaluate from the text whether evaluation experiments support the desired conclusions, including a description of what information is held out from which parts of the approach to avoid data leakage and, more generally, what biological problem these experiments are simulating. Also, it seems that some performance numbers are copied from other papers; please reproduce the results or describe how you ensured that evaluation regime is exactly reproduced. Without substantial revisions such that the manuscript precisely describes its methods and accurate evaluation methodology, we will be unlikely to send the paper back to review.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Li and Liu introduced a new method called BioSeq-Diabolo to calculate the biological sequence similarities. Biological sequence similarity analysis is an important task in bioinformatics, and efficient methods are desired. Generally, BioSeq-Diabolo is interesting, because it uses the techniques derived from the field of natural language processing, bringing new techniques and concepts to solve this task. The corresponding webserver has been constructed, and the standalone package has been released as well. I have tested both the webserver and the package, and they worked well as described. I believe that they will be particularly interesting for the researchers who are working on the related fields. I have the following comments to improve the presentation of this manuscript.

1. From the webserver, I understand that there are several pipeline software tools as listed in the right part. However, the relationships between these software tools and BioSeq-Diabolo should be explained in the webserver site as well. As a result, the users are able to use these software tools for their own tasks and aims.

2. The information or references of the datasets provided in the Download section (http://bliulab.net/BioSeq-Diabolo/download/) should be given.

3. I have downloaded the stand-alone package from http://bliulab.net/BioSeq-Diabolo/static/download/Sesica.tar.gz, and I felt this package is easy to use, especially for the large dataset analysis aims. However, I failed to find the files for the examples reported in the main text. I suggest Li and Liu providing this information so as to help the users reproducing the reported experiments.

4. The figure 2 is confusing. More information should be provided in its legend.

5. The structure shown in figure 4 is clear, which is very similar as the Diabolo, but the input and output shown in the top part flowchart is not clear. The two important components should be shown in different colors

6. In the current study, Li and Liu used four performance measures to evaluate their performance. Did they follow other studies, and why?

7. For a given methods in BioSeq-Diabolo, it is difficult for the users to select the best one for their own tasks. Did Li and Liu consider this problem? It will be more useful to add the function to automatically select the optimized methods for different tasks. Furthermore, there are several parameters for the different methods. The parameter optimization function should also be added.

8. The result visualization part is elegant and clear, but I cannot find the corresponding figures when using the standalone package. Can Li and Liu also add these visualization function into the standalone package as well?

9. This manuscript presented a very useful tool for biological sequence similarity analysis, but more discussions of its potential applications should be given in the conclusion section in order to attract the readers of Plos computational biology.

Reviewer #2: The authors established the web server and the stand-alone package for biological sequence similarity analysis called BioSeq-Diabolo. Its performance was well evaluated for different biological sequence similarity analysis problems. BioSeq-Diabolo is easily combined with their previously established software tools, and it is the first comprehensive platform incorporating various methods from natural language processing (NLP). It is reasonable to conduct the biological sequence similarity analysis based on the NLP methods considering the similarities between the biological sequences and the natural language sentences.

The following revisions are suggested:

1) The heterogeneous biological sequence similarity analysis shown in BioSeq-Diabolo will contribute to many problems in bioinformatics, such as non-coding RNA and disease association prediction. In the current version of the manuscript, it is not clear how to combine the heterogeneous features (such as non-coding RNAs and diseases) to make the prediction. More explanations are needed.

2) The information shown in table 1 clearly indicates that BioSeq-Diabolo and their previous method BLM are complementary, dealing with different problems. For example, Diabolo is for biological sequence similarity analysis, and BLM is for classification or sequence labelling problem in bioinformatics. More explanations of their relationships are need in both the main text and the web site of Diabolo.

3) Although figure 1 shows the similarities between sequence similarity analysis and semantics analysis, these similarities are not so clear. More explanations and discussions are needed, especially for their similarities among the individual steps shown in this figure.

4) In section 3.2, the concept of embedding is not clear. Dose it mean extract the features of the biological sequences?

5) Learning to Rank is a method to combine different methods. This method seems to be better than the other unsurprised ones. As discussed in the corresponding section, it has been successfully applied to solve many problems in bioinformatics. The authors are suggested to show more details of Learning to Rank, because it is an important step in Diabolo.

6) Written problems: the last column in Table 1, “Sequence similarities analysing” -> “Sequence similarity analysis”.

7) The web server of Diabolo is a big advantage, and it is easy to use. I only need to feed the features, and the web server will output the desired analysis results. I validated it with both the examples provided by the authors and my own data, and I got the desired results. So I think the web server will popular with the other users as well, but Some improvements are desired. For term explanation, the authors are suggested to add the corresponding references of the terms, which will help the user to get the correct information. The main buttons (“Server”, “Tutorial” and “Document”) are suggested to be more clear and bigger.

8) More discussions of the advantages and disadvantages of Diabolo are suggested to given in the conclusion part.

Reviewer #3: In this work, by using the LTR integration method to analyze the similarity of different biological sequences (DNA, RNA, protein, disease, etc), an unified platform for systematically analyzing the similarity of homogeneous and heterogeneous biological sequences has been realized. The text is logically coherent, and the server functions provided are complete, powerful and useful. Here are some comments and suggestions to further enhance this study.

Major

1) In the three tests, the descriptions of the evaluation indicators are missing, and they don't appear in the supplementary material either.

2) It is said that "The users only need to input the embeddings of the biological sequence data. BioSeq-Diabolo will intelligently identify the task, and then accurately analyse the biological sequence similarities based on biological language semantics." SO how BioSeq-Diabolo intelligently identifies tasks is not explained in the paper.

3) In the protein function annotation task, the official evaluation metrics in CAFA are F-max and S_min, which are missing in this work.

4. Lack of running time reports and performance evaluation of models under large-scale data. Estimation of time consumption is important for a user-oriented method.

Minor

1. Figure 1, a, the final protein should be "1D3B" instead of "D3B"

2. Figure 4, "Learning to Rank" part, "Predicted diseases list for topic", "topic" is used in NLP background. Here it should be "Predicted diseases list for RNAs". The same problem exists in the Figure of the server(http://bliulab.net/BioSeq-Diabolo/server/).

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: None

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. 2023 Jun 20;19(6):e1011214. doi: 10.1371/journal.pcbi.1011214.r002

Author response to Decision Letter 0

4 Mar 2023

Attachment

Submitted filename: Reponse letter.docx

Click here for additional data file.^{(29.1KB, docx)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011214.r003

Decision Letter 1

William Stafford Noble, Maxwell Wing Libbrecht

16 Mar 2023

Dear Prof. Liu,

Thank you very much for submitting your manuscript "BioSeq-Diabolo : biological sequence similarity analysis using Diabolo" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board. We would like to invite the resubmission of a significantly-revised version that takes into account the comments below.

Thank you for your revised manuscript. Upon initial reading, it seems that a key comment from the reviewers (e.g. R3-C1) remains unaddressed. It seems that descriptions of AUC and AUPR in general were added, but there is still no description of the experiments themselves. Please ensure that the reader can fully understand and reproduce each experiment from the text (main or supplementary), including the source of the data, all preprocessing performed, how fields from the source database were mapped to machine learning concepts (e.g. how true label y_i is defined for each task), the division of data into train and test sets, etc. This issue must be addressed before the manuscript can be re-sent for review. Also, when referencing the supplemental material, please include a reference to the specific section in question.

When you are ready to resubmit, please upload the following:

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Sincerely,

Maxwell Wing Libbrecht, Ph.D.

Academic Editor

PLOS Computational Biology

Lucy Houghton

Staff

PLOS Computational Biology

***********************

Figure Files:

Data Requirements:

Reproducibility:

PLoS Comput Biol. 2023 Jun 20;19(6):e1011214. doi: 10.1371/journal.pcbi.1011214.r004

Author response to Decision Letter 1

21 Mar 2023

Attachment

Submitted filename: Reponse letter.docx

Click here for additional data file.^{(29.3KB, docx)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011214.r005

Decision Letter 2

William Stafford Noble, Maxwell Wing Libbrecht

18 Apr 2023

Dear Prof. Liu,

Thank you very much for submitting your manuscript "BioSeq-Diabolo : biological sequence similarity analysis using Diabolo" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Maxwell Wing Libbrecht, Ph.D.

Academic Editor

PLOS Computational Biology

Lucy Houghton

Staff

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: It is revised well

Reviewer #2: The authors have improved their work and addressed my comments. Happy to recommend this version for publication in PLoS CB.

Reviewer #3: The author has actively responded to our comments and resolved most of them. Here are some additional suggestions to further enhance this study.

1. The author added "The details of the reported experiments" section in the supplementary material to further explain the details of the experiment. But as far as the protein function annotation problem is concerned, the experimental implementation has the following problem:

Generally speaking, when evaluating protein function annotation, most of the researchers will evaluate MFO, BPO, and CCO separately, instead of only evaluating CCO.

2. I think the training time should not be the focus of the evaluation. Instead, the experiment should focus on evaluating the time consumption between different approaches. That is, when the number of samples is the same, the time required for different methods to get the result from scratch.

3. "BioSeq- Diabolo will intelligently identify the specific tasks (homogeneous or heterogeneous biological sequence similarity analysis tasks) based on the input embeddings of the biological sequences and the parameters provided by the users." This section needs to specify how to identify specific tasks.

4. In addition, users only need to input biological sequence data, but generating different input embeddings from biological sequence data requires different models. How does BioSeq-Diabolo select the corresponding model.

5. The layout of Figure 4 is suitable, and many subplots are too small to be seen clearly. You can try to regroup the subplots. The theme of Diabolo does not need to be presented on this diagram, which hinders the normal layout of the diagram and can be reflected in your LOGO.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: None

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

Data Requirements:

Reproducibility:

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. 2023 Jun 20;19(6):e1011214. doi: 10.1371/journal.pcbi.1011214.r006

Author response to Decision Letter 2

19 May 2023

Attachment

Submitted filename: Response Letter_v1.docx

Click here for additional data file.^{(19.9KB, docx)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011214.r007

Decision Letter 3

William Stafford Noble, Maxwell Wing Libbrecht

24 May 2023

Dear Prof. Liu,

We are pleased to inform you that your manuscript 'BioSeq-Diabolo : biological sequence similarity analysis using Diabolo' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Maxwell Wing Libbrecht, Ph.D.

Academic Editor

PLOS Computational Biology

Lucy Houghton

Staff

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011214.r008

Acceptance letter

William Stafford Noble, Maxwell Wing Libbrecht

12 Jun 2023

PCOMPBIOL-D-22-01464R3

BioSeq-Diabolo : biological sequence similarity analysis using Diabolo

Dear Dr Liu,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofi Zombor

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Table. The distribution methods and their descriptions.

(DOCX)

Click here for additional data file.^{(16.8KB, docx)}

S2 Table. Representation methods and their descriptions.

(DOCX)

Click here for additional data file.^{(15.4KB, docx)}

S3 Table. Representation methods and their descriptions.

(DOCX)

Click here for additional data file.^{(18.5KB, docx)}

S4 Table. The performance of BioSeq-Diabolo compared with competing methods in Molecular Function Ontology (MFO) for protein function annotation.

(DOCX)

Click here for additional data file.^{(18.2KB, docx)}

S5 Table. The performance of BioSeq-Diabolo compared with competing methods in Biological Process Ontology (BPO) for protein function annotation.

(DOCX)

Click here for additional data file.^{(18.2KB, docx)}

S1 Fig. The flowchart of BioSeq-Diabolo.

The web server and stand-alone package of BioSeq-Diabolo were developed based on this procedure.

(TIF)

Click here for additional data file.^{(155.3KB, tif)}

S2 Fig. The running time required for 9 different methods to get the result from scratch.

(TIF)

Click here for additional data file.^{(180.6KB, tif)}

S1 Text. Learning to Rank.

(DOCX)

Click here for additional data file.^{(16.1KB, docx)}

S2 Text. Performance evaluation indicators.

(DOCX)

Click here for additional data file.^{(22.3KB, docx)}

S3 Text. Method optimization and Result visualization.

(DOCX)

Click here for additional data file.^{(13.5KB, docx)}

S4 Text. The details of the reported experiments.

(DOCX)

Click here for additional data file.^{(23.7KB, docx)}

Attachment

Submitted filename: Reponse letter.docx

Click here for additional data file.^{(29.1KB, docx)}

Attachment

Submitted filename: Reponse letter.docx

Click here for additional data file.^{(29.3KB, docx)}

Attachment

Submitted filename: Response Letter_v1.docx

Click here for additional data file.^{(19.9KB, docx)}

Data Availability Statement

[pcbi.1011214.ref001] 1.Bepler T, Berger B. Learning the protein language: Evolution, structure, and function. Cell Systems. 2021;12(6):654–69.e3. doi: 10.1016/j.cels.2021.05.017 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref002] 2.Jo T, Hou J, Eickholt J, Cheng J. Improving Protein Fold Recognition by Deep Learning Networks. Scientific Reports. 2015;5(1):17573. doi: 10.1038/srep17573 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref003] 3.Tunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Žídek A, et al. Highly accurate protein structure prediction for the human proteome. Nature. 2021;596(7873):590–6. doi: 10.1038/s41586-021-03828-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref004] 4.Gligorijević V, Renfrew PD, Kosciolek T, Leman JK, Berenberg D, Vatanen T, et al. Structure-based protein function prediction using graph convolutional networks. Nature Communications. 2021;12(1):3168. doi: 10.1038/s41467-021-23303-9 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref005] 5.Callaway E. ’It will change everything’: DeepMind’s AI makes gigantic leap in solving protein structures. Nature. 2020;588(7837):203–4. Epub 2020/12/02. doi: 10.1038/d41586-020-03348-4 . [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref006] 6.Li J, Zhang S, Wan Y, Zhao Y, Shi J, Zhou Y, et al. MISIM v2.0: a web server for inferring microRNA functional similarity based on microRNA-disease associations. Nucleic Acids Research. 2019;47(W1):W536–W41. doi: 10.1093/nar/gkz328 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref007] 7.Hu Y, Chen C-h, Ding Y-y, Wen X, Wang B, Gao L, et al. Optimal control nodes in disease-perturbed networks as targets for combination therapy. Nature Communications. 2019;10(1):2180. doi: 10.1038/s41467-019-10215-y [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref008] 8.Chen J, Guo M, Li S, Liu B. ProtDec-LTR2.0: an improved method for protein remote homology detection by combining pseudo protein and supervised Learning to Rank. Bioinformatics. 2017;33(21):3473–6. doi: 10.1093/bioinformatics/btx429 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref009] 9.Wei H, Xu Y, Liu B. iCircDA-LTR: identification of circRNA–disease associations based on Learning to Rank. Bioinformatics. 2021;37(19):3302–10. doi: 10.1093/bioinformatics/btab334 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref010] 10.Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25(17):3389–402. doi: 10.1093/nar/25.17.3389 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref011] 11.Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature Methods. 2012;9(2):173–5. doi: 10.1038/nmeth.1818 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref012] 12.Eddy SR. Accelerated Profile HMM Searches. PLoS computational biology. 2011;7(10):e1002195. Epub 2011/11/01. doi: 10.1371/journal.pcbi.1002195 ; PubMed Central PMCID: PMC3197634. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref013] 13.Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods. 2021;18(4):366–8. doi: 10.1038/s41592-021-01101-x [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref014] 14.Liu B, Jiang S, Zou Q. HITS-PR-HHblits: protein remote homology detection by combining PageRank and Hyperlink-Induced Topic Search. Briefings in bioinformatics. 2020;21(1):298–308. doi: 10.1093/bib/bby104 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref015] 15.Jin X, Liao Q, Wei H, Zhang J, Liu B. SMI-BLAST: a novel supervised search framework based on PSI-BLAST for protein remote homology detection. Bioinformatics. 2021;37(7):913–20. doi: 10.1093/bioinformatics/btaa772 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref016] 16.Zhao X, Zhao X, Yin M. Heterogeneous graph attention network based on meta-paths for lncRNA–disease association prediction. Briefings in bioinformatics. 2022;23(1):bbab407. doi: 10.1093/bib/bbab407 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref017] 17.Niu M, Zou Q, Wang C. GMNN2CD: identification of circRNA–disease associations based on variational inference and graph Markov neural networks. Bioinformatics. 2022;38(8):2246–53. doi: 10.1093/bioinformatics/btac079 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref018] 18.Searls DB. The language of genes. Nature. 2002;420(6912):211–7. doi: 10.1038/nature01255 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref019] 19.Ganapathiraju M, Balakrishnan N, Reddy R, Klein-Seetharaman J. Computational Biology and Language. In: Cai Y, editor. Ambient Intelligence for Scientific Discovery: Foundations, Theories, and Systems. Berlin, Heidelberg; 2005. p. 25–47. [Google Scholar]

[pcbi.1011214.ref020] 20.Gimona M. Protein linguistics—a grammar for modular protein assembly? Nature Reviews Molecular Cell Biology. 2006;7(1):68–73. doi: 10.1038/nrm1785 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref021] 21.Li H-L, Pang Y-H, Liu B. BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models. Nucleic Acids Research. 2021;49(22):e129-e. doi: 10.1093/nar/gkab829 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref022] 22.Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, et al. ProtTrans: Towards Cracking the Language of Lifes Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021:1-. doi: 10.1109/TPAMI.2021.3095381 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref023] 23.You R, Zhang Z, Xiong Y, Sun F, Mamitsuka H, Zhu S. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics. 2018;34(14):2465–73. doi: 10.1093/bioinformatics/bty130 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref024] 24.Liu B, Gao X, Zhang H. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Research. 2019;47(20):e127-e. doi: 10.1093/nar/gkz740 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref025] 25.Bass JIF, Diallo A, Nelson J, Soto JM, Myers CL, Walhout AJM. Correction: Corrigendum: Using networks to measure similarity between genes: association index selection. Nature Methods. 2014;11(3):349-. doi: 10.1038/nmeth0314-349c [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref026] 26.Chen Z, Zhao P, Li C, Li F, Xiang D, Chen Y-Z, et al. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Research. 2021;49(10):e60-e. doi: 10.1093/nar/gkab122 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref027] 27.Chandrasekaran D, Mago V. Evolution of Semantic Similarity—A Survey. ACM Comput Surv. 2021;54(2):41. doi: 10.1145/3440755 [DOI] [Google Scholar]

[pcbi.1011214.ref028] 28.Wu Q, Burges CJC, Svore KM, Gao J. Adapting boosting for information retrieval measures. Information Retrieval. 2010;13(3):254–70. doi: 10.1007/s10791-009-9112-1 [DOI] [Google Scholar]

[pcbi.1011214.ref029] 29.Burges C, Shaked T, Renshaw E, Lazier A, Deeds M, Hamilton N, et al., editors. Learning to rank using gradient descent. Proceedings of the 22nd international conference on Machine learning; 2005. [Google Scholar]

[pcbi.1011214.ref030] 30.Yuan Q, Gao J, Wu D, Zhang S, Mamitsuka H, Zhu S. DrugE-Rank: improving drug–target interaction prediction of new candidate drugs or targets by ensemble learning to rank. Bioinformatics. 2016;32(12):i18–i27. doi: 10.1093/bioinformatics/btw244 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref031] 31.Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology. 1975;12(4):387–415. doi: 10.1016/0022-2496(75)90001-2 [DOI] [Google Scholar]

[pcbi.1011214.ref032] 32.Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters. 2006;27(8):861–74. doi: 10.1016/j.patrec.2005.10.010 [DOI] [Google Scholar]

[pcbi.1011214.ref033] 33.Järvelin K, Kekäläinen J. Cumulated gain-based evaluation of IR techniques. ACM Trans Inf Syst. 2002;20(4):422–46. doi: 10.1145/582415.582418 [DOI] [Google Scholar]

[pcbi.1011214.ref034] 34.Gribskov M, Robinson NL. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers & Chemistry. 1996;20(1):25–33. doi: 10.1016/s0097-8485(96)80004-0 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref035] 35.Kulmanov M, Khan MA, Hoehndorf R. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics. 2018;34(4):660–8. doi: 10.1093/bioinformatics/btx624 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref036] 36.Hagberg AA, Schult DA, Swart P, editors. Exploring Network Structure, Dynamics, and Function using NetworkX. Proceedings of the 7th Python in Science conference; 2008. [Google Scholar]

[pcbi.1011214.ref037] 37.Saccenti E, Hendriks MHWB, Smilde AK. Corruption of the Pearson correlation coefficient by measurement error and its estimation, bias, and correction under different error models. Scientific Reports. 2020;10(1):438. doi: 10.1038/s41598-019-57247-4 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref038] 38.Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nature Communications. 2019;10(1):5416. doi: 10.1038/s41467-019-13056-x [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref039] 39.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825–30. [Google Scholar]

[pcbi.1011214.ref040] 40.Guo J, Fan Y, Ji X, Cheng X. MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching. Proceedings of the 42nd International ACM SIGIR Conference on Research Development in Information Retrieval. 2019. [Google Scholar]

[pcbi.1011214.ref041] 41.Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247(4):536–40. Epub 1995/04/07. doi: 10.1006/jmbi.1995.0159 . [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref042] 42.Boratyn GM, Schäffer AA, Agarwala R, Altschul SF, Lipman DJ, Madden TL. Domain enhanced lookup time accelerated BLAST. Biology Direct. 2012;7(1):12. doi: 10.1186/1745-6150-7-12 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref043] 43.Söding J. Protein homology detection by HMM–HMM comparison. Bioinformatics. 2005;21(7):951–60. doi: 10.1093/bioinformatics/bti125 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref044] 44.Yu L, Tanwar Deepak K, Penha Emanuel Diego S, Wolf Yuri I, Koonin Eugene V, Basu Malay K. Grammar of protein domain architectures. Proceedings of the National Academy of Sciences. 2019;116(9):3636–45. doi: 10.1073/pnas.1814684116 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref045] 45.Shao J, Chen J, Liu B. ProtRe-CN: Protein Remote Homology Detection by Combining Classification Methods and Network Methods via Learning to Rank. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2021:1-. doi: 10.1109/TCBB.2021.3108168 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref046] 46.Liu B, Liu F, Wang X, Chen J, Fang L, Chou K-C. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research. 2015;43(W1):W65–W71. doi: 10.1093/nar/gkv458 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref047] 47.Zeng X, Zhong Y, Lin W, Zou Q. Predicting disease-associated circular RNAs using deep forests combined with positive-unlabeled learning methods. Briefings in bioinformatics. 2020;21(4):1425–36. doi: 10.1093/bib/bbz080 [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref048] 48.Yan C, Wang J, Wu F-X. DWNN-RLS: regularized least squares method for predicting circRNA-disease associations. BMC Bioinformatics. 2018;19(19):520. doi: 10.1186/s12859-018-2522-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref049] 49.Lei X, Fang Z, Guo L. Predicting circRNA–Disease Associations Based on Improved Collaboration Filtering Recommendation System With Multiple Data. Frontiers in Genetics. 2019;10. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref050] 50.Chen W, Lin H, Chou K-C. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. Molecular BioSystems. 2015;11(10):2620–34. doi: 10.1039/c5mb00155b [DOI] [PubMed] [Google Scholar]

[pcbi.1011214.ref051] 51.Torres M, Yang H, Romero AE, Paccanaro A. Protein function prediction for newly sequenced organisms. Nature Machine Intelligence. 2021;3(12):1050–60. doi: 10.1038/s42256-021-00419-7 [DOI] [Google Scholar]

[pcbi.1011214.ref052] 52.Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biology. 2019;20(1):244. doi: 10.1186/s13059-019-1835-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref053] 53.Kulmanov M, Hoehndorf R. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics. 2020;36(2):422–9. doi: 10.1093/bioinformatics/btz595 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref054] 54.Cao Y, Shen Y. TALE: Transformer-based protein function Annotation with joint sequence–Label Embedding. Bioinformatics. 2021;37(18):2825–33. doi: 10.1093/bioinformatics/btab198 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011214.ref055] 55.Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nature Biotechnology. 2016;34(2):184–91. doi: 10.1038/nbt.3437 [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

BioSeq-Diabolo: Biological sequence similarity analysis using Diabolo

Hongliang Li

Bin Liu

Roles

Abstract

Author summary

1. Introduction

Fig 1.

Table 1. The comparisons between BioSeq-Diabolo and BioSeq-BLM.

2. Materials and methods

2.1. Biological sequence similarity analysis tasks

Fig 2. The analogy between biological sequence similarity analysis tasks and natural language processing tasks.

2.2. Biological sequence similarity analysis based on biological language semantics

Embeddings

Fig 3. The architectures of biological sequence similarity analysis methods based on biological language semantics.

Distribution methods

Representation methods

Interaction methods

2.3. BioSeq-Diabolo

Fig 4. BioSeq-Diabolo schematic overview.

Learning to rank

Performance evaluation

Result analysis

2.4. Web server and stand-alone package of BioSeq-Diabolo

Web server

Stand-alone package

3. Results and discussion

3.1. BioSeq-Diabolo facilitates the protein remote homology detection

Fig 5. The visualization results automatically generated by BioSeq-Diabolo for protein remote homology detection.

Table 2. The performance of BioSeq-Diabolo compared with competing methods for protein remote homology detection.

Fig 6. The training time and test time of BioSeq-Diabolo trained and tested with different number of samples.

3.2. BioSeq-Diabolo facilitates the circRNA-disease association identification

Fig 7. The visualization results automatically generated by BioSeq-Diabolo for circRNA-disease association identification.

Table 3. The performance of BioSeq-Diabolo compared with competing methods for circRNA-disease association identification.

3.3. BioSeq-Diabolo facilitates the protein function annotation

Fig 8. The visualization results automatically generated by BioSeq-Diabolo for protein function annotation.

Table 4. The performance of BioSeq-Diabolo compared with competing methods in Cellular Component Ontology (CCO) for protein function annotation.

4. Conclusion

Supporting information

Data Availability

Funding Statement

References

Decision Letter 0

William Stafford Noble

Maxwell Wing Libbrecht

Roles

Author response to Decision Letter 0

Decision Letter 1

William Stafford Noble

Maxwell Wing Libbrecht

Roles

Author response to Decision Letter 1

Decision Letter 2

William Stafford Noble

Maxwell Wing Libbrecht

Roles

Author response to Decision Letter 2

Decision Letter 3

William Stafford Noble

Maxwell Wing Libbrecht

Roles

Acceptance letter

William Stafford Noble

Maxwell Wing Libbrecht

Roles

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases