Molecular Therapy. Nucleic Acids
. 2018 Mar 9;11:337–344. doi: 10.1016/j.omtn.2018.03.001

A Deep Learning Framework for Robust and Accurate Prediction of ncRNA-Protein Interactions Using Evolutionary Information

Hai-Cheng Yi 1,2,4, Zhu-Hong You 1,4,, De-Shuang Huang 3,∗∗, Xiao Li 1, Tong-Hai Jiang 1, Li-Ping Li 1
PMCID: PMC5992449  PMID: 29858068

Abstract

The interactions between non-coding RNAs (ncRNAs) and proteins play an important role in many biological processes, and ncRNA functions are primarily achieved by binding with a variety of proteins. High-throughput biological techniques can identify the protein molecules bound by a specific ncRNA, but they are usually expensive and time-consuming. Deep learning provides a powerful solution for computationally predicting RNA-protein interactions. In this work, we propose the RPI-SAN model, which uses a deep-learning stacked auto-encoder network to mine hidden high-level features from RNA and protein sequences and feeds them into a random forest (RF) model to predict ncRNA-binding proteins. Stacked assembling is further used to improve the accuracy of the proposed method. Four benchmark datasets, including RPI2241, RPI488, RPI1807, and NPInter v2.0, were employed for the unbiased evaluation of five prediction tools: RPI-Pred, IPMiner, RPISeq-RF, lncPro, and RPI-SAN. The experimental results show that our RPI-SAN model achieves much better performance than the other methods, with accuracies of 90.77%, 89.7%, 96.1%, and 99.33%, respectively. It is anticipated that RPI-SAN can serve as an effective computational tool for future biomedical research and can accurately predict potential interacting ncRNA-protein pairs, providing reliable guidance for biological research.

Keywords: RNA-protein interactions, non-coding RNA, deep learning, stacked auto-encoder, PSSM, Zernike moment

Introduction

In the human genome, 74.7% of the sequence can be transcribed into RNA, but the total exon sequence of mRNA accounts for only 2.94%.1, 2, 3 The remaining sequence information is output in the form of non-coding RNA (ncRNA), which can be divided into two types: constitutive and regulatory.4 Small-molecule ncRNAs make up only a small proportion of the constitutive and regulatory ncRNA in non-coding sequences; most non-coding sequences are transcribed into long ncRNA (lncRNA). Compared with mRNA, lncRNA is shorter in length and has fewer exons, with an average abundance of about 1/10 that of mRNA and lower sequence conservation.5, 6, 7 It has been found that lncRNA can participate in all aspects of gene expression regulation by interacting with proteins such as chromatin modification complexes and transcription factors, thus playing a fundamental role in a variety of important biological processes such as X chromosome inactivation (Xist8 and Tsix9), gene imprinting (H1910 and Air11), and developmental differentiation (HOTAIR12 and TINCR13). Although the role of ncRNA-protein interactions (ncRPIs) in the regulation of gene expression is beyond doubt, only a small number of ncRNA functions and mechanisms of action have been studied. Since ncRNA functions require the coordination of protein molecules, identifying the protein molecules bound by a specific ncRNA has become the main approach to revealing the function and mechanism of ncRNA.

Large-scale RNA-binding protein (RBP) detection experiments based on biological methods have made many important advances,14, 15, 16 such as RNAcompete,17 HITS-CLIP,18 and RNA-protein complex structures, which provide valuable information about RNA-protein interactions (RPIs). However, experimental methods remain time-consuming and expensive (for example, determining a complex structure experimentally is costly), and these high-throughput technologies require much time for the laborious hand-tuning of putative binding sequences.19 Many studies suggest that sequences contain enough information for predicting RPIs. Sequence-homology-based methods help to detect the binding domains of proteins and their possible functions,20, 21, 22, 23, 24 but they cannot reliably determine whether a given pair of RNA and protein can form an interaction. There is therefore an urgent need for an accurate computational approach to predicting RPIs.

In recent years, computational prediction of the interaction partners of proteins and RNAs has attracted considerable research interest.15, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 Pancaldi et al.36, 37 trained a random forest (RF) and a support vector machine (SVM) to classify whether an RNA-protein pair interacts, using >100 different features extracted from genomic context,38 structure, or localization. RPISeq21 was introduced by Muppirala et al.,39 who also applied RF and SVM classifiers, using simple 4-mer features of RNAs and 3-mer features of proteins, respectively. Thereafter, lncPro25 combined three types of physicochemical properties using a Fisher linear discriminant. Zhou et al.20 presented RPI-Pred, a new SVM-based approach that takes both sequence and structure information into consideration to predict ncRPIs. Among the studies above, some methods use hand-crafted features of RNA-protein pairs,38 which may distort the true distribution of the data and require strong domain knowledge. Other methods extract only weakly discriminative features from noisy sequences, although they draw their information mainly from the sequences themselves.21, 25, 26 General machine learning methods may not mine the hidden regular patterns from such noise well. Thus, efficient features and advanced models play an important role in the computational prediction of RPIs.40, 41, 42, 43

In this study, we propose a powerful solution to these challenges: a sequence-based approach to predicting ncRPIs using deep learning combined with an RF classifier.44 More specifically, RNA sequences are first converted into a k-mers sparse matrix,40 which retains almost all composition and order information. Then singular value decomposition (SVD) is used to extract a feature vector for each sequence.45 For protein sequences, a pseudo-Zernike moment (PZM) descriptor is used to extract evolutionary information from the position-specific scoring matrix (PSSM).42, 46 Next, a stacked auto-encoder is employed to automatically learn hidden high-level features from the features mentioned above.47 Finally, these representative features are fed into RF classifiers to predict RPIs. To further improve the robustness and accuracy of our method, extra layers are employed to integrate the different predictors. In the experiments, the proposed method was evaluated on three benchmark datasets, RPI488,48 RPI1807,20 and RPI2241,21 and compared with other state-of-the-art methods, such as lncPro,25 RPISeq-RF,21 RPI-Pred,20 and IPMiner.48 The experimental results showed that our method achieves much better prediction performance on these datasets.

Results and Discussion

In this study, we propose a deep learning method named RPI-SAN, which conjoins the stacked auto-encoder network (SAN) with RF classifiers and uses the PSSM with the Zernike moment and the k-mers sparse matrix with SVD to predict ncRNA-protein interactions. First, we evaluate its ability to predict RPIs on the RPI2241 dataset. We then compare RPI-SAN with other state-of-the-art methods on different datasets to demonstrate the effectiveness and robustness of our approach, and we predict ncRPIs on different datasets using the trained model. Furthermore, we present a case study showing, with specific examples, how RPI-SAN advances the study of potential RPIs. Finally, we summarize, analyze, and discuss our method.

Evaluation of RPI-SAN’s Capability to Predict RPIs

We first test our RPI-SAN approach to evaluate its capability to predict RPIs on the RPI2241 dataset. The details listed in the Tables S2 and S3 are as follows.

The mean accuracy of 5-fold cross-validation is 90.77%, the mean sensitivity is 86.17%, the mean specificity is 97.37%, the mean precision is 84.05%, and the Matthews correlation coefficient (MCC) is 82.27%. Their respective SDs are 0.52%, 0.81%, 1.71%, 1.26%, and 1.25%. Table S2 shows the 5-fold cross-validation details for RPI-SAN on the RPI2241 dataset, with the area under the receiver operating characteristic curve (AUC) reaching 0.962, as shown in Figure 1. Our method achieves the best performance on the RPI2241 dataset among all compared methods.

Figure 1.

Figure 1

Prediction Performance Comparison Between SA-FT-RF, SA-RF, RPISeq-RF, Average Assembling, and Stacked Assembling on ncRNA-Protein Dataset RPI2241

Our method stacks three separate predictors: a stacked auto-encoder with fine-tuning (SA-FT-RF), a stacked auto-encoder with RF (SA-RF), and RPISeq with RF (RPISeq-RF). Each individual predictor behaves differently on different data: the stacked auto-encoder performs well in accuracy and specificity, while RPISeq-RF excels in precision and sensitivity. This indicates that individual predictors have limited adaptability, so it is necessary to integrate them to let each play to its strengths.

On the RPI2241 dataset, our RPI-SAN method performs much better than the other predictors. As shown in Table S3, RPI-SAN achieves an accuracy of 90.77%, sensitivity of 86.17%, specificity of 97.37%, precision of 84.05%, MCC of 82.27%, and AUC of 0.962, making it the best of the four compared predictors. SA-RF achieves an accuracy of 63.71%, sensitivity of 64.75%, specificity of 61.72%, precision of 65.74%, and MCC of only 27.49%. The accuracy, sensitivity, specificity, precision, and MCC of RPISeq-RF are 63.96%, 64.83%, 62.59%, 65.37%, and 27.98%, and those of SA-FT-RF are 90.52%, 87.71%, 94.78%, 86.18%, and 81.56%. lncPro achieves an accuracy of 65.4%, sensitivity of 65.9%, specificity of 64.0%, precision of 66.9%, and MCC of 31.0%, considerably worse than RPI-SAN. lncPro also has practical limitations: it can only predict for protein sequences longer than 30 residues, failing on shorter proteins, and because it uses RNAsubopt to predict RNA structure, which takes a long time for long sequences, it processes only the first 4,095 nucleotides of any RNA longer than 4,095. For these reasons, lncPro is not included among our stacked predictors.

Comparison between Different Assembling Strategies

In our RPI-SAN method, we use stacked assembling to integrate different classifiers. Here, we compare it with other common strategies, such as majority voting and averaging. As Figure 1 shows, stacked assembling attains an AUC of 0.962 on the RPI2241 dataset, better than the averaging method and each individual classifier. The logistic regression learns different weights for the stacked auto-encoder, the stacked auto-encoder with fine-tuning, and RPISeq-RF from the raw sequence features, which is more robust and flexible than simple averaging.

Different predictors play different roles in producing the final result, and stacked assembling improves the final prediction by different margins. On the RPI488 and RPI1807 datasets, the three predictors produce more similar outputs than on RPI2241, meaning a stronger correlation among them. Stacked assembling therefore still improves the AUC on RPI488 and RPI1807, but by less than on RPI2241, where the predictors are less correlated. In summary, stacked assembling is genuinely effective for improving final performance, and it is most beneficial on datasets where the individual predictors are less correlated.

Comparison with Other Methods

In order to verify the effectiveness and robustness of RPI-SAN, we compare it with other state-of-the-art methods on the same datasets. We selected RPI-Pred from the study by Suresh et al.,20 RPISeq-RF from the study by Muppirala et al.21 (because RPISeq-RF performs better than RPISeq-SVM in that study), IPMiner from the study by Pan et al.,48 and lncPro from the study by Lu et al.25 Since these methods are not all evaluated under the same criteria, we only compare results obtained with the same evaluation methods on the same datasets.

As shown in Table 1 and Figure S1, on the RPI488 dataset our method performs a little better than any other method, with an accuracy of 89.7%, sensitivity of 94.3%, specificity of 83.7%, precision of 95.2%, MCC of 79.3%, and AUC of 0.92; every measure is optimal. For the RPI1807 dataset, all methods except RPI-Pred perform very well, with accuracies above 95% and AUCs above 0.99 (shown in Figure S2). Our method also performs well here: although its accuracy of 96.1% is not the best, it is outstanding in specificity and in the important AUC measure, achieving an AUC of 0.999. For the RPI2241 dataset, before our proposed RPI-SAN, most methods did not work very well, especially in terms of accuracy, MCC, and AUC. Compared with the best published methods, RPI-SAN improves the accuracy by almost 7%, specificity by more than 16%, MCC by over 17%, and AUC by more than 6%.

Table 1.

Comparing RPI-SAN with Other Methods on the RPI488, RPI1807, and RPI2241 Datasets

Datasets Methods Accuracy (%) Sensitivity (%) Specificity (%) Precision (%) MCC (%) AUC
RPI488 IPMiner 89.1 93.9 83.1 94.5 78.4 0.914
RPISeq-RF 88.0 92.6 82.2 93.2 76.2 0.903
lncPro 87.0 90.0 82.7 91.0 74.0 0.901
RPI-SAN 89.7a 94.3a 83.7 95.2a 79.3a 0.920a
RPI1807 RPI-Pred 93.0 95.0 N/A 94.0 N/A 0.97
IPMiner 98.6a 98.2a 99.3 97.8a 97.2a 0.998
RPISeq-RF 97.3 96.8 98.4 96.0 94.6 0.996
lncPro 96.9 96.5 98.1 95.5 93.8 0.994
RPI-SAN 96.1 93.6 99.9 91.4 92.4 0.999a
RPI2241 RPI-Pred 84.0 78.0 N/A 88.0a N/A 0.89
IPMiner 82.4 83.3 81.2 83.6 65.0 0.906
RPISeq-RF 63.96 64.83 62.59 65.37 27.98 0.690
lncPro 65.4 65.9 64.0 66.9 31.0 0.722
RPI-SAN 90.77a 86.17a 97.37a 84.05 82.27a 0.962a

aThis measure of performance is the best among the compared methods for the individual dataset.

Predicting ncRPIs Using RPI-SAN

To further validate the ability of RPI-SAN to predict interactions between ncRNAs and proteins, we use the RPI488 dataset to train the deep learning model and verify it on the NPInter v2.0 dataset.49 There is no overlap between the two datasets. The 10,412 interaction pairs in NPInter v2.0 can be divided among six organisms, and we conduct experiments on them separately. The results are shown in Table 2. RPI-SAN correctly predicts 6,928, 29, 90, 897, 2,153, and 177 interacting pairs for Homo sapiens, Caenorhabditis elegans, Drosophila melanogaster, Saccharomyces cerevisiae, Mus musculus, and Escherichia coli, with accuracies of 99.33%, 80.56%, 98.90%, 98.56%, 97.95%, and 87.62%, respectively. In total, 10,274 ncRNA-protein pairs are correctly predicted, for an overall accuracy of 98.67% on the independent dataset NPInter v2.0.

Table 2.

Predicted Performance of the RPI488 Trained Model on NPInter v2.0 Dataset

Organism Number of Interaction Pairs Predicted Number of Interaction Pairs Accuracy (%)
Homo sapiens 6,975 6,928 99.33
Caenorhabditis elegans 36 29 80.56
Drosophila melanogaster 91 90 98.90
Saccharomyces cerevisiae 910 897 98.56
Mus musculus 2,198 2,153 97.95
Escherichia coli 202 177 87.62
Total 10,412 10,274 98.67

Case Study: Potential RPIs of the Top-15 Ranks Verified from Database

After evaluating the effectiveness and robustness of the proposed model, we calculate the interaction probability for potential RNA-protein pairs in the Homo sapiens dataset. The training data do not overlap with the testing data. Predicted RNA-protein pairs with high probability are considered potential interacting pairs and are further verified against Gene Ontology.50 As shown in Table 3, 15 interacting RNA-protein pairs are finally confirmed. Note that high-ranked interactions that have not yet been reported may also exist in reality. Based on these results, we anticipate that the proposed model is capable of predicting new RPIs.

Table 3.

Confirmed RNA-Protein Interactions with High Ranks in the Dataset of Homo sapiens

Protein ID RNA ID Probability
HNRNPA1 EPB41 0.867
TARDBP CFTR 0.866
MBNL1 DMPK 0.863
PTBP1 CD40LG 0.859
SRP19 RN7SL1 0.857
SRSF1 TNNT2 0.856
ELAVL4 MYCN 0.853
ELAVL2 ID1 0.851
HNRNPC CSF2 0.848
HNRNPD ADRB1 0.847
EIF5A RNU6-1 0.845
HNRNPD AGTR1 0.842
ELAVL3 VEGFA 0.838
YBX1 CSF2 0.833
ZBP1 ACTB 0.831

Conclusions

In this study, we have proposed RPI-SAN, a computational method based on deep learning with efficient features and stacked assembling to predict RPIs. We use the PSSM and the k-mers sparse matrix to extract efficient features from proteins and RNAs, respectively, and these features are then fed into the SAN with RF predictors. The presented method achieves a high performance, with an accuracy of 90.77%, MCC of 82.27%, and an excellent AUC of 0.962 on the RPI2241 dataset, and it also performs well on other popular datasets. The experimental results show that the stacked auto-encoder can learn high-level features automatically from raw information, which is important for designing machine learning models. RPI-SAN performs well on both RPI and ncRPI prediction, demonstrating that it surpasses other state-of-the-art methods in several respects. Through the experiments, we also find that RPI-SAN works better on large-scale datasets than on small ones, which we will continue to study in future work. We investigated computational techniques for predicting ncRNA-protein interactions because they are more convenient and rapid than traditional hand-tuned experiments and can accurately predict potential interacting ncRNA-protein pairs, providing reliable guidance for further biological research.

Materials and Methods

Construction of Datasets

To evaluate the effectiveness and robustness of our approach, we conducted experiments on four different benchmark datasets: RPI488, RPI1807, RPI2241, and NPInter v2.0.49 RPI488 is a non-redundant lncRPI dataset based on structure complexes,51, 52 containing 488 lncRNA-protein pairs: 245 non-interacting pairs and 243 interacting pairs. It is smaller than the other RNA-protein datasets, with only 243 lncRPIs, because far fewer lncRNA-protein complexes are available in the Protein Data Bank (PDB),53 from which the ncRNA-protein complexes were downloaded.54 The RPI1807 dataset contains 1,807 positive ncRPI pairs, comprising 1,078 RNA chains and 1,807 protein chains, and 1,436 negative ncRPI pairs, comprising 493 RNA chains and 1,436 protein chains. It was established by parsing the Nucleic Acid Database (NDB), which provides RNA-protein complex data and protein-RNA interface data. The RPI2241 dataset was constructed in a similar way and contains 2,241 interacting RNA-protein pairs. NPInter v2.0 is an ncRPI dataset from a non-structure-based source, containing 10,412 ncRNA-protein pairs among 449 protein chains and 4,636 ncRNA chains. Table 4 shows the details of the datasets used in this study.

Table 4.

The Details of the ncRNA-Protein Interaction Datasets

Dataset Interaction Pairs Number of Proteins Number of RNAs
RPI488 243 25 247
RPI1807 1,807 1,807 1,078
RPI2241 2,241 2,043 332
NPInter v2.0 10,412 449 4,636

RPI488 contains lncRNA-protein interactions based on structure complexes. RPI2241 and RPI1807 contain RNA-protein interactions. NPInter v2.0 contains ncRNA-protein interactions from a non-structure-based source.

Representation of the ncRNA and Protein Sequences

To obtain highly effective features for the deep learning models, each ncRNA-protein pair is represented as a 486-feature vector, in which 256 features encode the RNA sequence and 240 features encode the protein sequence. RNAs are encoded using the k-mers sparse matrix previously proposed by You et al.40 In this method, we scan each RNA sequence (over the alphabet A, C, G, U) from left to right, one nucleotide at a time; each nucleotide together with its k−1 succeeding nucleotides (k consecutive nucleotides in total) is regarded as a unit. For any RNA sequence of length L, there are 4^k different possible k-mers, and L−k+1 k-mers appear in the sequence.

Each input RNA sequence is processed into a 4^k × (L−k+1) k-mers sparse matrix R: when R_j R_{j+1} R_{j+2} R_{j+3} equals the ith of the 4^k different k-mers, the element a_ij is set to 1, and the remaining elements are handled in the same way. An input RNA sequence is thus converted into a 4^k × (L−k+1) matrix R. In this study, k is set to 4 for processing RNA sequences (see Table S1).

R = (a_ij)_{4^k × (L−k+1)} (1)
a_ij = 1, if R_j R_{j+1} R_{j+2} R_{j+3} = k-mer(i); 0, otherwise (2)

The 4-mer sparse matrix R is a low-rank matrix that nevertheless retains almost all of the information in an RNA sequence, including the frequency, position, and hidden order information of each 4-mer (AAAA, AAAC, …, UUUU). We then use SVD to compress the matrix R into a 1×256 feature vector.
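As an illustration of Eqs. 1 and 2, the k-mer sparse matrix and its SVD compression can be sketched in a few lines of numpy. The exact SVD reduction used by the authors is not specified beyond the 1×256 output size, so the reduction step below (scaling the leading left singular vector by its singular value) is an assumption:

```python
import numpy as np
from itertools import product

def kmer_sparse_matrix(seq, k=4):
    """Build the 4^k x (L-k+1) k-mer sparse matrix of Eqs. 1-2."""
    index = {"".join(p): i for i, p in enumerate(product("ACGU", repeat=k))}
    L = len(seq)
    M = np.zeros((4 ** k, L - k + 1))
    for j in range(L - k + 1):
        # One 1 per column: the row of the k-mer starting at position j.
        M[index[seq[j:j + k]], j] = 1.0
    return M

def svd_feature(M, dim=256):
    """Compress the sparse matrix into a fixed-length feature via SVD
    (one simple choice; the paper's exact reduction is not detailed)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, 0] * s[0])[:dim]

rna = "ACGUACGUACGGUUACG"           # 17-nt toy sequence
M = kmer_sparse_matrix(rna)          # shape (256, 14)
feature = svd_feature(M)             # 1 x 256 feature vector
```

With k = 4 the matrix has 4^4 = 256 rows, so the leading singular vector already has the 256-dimensional size reported for the RNA features.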

Considering that RNA and protein sequences have different structures, for protein amino acid sequences we use a more biologically grounded method, the PSSM. The PSSM, which encodes biological evolutionary information, was first used to detect distantly related proteins and has achieved great success in predicting protein secondary structure, protein binding sites, and disordered regions. A PSSM is an N×20 matrix, where N is the length of the input protein sequence and 20 is the number of native amino acids. Writing P = {b(i,j), i = 1, 2, …, N; j = 1, 2, …, 20}, the PSSM is represented as follows:

P = [ b_{1,1} ⋯ b_{1,20} ]
    [   ⋮     ⋱     ⋮    ]
    [ b_{N,1} ⋯ b_{N,20} ],  (3)

where b_{i,j} in the ith row of the PSSM represents the probability of the ith residue mutating into the jth of the 20 native amino acids over the course of evolution, estimated from multiple sequence alignments. In our experiments, we used the position-specific iterated BLAST (PSI-BLAST) tool to convert each raw protein sequence into a PSSM, running PSI-BLAST against the SwissProt database with 3 iterations and an e-value threshold of 0.001 to get the best results. Both the PSI-BLAST application and the SwissProt database can be freely downloaded from http://blast.ncbi.nlm.nih.gov/Blast.cgi.
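A PSSM can be generated from a raw protein sequence with the `psiblast` program from BLAST+. The invocation below is a sketch using the settings stated above (SwissProt database, 3 iterations, e-value 0.001); the input and output file names are illustrative, not from the paper:

```shell
# Sketch: generate one ASCII PSSM per protein (file names illustrative).
psiblast -query protein.fasta \
         -db swissprot \
         -num_iterations 3 \
         -evalue 0.001 \
         -out_ascii_pssm protein.pssm \
         -out /dev/null
```

The resulting ASCII PSSM file holds the N×20 scores that serve as input to the PZM descriptor.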

We then extracted PZM41 features from the PSSM. The PZM is widely used in image processing, where it has achieved good results: it extracts features from a matrix robustly and with little information redundancy. We set the required PZM parameters to n = m = 30. Finally, a feature vector is obtained for each protein sequence.

SAN

Deep learning, as a powerful tool, has been widely used in different areas19, 22, 23, 43, 55, 56 and has received great attention in the field of ncRPI prediction.57 Among the available deep-learning architectures, the SAN is the most appropriate for our needs. The stacked auto-encoder has almost all the advantages of a deep neural network (DNN) and outstanding expressive ability; it is usually able to obtain "hierarchical grouping" and "partial-global decomposition" features of the raw data. Since the stacked auto-encoder tends to represent the original input data effectively, we use auto-encoders as the component elements of a multilayer DNN.44, 55, 58

The SAN is composed of multiple sparse auto-encoder layers, with the output of each layer serving as the input of the next, as shown in Figure 2. Through hyperparameter optimization, we obtain the best parameters of the stacked auto-encoder network. The sparse auto-encoder network is constructed as in Figure 3: the error term measures the difference between the reconstructed data and the input, while the sparsity penalty is an L1 regularization term that constrains most of each layer's node activations to be 0, with only a few nonzero.

Figure 2.

Figure 2

The Construction of Stacked Auto-Encoder Network

Figure 3.

Figure 3

The Construction of Sparse Auto-Encoder

The input x is a d-dimensional vector, and the auto-encoder network maps x to the output h(x):

h_{W,b}(x) = f(W^T x) = f(Σ_{i=1}^{n} w_i x_i + b), (4)

where f is the activation function. We select the sigmoid function as the activation,

f(z) = 1 / (1 + e^{−z}), (5)

then the loss function is as follows:

L(X, W) = ‖Wh − X‖^2 + λ Σ_j |h_j|. (6)

Usually, each layer of a neural network includes a certain number of neurons, and the multilayer network forms a stacked network of sequentially connected layers, where the output of each layer is the input of the next:

a^(l) = f(z^(l)) (7)
z^(l+1) = W^(l,1) a^(l) + b^(l,1). (8)

Here, a^(n) is the activation of the deepest hidden units, a higher-order representation of the input. By using a^(n) as the input feature of a softmax classifier, the features learned by the deep auto-encoder network can be applied to classification problems. We use stochastic gradient descent (SGD)59 to optimize the reconstruction error between x and z, measured by the squared error.

Stacking multiple auto-encoders47 yields a stacked auto-encoder, a DNN that can learn high-level features automatically.60, 61 To get better performance, we use greedy layer-wise learning, which trains each layer individually to optimize its objective function when learning the stacked auto-encoder parameters. Our network uses two types of layers: fully connected and dropout layers.62 A dropout layer sets some node activations to 0 with a certain probability to avoid over-fitting during model training. We also add an extra softmax layer for fine-tuning, with the ReLU activation applied to the outputs of the conjoined multiple-layer networks of RNA and protein as the last hidden layer; it is trained with the real label information to update the weights and biases of the SAN.63, 64 We then use SGD (with different learning rates and momentums for different datasets) to minimize the cross-entropy loss, and Adam to minimize the mean squared error of each de-noising auto-encoder layer; the dropout probability is set to 0.5 during model training.65, 66 In this study, we use the Keras library to implement the stacked auto-encoder and set the parameters batch_size and nb_epoch both to 100. Details about Keras can be found at http://github.com/fchollet/keras.
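The greedy layer-wise scheme of Eqs. 4-8 can be illustrated with a minimal numpy sketch. The paper's actual implementation uses Keras with dropout, a sparsity penalty, and supervised fine-tuning, none of which are shown here; the layer sizes, learning rate, and toy data below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, hidden, epochs=200, lr=0.5):
    """Train one sigmoid auto-encoder layer by gradient descent
    on the squared reconstruction error (no sparsity term here)."""
    d = X.shape[1]
    W1, b1 = rng.normal(0, 0.1, (d, hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(0, 0.1, (hidden, d)), np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)        # encode (Eq. 4)
        Xr = sigmoid(H @ W2 + b2)       # decode / reconstruct
        err = Xr - X
        dXr = err * Xr * (1 - Xr)       # backprop through sigmoid
        dH = (dXr @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dXr / len(X); b2 -= lr * dXr.mean(axis=0)
        W1 -= lr * X.T @ dH / len(X);  b1 -= lr * dH.mean(axis=0)
    return W1, b1

# Greedy layer-wise stacking: each layer trains on the previous encoding
# (Eqs. 7-8), mimicking the SAN's construction on toy data.
X = rng.random((64, 32))
inp = X
for hidden in (16, 8):
    W, b = train_autoencoder(inp, hidden)
    inp = sigmoid(inp @ W + b)

features = inp   # high-level features for a downstream RF classifier
```

After stacking, `features` plays the role of a^(n): the deepest encoding handed to the downstream classifier.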

Stacked Assembling

Ordinarily, different classifiers perform differently on different datasets; in fact, no single classifier can adapt to all kinds of datasets. An extra stacked-assembling layer is therefore used in our deep learning network to integrate the outputs of the individual classifiers and approximate the optimal target function. Previous works have proposed majority voting36 and averaging the individual classifiers' outputs.67

In our study, following the deep learning intuition of multiple-layer networks, we define the operating mechanism as follows: the level-0 classifiers' outputs are fed into the level-1 classifier as training data, where level 0 is the original layer and level 1 the next sequential layer. In our network, the outputs of the level-0 classifiers are predicted probability scores, and the level-1 classifier is a logistic regression. When the logistic regression assigns the same weight to every individual classifier, the scheme degenerates to averaging; when only one weight is nonzero, it reduces to selecting a single classifier, as in winner-take-all voting:

P_w(±1 | p) = 1 / (1 + e^{∓w^T p}), (9)

where p is the vector of probability scores output by the individual classifiers and w is the weight vector over the classifiers. The logistic regression implementation is from Scikit-learn.68
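The level-0/level-1 stacking can be sketched with Scikit-learn, which the paper names as its logistic regression implementation. The three simulated score columns below are synthetic stand-ins for the SA-RF, SA-FT-RF, and RPISeq-RF outputs; their noise levels and the toy labels are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical level-0 outputs: probability scores from three predictors
# on 200 RNA-protein pairs, with differing reliability per predictor.
y = rng.integers(0, 2, 200)
noise = lambda s: np.clip(y + rng.normal(0, s, 200), 0, 1)
level0 = np.column_stack([noise(0.45), noise(0.30), noise(0.60)])

# Level-1 combiner: logistic regression over the level-0 scores (Eq. 9).
stacker = LogisticRegression()
stacker.fit(level0, y)
combined = stacker.predict_proba(level0)[:, 1]

# The learned weights reflect how much each level-0 predictor is trusted.
weights = stacker.coef_.ravel()
```

Equal weights would recover simple averaging, while a single dominant weight approaches selecting one classifier, matching the two limiting cases described above.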

Performance Evaluation

In this study, we trained the deep learning model to classify whether an ncRNA and a protein interact with each other. The 5-fold cross-validation method is used to evaluate performance: each dataset is randomly divided into five equal parts, and in each fold one part is taken as the testing set while the other four serve as the training set. The testing and training data do not overlap, guaranteeing an unbiased comparison. We report the mean and SD of these results as the final validation result. We follow the widely used evaluation measures of accuracy (Acc.), sensitivity (Sen.), specificity (Spec.), precision (Prec.), and MCC, defined as:

Acc. = (TP + TN) / (TP + TN + FP + FN) (10)
Sen. = TP / (TP + FN) (11)
Spec. = TN / (TN + FP) (12)
Prec. = TP / (TP + FP) (13)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)), (14)

where TP denotes the number of correctly predicted positives, TN the number of correctly predicted negatives, FP the number of negatives wrongly predicted as positives, and FN the number of positives wrongly predicted as negatives. The receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) are also used to evaluate the classifiers.
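The five measures of Eqs. 10-14 are straightforward to compute from the confusion-matrix counts; the counts in the example below are illustrative only, not from the paper's experiments:

```python
import math

def evaluate(tp, tn, fp, fn):
    """Compute the five measures of Eqs. 10-14 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, sen, spec, prec, mcc

# Toy confusion matrix (not from the paper): 90 TP, 85 TN, 15 FP, 10 FN.
acc, sen, spec, prec, mcc = evaluate(tp=90, tn=85, fp=15, fn=10)
```

Unlike accuracy, the MCC stays informative when the positive and negative classes are imbalanced, which is why it is reported alongside the other four measures.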

Author Contributions

H-C.Y. and Z-H.Y. conceived the algorithm, carried out analyses, prepared the data sets, carried out experiments, and wrote the manuscript. D-S.H., X.L., T-H.J., and L-P.L. wrote the manuscript and analyzed experiments. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under grants 61732012, 61520106006, and 61722212 and in part by the Pioneer Hundred Talents Program of Chinese Academy of Sciences.

Footnotes

Supplemental Information includes two figures and three tables and can be found with this article online at https://doi.org/10.1016/j.omtn.2018.03.001.

Contributor Information

Zhu-Hong You, Email: zhuhongyou@ms.xjb.ac.cn.

De-Shuang Huang, Email: dshuang@tongji.edu.cn.

Supplemental Information

Document S1. Figures S1 and S2 and Tables S1–S3
mmc1.pdf (429.2KB, pdf)
Document S2. Article plus Supplemental Information
mmc2.pdf (1MB, pdf)

References

  • 1.Taft R.J., Pheasant M., Mattick J.S. The relationship between non-protein-coding DNA and eukaryotic complexity. BioEssays. 2007;29:288–299. doi: 10.1002/bies.20544.
  • 2.Esteller M. Non-coding RNAs in human disease. Nat. Rev. Genet. 2011;12:861–874. doi: 10.1038/nrg3074.
  • 3.Li J.H., Liu S., Zhou H., Qu L.H., Yang J.H. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 2014;42:D92–D97. doi: 10.1093/nar/gkt1248.
  • 4.Djebali S., Davis C.A., Merkel A., Dobin A., Lassmann T., Mortazavi A., Tanzer A., Lagarde J., Lin W., Schlesinger F. Landscape of transcription in human cells. Nature. 2012;489:101–108. doi: 10.1038/nature11233.
  • 5.ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
  • 6.Derrien T., Johnson R., Bussotti G., Tanzer A., Djebali S., Tilgner H., Guernec G., Martin D., Merkel A., Knowles D.G. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 2012;22:1775–1789. doi: 10.1101/gr.132159.111.
  • 7.Harrow J., Frankish A., Gonzalez J.M., Tapanari E., Diekhans M., Kokocinski F., Aken B.L., Barrell D., Zadissa A., Searle S. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22:1760–1774. doi: 10.1101/gr.135350.111.
  • 8.Brown C.J., Ballabio A., Rupert J.L., Lafreniere R.G., Grompe M., Tonlorenzi R., Willard H.F. A gene from the region of the human X inactivation centre is expressed exclusively from the inactive X chromosome. Nature. 1991;349:38–44. doi: 10.1038/349038a0.
  • 9.Lee J.T., Davidow L.S., Warshawsky D. Tsix, a gene antisense to Xist at the X-inactivation centre. Nat. Genet. 1999;21:400–404. doi: 10.1038/7734.
  • 10.Brannan C.I., Dees E.C., Ingram R.S., Tilghman S.M. The product of the H19 gene may function as an RNA. Mol. Cell. Biol. 1990;10:28–36. doi: 10.1128/mcb.10.1.28.
  • 11.Sleutels F., Zwart R., Barlow D.P. The non-coding Air RNA is required for silencing autosomal imprinted genes. Nature. 2002;415:810–813. doi: 10.1038/415810a.
  • 12.Rinn J.L., Kertesz M., Wang J.K., Squazzo S.L., Xu X., Brugmann S.A., Goodnough L.H., Helms J.A., Farnham P.J., Segal E., Chang H.Y. Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs. Cell. 2007;129:1311–1323. doi: 10.1016/j.cell.2007.05.022.
  • 13.Kretz M., Siprashvili Z., Chu C., Webster D.E., Zehnder A., Qu K., Lee C.S., Flockhart R.J., Groff A.F., Chow J. Control of somatic tissue differentiation by the long non-coding RNA TINCR. Nature. 2013;493:231–235. doi: 10.1038/nature11661.
  • 14.Khorshid M., Rodak C., Zavolan M. CLIPZ: a database and analysis environment for experimentally determined binding sites of RNA-binding proteins. Nucleic Acids Res. 2011;39:D245–D252. doi: 10.1093/nar/gkq940.
  • 15.Huang Y.A., Chan K., You Z.H. Constructing prediction models from expression profiles for large scale lncRNA-miRNA interaction profiling. Bioinformatics. 2018;34:812–819. doi: 10.1093/bioinformatics/btx672.
  • 16.Li Z., Han P., You Z.H., Li X., Zhang Y., Yu H., Nie R., Chen X. In silico prediction of drug-target interaction networks based on drug chemical structure and protein sequences. Sci. Rep. 2017;7:11174. doi: 10.1038/s41598-017-10724-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Ray D., Kazan H., Chan E.T., Peña Castillo L., Chaudhry S., Talukder S., Blencowe B.J., Morris Q., Hughes T.R. Rapid and systematic analysis of the RNA recognition specificities of RNA-binding proteins. Nat. Biotechnol. 2009;27:667–670. doi: 10.1038/nbt.1550. [DOI] [PubMed] [Google Scholar]
  • 18.Licatalosi D.D., Mele A., Fak J.J., Ule J., Kayikci M., Chi S.W., Clark T.A., Schweitzer A.C., Blume J.E., Wang X. HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature. 2008;456:464–469. doi: 10.1038/nature07488. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Alipanahi B., Delong A., Weirauch M.T., Frey B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015;33:831–838. doi: 10.1038/nbt.3300. [DOI] [PubMed] [Google Scholar]
  • 20.Suresh V., Liu L., Adjeroh D., Zhou X. RPI-Pred: predicting ncRNA-protein interaction using sequence and structural information. Nucleic Acids Res. 2015;43:1370–1379. doi: 10.1093/nar/gkv020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Muppirala U.K., Honavar V.G., Dobbs D. Predicting RNA-protein interactions using only sequence information. BMC Bioinformatics. 2011;12:489. doi: 10.1186/1471-2105-12-489. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Liu B., Fang L., Long R., Lan X., Chou K.C. iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics. 2016;32:362–369. doi: 10.1093/bioinformatics/btv604. [DOI] [PubMed] [Google Scholar]
  • 23.Liu B., Yang F., Huang D.S., Chou K.C. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics. 2017;34:33–40. doi: 10.1093/bioinformatics/btx579. [DOI] [PubMed] [Google Scholar]
  • 24.Liu B., Zhang D., Xu R., Xu J., Wang X., Chen Q., Dong Q., Chou K.C. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics. 2014;30:472–479. doi: 10.1093/bioinformatics/btt709. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Lu Q., Ren S., Lu M., Zhang Y., Zhu D., Zhang X., Li T. Computational prediction of associations between long non-coding RNAs and proteins. BMC Genomics. 2013;14:651. doi: 10.1186/1471-2164-14-651. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Bellucci M., Agostini F., Masin M., Tartaglia G.G. Predicting protein associations with long noncoding RNAs. Nat. Methods. 2011;8:444–445. doi: 10.1038/nmeth.1611. [DOI] [PubMed] [Google Scholar]
  • 27.Agostini F., Zanzoni A., Klus P., Marchese D., Cirillo D., Tartaglia G.G. catRAPID omics: a web server for large-scale prediction of protein-RNA interactions. Bioinformatics. 2013;29:2928–2930. doi: 10.1093/bioinformatics/btt495. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Livi C.M., Blanzieri E. Protein-specific prediction of mRNA binding using RNA sequences, binding motifs and predicted secondary structures. BMC Bioinformatics. 2014;15:123. doi: 10.1186/1471-2105-15-123. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Wang Y., You Z., Li X., Chen X., Jiang T., Zhang J. PCVMZM: Using the probabilistic classification vector machines model combined with a zernike moments descriptor to predict protein-protein interactions from protein sequences. Int. J. Mol. Sci. 2017;18:1029. doi: 10.3390/ijms18051029. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Wang Y.B., You Z.H., Li X., Jiang T.H., Chen X., Zhou X., Wang L. Predicting protein-protein interactions from protein sequences by a stacked sparse autoencoder deep neural network. Mol. Biosyst. 2017;13:1336–1344. doi: 10.1039/c7mb00188f. [DOI] [PubMed] [Google Scholar]
  • 31.Li J.Q., You Z.H., Li X., Ming Z., Chen X. PSPEL: In silico prediction of self-interacting proteins from amino acids sequences using ensemble learning. IEEE/ACM Trans Comput Biol Bioinform. 2017;14:1165–1172. doi: 10.1109/TCBB.2017.2649529. [DOI] [PubMed] [Google Scholar]
  • 32.Liu B., Chen J., Wang X. Application of learning to rank to protein remote homology detection. Bioinformatics. 2015;31:3492–3498. doi: 10.1093/bioinformatics/btv413. [DOI] [PubMed] [Google Scholar]
  • 33.Liu B., Liu F., Wang X., Chen J., Fang L., Chou K.C. Pse-in-One: A web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015;43(Web Server issue):W65–W71. doi: 10.1093/nar/gkv458. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.You Z.H., Huang Z.A., Zhu Z., Yan G.Y., Li Z.W., Wen Z., Chen X. PBMDA: A novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput. Biol. 2017;13:e1005455. doi: 10.1371/journal.pcbi.1005455. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Chen X., Yan C.C., Zhang X., You Z.H. Long non-coding RNAs and complex diseases: from experimental results to computational models. Brief. Bioinform. 2017;18:558–576. doi: 10.1093/bib/bbw060. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Breiman L. Random forests. Mach. Learn. 2001;45:5–32. [Google Scholar]
  • 37.Vapnik V.N. Statistical Learning Theory. Wiley; 1998. [Google Scholar]
  • 38.Pancaldi V., Bähler J. In silico characterization and prediction of global protein-mRNA interactions in yeast. Nucleic Acids Res. 2011;39:5826–5836. doi: 10.1093/nar/gkr160. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Shen J., Zhang J., Luo X., Zhu W., Yu K., Chen K., Li Y., Jiang H. Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA. 2007;104:4337–4341. doi: 10.1073/pnas.0607879104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.You Z.H., Zhou M., Li X., Li S. Highly efficient framework for predicting interactions between proteins. IEEE Trans. Cybern. 2017;47:731–743. doi: 10.1109/TCYB.2016.2524994. [DOI] [PubMed] [Google Scholar]
  • 41.Haddadnia J., Ahmadi M., Faez K. An efficient feature extraction method with pseudo-zernike moment in RBF neural network-based human face recognition system. EURASIP J. Adv. Signal Process. 2003;2003:1–12. [Google Scholar]
  • 42.Ahmad S., Sarai A. PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics. 2005;6:33. doi: 10.1186/1471-2105-6-33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Maaloe L., Arngren M., Winther O. Deep belief nets for topic modeling. arXiv. 2015. [Google Scholar]
  • 44.LeCun Y., Bengio Y., Hinton G. Deep learning. Nature. 2015;521:436–444. doi: 10.1038/nature14539. [DOI] [PubMed] [Google Scholar]
  • 45.Lathauwer L.D., Moor B.D., Vandewalle J. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl. 2000;21:1253–1278. [Google Scholar]
  • 46.Jeong J.C., Lin X., Chen X.W. On position-specific scoring matrix for protein function prediction. IEEE/ACM Trans. Comput. Biol. Bioinformatics. 2011;8:308–315. doi: 10.1109/TCBB.2010.93. [DOI] [PubMed] [Google Scholar]
  • 47.Vincent P., Larochelle H., Lajoie I., Bengio Y., Manzagol P.A. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010;11:3371–3408. [Google Scholar]
  • 48.Pan X., Fan Y.X., Yan J., Shen H.B. IPMiner: hidden ncRNA-protein interaction sequential pattern mining with stacked autoencoder for accurate computational prediction. BMC Genomics. 2016;17:582. doi: 10.1186/s12864-016-2931-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Yuan J., Wu W., Xie C., Zhao G., Zhao Y., Chen R. NPInter v2.0: an updated database of ncRNA interactions. Nucleic Acids Res. 2014;42:D104–D108. doi: 10.1093/nar/gkt1057. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., The Gene Ontology Consortium Gene ontology: tool for the unification of biology. Nat. Genet. 2000;25:25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Huang Y., Niu B., Gao Y., Fu L., Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26:680–682. doi: 10.1093/bioinformatics/btq003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Lewis B.A., Walia R.R., Terribilini M., Ferguson J., Zheng C., Honavar V., Dobbs D. PRIDB: a Protein-RNA interface database. Nucleic Acids Res. 2011;39:D277–D282. doi: 10.1093/nar/gkq1108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Puton T., Kozlowski L., Tuszynska I., Rother K., Bujnicki J.M. Computational methods for prediction of protein-RNA interactions. J. Struct. Biol. 2012;179:261–268. doi: 10.1016/j.jsb.2011.10.001. [DOI] [PubMed] [Google Scholar]
  • 55.Bengio Y., Courville A., Vincent P. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013;35:1798–1828. doi: 10.1109/TPAMI.2013.50. [DOI] [PubMed] [Google Scholar]
  • 56.Zhou J., Troyanskaya O.G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods. 2015;12:931–934. doi: 10.1038/nmeth.3547. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Cook K.B., Hughes T.R., Morris Q.D. High-throughput characterization of protein-RNA interactions. Brief. Funct. Genomics. 2015;14:74–89. doi: 10.1093/bfgp/elu047. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Hinton G.E., Salakhutdinov R.R. Reducing the dimensionality of data with neural networks. Science. 2006;313:504–507. doi: 10.1126/science.1127647. [DOI] [PubMed] [Google Scholar]
  • 59.Le, Q.V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G.S., Dean, J., and Ng, A.Y. (2012). Building high-level features using large scale unsupervised learning. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38115.pdf.
  • 60.Ramsundar B., Kearnes S., Riley P., Webster D., Konerding D., Pande V. Massively multitask networks for drug discovery. arXiv. 2015. [Google Scholar]
  • 61.McHugh C.A., Russell P., Guttman M. Methods for comprehensive experimental identification of RNA-protein interactions. Genome Biol. 2014;15:203. doi: 10.1186/gb4152. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014;15:1929–1958. [Google Scholar]
  • 63.Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 15, 315–323.
  • 64.Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems 1, 1097–1105.
  • 65.Dahl, G.E., Sainath, T.N., and Hinton, G.E. (2013). Improving deep neural networks for LVCSR using rectified linear units and dropout. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8609–8613.
  • 66.Kingma D.P., Ba J. Adam: a method for stochastic optimization. arXiv. 2014. [Google Scholar]
  • 67.Pan X.Y., Tian Y., Huang Y., Shen H.B. Towards better accuracy for missing value estimation of epistatic miniarray profiling data by a novel ensemble approach. Genomics. 2011;97:257–264. doi: 10.1016/j.ygeno.2011.03.001. [DOI] [PubMed] [Google Scholar]
  • 68.Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830. [Google Scholar]


Articles from Molecular Therapy. Nucleic Acids are provided here courtesy of The American Society of Gene & Cell Therapy
