Skip to main content
Computational and Mathematical Methods in Medicine logoLink to Computational and Mathematical Methods in Medicine
. 2017 Jan 24;2017:5043984. doi: 10.1155/2017/5043984

Utilizing Selected Di- and Trinucleotides of siRNA to Predict RNAi Activity

Ye Han 1,2, Yuanning Liu 1,2, Hao Zhang 1,2, Fei He 3,4,5, Chonghe Shu 1,2, Liyan Dong 1,2,*
PMCID: PMC5294759  PMID: 28243313

Abstract

Small interfering RNAs (siRNAs) induce posttranscriptional gene silencing in various organisms. siRNAs targeted to different positions of the same gene show different effectiveness; hence, predicting siRNA activity is a crucial step. In this paper, we developed and evaluated a powerful tool named “siRNApred” with a new mixed feature set to predict siRNA activity. To improve the prediction accuracy, we proposed 2-3NTs as our new features. A Random Forest siRNA activity prediction model was constructed using the feature set selected by our proposed Binary Search Feature Selection (BSFS) algorithm. Experimental data demonstrated that the binding site of the Argonaute protein correlates with siRNA activity. “siRNApred” is effective for selecting active siRNAs, and the prediction results demonstrate that our method can outperform other current siRNA activity prediction methods in terms of prediction accuracy.

1. Introduction

RNA interference (RNAi) is a cellular process whereby double-stranded RNA (dsRNA) leads to posttranscriptional gene silencing through base-pairing interactions and is found in many eukaryotic systems, including plants, fungi, invertebrates, and mammals [14]. In mammalian cells, long dsRNA is processed into short 21–23 nucleotide (nt) dsRNAs known as small interfering RNA (siRNA) and induces instant target gene knockdown [3]. In functional genomic research, RNAi has become very helpful in drug and therapeutic applications [5]. Highly effective siRNAs can be synthesized to design novel drugs for influenza virus [6], HIV virus [7], and cancer [8]. However, Takayuki measured the RNAi activities of siRNAs targeting all positions of a single mRNA in human cells and found that few siRNAs show very high activities [9]. Therefore, predicting siRNA activity is a critical step for the successful implementation of RNAi.

Numerous siRNA-designing algorithms, which can be generally categorized as first-and second-generation algorithms, have been reported to date. The first-generation algorithms are based on small validated siRNA datasets and exploit multiple siRNA features, including GC content [10], base preferences at specific positions [11, 12], thermodynamic stability [13], internal structure [14], and target mRNA secondary structure [1517]. However, a large majority of siRNAs designed by the first-generation algorithms are not very effective [18]. The reason may be that the early datasets are too small to cover all the important features [19].

The second-generation algorithms were developed with the accumulation of validated siRNAs. Huesken developed “Biopredsi” [20] based on artificial neural network and built a major siRNA dataset including 2431 siRNAs through high-throughput analysis technology. A number of siRNA activity prediction algorithms based on machine learning models were built using Huesken's dataset. The algorithms ThermoComposition21 [21], DSIR [22], i-score [23], and Biopredsi were estimated as the best predictors [24]. In addition, Takayuki et al. proposed a complete dataset including the siRNAs targeting all positions of a single mRNA in human cells and developed an algorithm “siExplored.” They found that specific residues at every third position of siRNAs greatly influenced its RNAi activity [9].

The performance of second-generation algorithms heavily depends on the selection of the included features [25]. Because the siRNA sequence is the most important factor that determines RNAi activity, more potential features embedded in siRNA sequences should be exploited to increase prediction accuracy. Takahashi found that when the 2-3 bp RNA at every position of a siRNA sequence were substituted by DNA, the RNAi activity changed [26]. Thus, we consider that the di- and trinucleotides at certain positions of siRNA may correlate with its RNAi activity.

In this paper, we developed a powerful siRNA activity predictor by fusing multiple potential features. Our experimental results demonstrate that siRNA activity is significantly affected by its di- and trinucleotides; thus, we proposed 2-3NTs as our new features. In addition, a new mixed 230-dimensional feature set was formed by combining 191 traditional features and 39 new features. To select the most relevant features, we proposed a Binary Search Feature Selection (BSFS) algorithm. Finally, a Random Forest predictor is constructed using the selected features. At the same time, a user-friendly web server named siRNApred is developed and is available for free at http://www.jlucomputer.com:8080/RNA/. siRNApred showed better performance compared with first-generation and second-generation algorithms. The result suggests that the di- and trinucleotides of siRNA can provide important information for prediction of active siRNAs.

2. Materials and Methods

2.1. Dataset

Huesken's dataset includes [20] 2431 siRNAs targeted to 34 human and rodent mRNAs. The dataset is divided into the 2182-sequence training set (Huesken_train) and 249-sequence testing set (Huesken_test). Three independent datasets from Vickers, Reynolds, and Haborth, including 368 siRNAs, are used for testing [11, 27, 28].

2.2. The Importance of the Di- and Trinucleotides of siRNA

In this section, we first elucidated the importance of our proposed di- and trinucleotides of siRNA on its activity. The di- and trinucleotides of siRNA can be defined as follows:

  • The guide strand of siRNA S = a1, a2,…, ai,…, a21, where 1 ≤ i ≤ 21.

  • adad+1 represents the dinucleotide at position d, where 1 ≤ d ≤ 20.

  • atat+1at+2 represents the trinucleotide at position t, where 1 ≤ t ≤ 19.

All di- and trinucleotides at all positions of siRNA are obtained by a sliding window size of 2-3. Huesken's dataset is divided into two classes: 1218 potent siRNAs with activities greater than 0.7 and 1213 nonpotent siRNAs with activities less than 0.7.

There are 16 2-mer RNA subsequences, that is, AA, AU, etc., and the frequencies of all 2-mer RNA subsequences at positions 1 to 20 are calculated for the two classes. The significance level is calculated by Student's t-test and the 2-mer RNA subsequences with minimal p value are shown in Table 1 (p-value < 0.05).

Table 1.

Primary dinucleotides with minimal p value.

Position Dinucleotide motif Freq
(P)
Freq
(N)
Type of corr. p value
1 UU1 178/1218 25/1213 Positive 9.45e − 30
GG1 36/1218 159/1213 Negative 1.52e − 20
2 UA2 73/1218 32/1213 Positive 4.62e − 5
GC2 48/1218 96/1213 Negative 3.26e − 5
3 AA3 76/1218 53/1213 Positive 0.0397
CC3 57/1218 91/1213 Negative 0.0036
4 UU4 111/1218 69/1213 Positive 0.0013
CC4 60/1218 107/1213 Negative 0.0001
5 AU5 94/1218 56 /1213 Positive 0.0015
CC5 66/1218 102/1213 Negative 0.0036
6 UU6 117/1218 63/1213 Positive 3.19e − 5
CC6 47/1218 110/1213 Negative 1.63e − 7
7 UU7 104/1218 67/1213 Positive 0.0036
CA7 70/1218 120/1213 Negative 0.0001
8 CG8 32/1218 51/1213 Negative 0.0323
9 CA9 108/1218 66/1213 Positive 0.0010
GU9 56/1218 84/1213 Negative 0.0138
10 AU10 101/1218 62/1213 Positive 0.0017
CC10 63/1218 96/1213 Negative 0.0062
11 AA11 74/1218 46/1213 Positive 0.0094
GG11 78/1218 111/1213 Negative 0.0114
12 CG12 32/1218 56/1213 Negative 0.0086
13 AU13 108/1218 65/1213 Positive 0.0008
GG13 59/1218 114/1213 Negative 1.22e − 5
14 UU14 105/1218 72/1213 Positive 0.0108
GG14 60/1218 110/1213 Negative 6.10e − 5
15 CA15 113/1218 74/1213 Positive 0.0033
GG15 72/1218 108/1218 Negative 0.0048
16 AC16 82/1218 46/1213 Positive 0.0012
GG16 68/1218 137/1213 Negative 3.82e − 7
17 AC17 80/1218 45/1213 Positive 0.0014
GA17 51/1218 95/1213 Negative 0.0002
18 UC18 114/1218 69/1213 Positive 0.0006
AA18 29/1218 87/1213 Negative 2.76e − 8
19 CU19 124/1218 53/1213 Positive 3.23e − 8
AC19 30/1218 63/1213 Negative 0.0004
20 UG20 146/1218 67/1213 Positive 1.59e − 8
CC20 52/1218 101/1213 Negative 3.73e − 5

Table 1 shows that the 2-mer RNA subsequences that appeared most often as potent were different than those that appeared most often as nonpotent siRNAs. We found that “UU” occurred more often than other 2-mer RNA subsequences in potent siRNAs, whereas “GG” and “CC” appeared most often in nonpotent siRNAs. Most of the “UU” 2-mers were found at positions 1, 4, 6, and 7 of potent siRNAs. In nonpotent siRNAs, “GG” often occurred at positions 1, 13, 14, 15, and 16 and “CC” often occurred at positions 3, 4, 5, 6, and 20.

There are 64 3-mer RNA subsequences, that is, AAA, AAU, etc. In addition, the frequencies of all 3-mer RNA subsequences at positions 1 to 19 are calculated for the two classes. The significance level is calculated by Student's t-test and the 3-mer RNA subsequences with minimal p value are shown in Table 2 (p value < 0.05).

Table 2.

Primary trinucleotides with minimal p value.

Position Trinucleotide motif Freq
(P)
Freq
(N)
Type of corr. p value
1 UUG1 52/1218 5/1213 Positive 9.48E − 10
GGG1 4/1218 50/1213 Negative 1.90E − 10
2 UUA2 14/1218 4/1213 Positive 0.0184
GCC2 10/1218 33/1213 Negative 0.0004
3 AUU3 28/1218 9/1213 Positive 0.0009
CAC3 9/1218 29/1213 Negative 0.0005
4 UAU4 19/1218 5/1213 Positive 0.0021
CCA4 19/1218 41/1213 Negative 0.0019
5 AUU5 29/1218 11 /1213 Positive 0.0021
CCC5 6/1218 30/1213 Negative 2.59E − 05
6 UUU6 40/1218 12/1213 Positive 4.53E − 05
CCA6 10/1218 41/1213 Negative 5.20E − 06
7 UCU7 37/1218 18/1213 Positive 0.005
CGU7 3/1218 16/1213 Negative 0.0013
8 ACA8 29/1218 13/1213 Positive 0.0066
AAU8 8/1218 28/1213 Negative 0.0004
9 CAA9 26/1218 7/1213 Positive 0.0004
AUU9 12/1218 30/1213 Negative 0.0024
10 ACA10 35/1218 11/1213 Positive 0.0002
CGA10 2/1218 12/1213 Negative 0.0036
11 CUA11 32/1218 13/1213 Positive 0.0022
GCG11 6/1218 23/1213 Negative 0.0007
12 AUU12 30/1218 11/1213 Positive 0.0014
GGG12 9/1218 31/1213 Negative 0.0002
13 UUU13 33/1218 16/1213 Positive 0.0074
CCG13 6/1218 20/1213 Negative 0.0028
14 CCA14 36/1218 16/1213 Positive 0.0026
CCC14 6/1218 21/1213 Negative 0.0018
15 UAU15 16/1218 4/1213 Positive 0.0036
UGG15 19/1218 46/1218 Negative 0.0003
16 ACU16 31/1218 12/1213 Positive 0.0018
CGA16 1/1218 10/1213 Negative 0.0032
17 CUG17 49/1218 21/1213 Positive 0.0004
GUU17 9/1218 34/1213 Negative 5.57E − 05
18 UCU18 43/1218 11/1213 Positive 5.54E − 06
AAA18 8/1218 28/1213 Negative 0.0004
19 CUG19 61/1218 16/1213 Positive 9.70E − 08
AGA19 7/1218 31/1213 Negative 4.05E − 05

The results demonstrate that di- and trinucleotides of siRNAs at certain positions can be used as indicators to distinguish between potent siRNAs and nonpotent siRNAs and can possibly be used as a potential feature for siRNA activity prediction.

2.3. Feature Extraction

A total of 230 features are extracted in this section for siRNA activity prediction. These features include 2-3NTs, thermodynamic stability, nucleotide representation, and nucleotide compositions.

2.3.1. 2-3NTs

2-3NTs are categorical features extracted from the di- and trinucleotides of siRNAs.

We defined the feature vector X2NT including 20 categorical features extracted from the dinucleotides of siRNA as follows:

X2NT=Ca1a2,,Capositionaposition+1,,Ca20a21, (1)

where 1 ≤ position ≤ 20.

The categorical feature C(apositionaposition+1) is calculated using the following formula:

Capositionaposition+1=f1×4+s, (2)

where

f=1if  aposition=A2if  aposition=Uor  aposition=T3if  aposition=G4if  aposition=C,s=1if  aposition+1=A2if  aposition+1=Uor  aposition+1=T3if  aposition+1=G4if  aposition+1=C. (3)

Then, the feature vector X3NT, which includes 19 categorical features, is extracted from the trinucleotides of siRNA as follows:

X3NT=Ca1a2a3,,Capositionaposition+1aposition+2,,Ca19a20a21, (4)

where 1 ≤ position ≤ 19.

The categorical feature C(apositionaposition+1aposition+2) is calculated using the following formula:

Capositionaposition+1aposition+2=f1×16+s1×4+t, (5)

where

f=1if  aposition+1=A2if  aposition+1=Uoraposition+1=T3if  aposition+1=G4if  aposition+1=C,s=1if  aposition+1=A2if  aposition+1=Uor  aposition+1=T3if  aposition+1=G4if  aposition+1=C,t=1if  aposition+1=A2if  aposition+1=Uor  aposition+1=T3if  aposition+1=G4if  aposition+1=C. (6)

2.3.2. Thermodynamic Stability

The thermodynamic stability of siRNA may influence the strand selection in the process of RNAi; thus it would influence the RNAi activity [23]. ΔGduplex is the sum of all the siRNA local duplex stability. The siRNA local duplex stability is calculated for every two base pairs along the siRNA duplex and the thermodynamic parameters for calculations were supplied by Xia et al. [29]. The ΔΔG is the ΔG difference of duplex formation at the 5′ and 3′ ends of siRNA for 5 terminal nucleotides.

2.3.3. Nucleotide Representation

Preferred nucleotides at specific positions are important indicators for activity prediction [21]. For example, the nucleotides at the first position of potent siRNAs were most often A or U, while C often appeared at positions 7 and 11 in nonpotent siRNAs [11, 20]. We defined the siRNA as a 21-dimensional vector and indicated the nucleotides at all positions. A, U, G, and C were digitized as 0.1, 0.2, 0.3, and 0.4.

2.3.4. Nucleotide Compositions

The compositions of short motifs of 1–3 nt in siRNA and mRNA contained relevant information for activity prediction [30, 31]. There are 4, 16, and 64 possible subsequences for all 1-mer, 2-mer, and 3-mer RNAs, respectively. Thus, there are 168 features extracted from nucleotide compositions.

2.4. Model Construction

Random Forest (RF) [32] is an ensemble learning method for classification and regression by growing a collection of trees. In the process of regression, the trees are constructed using a training set with M variables. m variables from these M input variables are selected for the construction of an individual tree. The mean prediction of the individual tree will be output when the testing samples are pushed down these trees. Because the RF algorithm can randomly select features to build the ensemble of trees, it has stronger robustness than other methods. In this paper, the RF algorithm was used to develop siRNA activity prediction model.

2.5. Feature Selection

We combined 39 2-3NTs, 2 thermodynamic stabilities, 21 nucleotide representations, and 168 nucleotide compositions to obtain a 230-dimensional feature vector. Since the contributions of these features are different, we proposed BSFS algorithm based on RF-variable importance to select the optimal feature set. The process of the algorithm is shown as follows.

Firstly, all features are ranked in descending order according to its z-score. The z-score is calculated by the RF algorithm to measure the feature importance [32]. To get the z-score, Variable Importance (VI) should be first calculated.

VI of the jth variable was calculated according to the mean decrease in classification accuracy after permuting values of variable xj over all trees. The VI(xj) of each tree t is computed as follows:

VItxj=iβ¯tIyi=y^itβtiβ¯tIyi=y^i,πjtβt, (7)

where β-(t) is OOB samples of tree t.

y^it=ftxi, (8)

where xi is the variable value and y^i(t) is predicted class before permutation.

y^i,πjt=ftxi,πj, (9)

where xi,πj = (xi,1,…, xi,j−1, xπj(i),j, xi,j+1,…, xi,p) is the variable value after randomly permuting the jth variable and y^i,πj(t) is the predicted class after permutation.

Please note that if Xj is not in the tree t, then VI(t)(xj) = 0.

Over all trees, VI(xj) is defined as follows:

VIxj=t=1ntreeVItxjntree, (10)

where n tree is the number of trees in the Random Forest.

Finally, the z-score of the jth feature is defined as follows:

z-scorej=VIxjσ^/ntree, (11)

where σ^ is the standard deviation of the raw importance.

Secondly, the first k features are selected as the optimal features. Set k < m and the calculation process of threshold k is summarized in Algorithm 1.

Algorithm 1.

Algorithm 1

The calculation process of threshold k.

2.6. Model Performance Evaluation

As a validation step, we used the Pearson Correlation Coefficient (PCC) to describe the correlation between experimentally determined and predicted siRNA activity. It may be defined as follows:

PCC=1n1i=1nXiXσXYiYσY, (12)

where n is the sample size and X- and σX are the average value and standard deviation, respectively.

In addition, the Receiver Operating Characteristic (ROC) curve is applied to illustrate the performance of a binary classifier system by plotting sensitivity (Y axis) against 1 − specificity (X axis) at various threshold settings.

Sensitivity=TPTP+FN,Specificity=TNTN+FP, (13)

where TN is the number of true negatives, FN is the number of false negatives, TP is the number of true positives, and FP is the number of false positives.

The area under the ROC curve (AUC) is a single measurement of the algorithm's overall performance, and AUC of 1 and 0.5 represents perfect classification and random classification, respectively.

3. Results and Discussion

3.1. Performance of the 2-3NTs Features

To investigate the importance of di- and trinucleotides of siRNA, we learn two RF regression models trained using Huesken_train and tested on Huesken_test. “model 1” is constructed with 2 thermodynamic stabilities, 21 nucleotide representations, and 168 nucleotide compositions, which are often used for siRNA activity prediction [24]. Then, “model 2” which extended “model 1” by considering 39 2-3NTs was constructed for comparisons.

The experimental prediction results are shown in Figure 1, and the PCC between the observed and predicted siRNA activities for model 1 and model 2 are 0.671 and 0.704, respectively. The prediction efficacy achieved 4.92% improvement after adding the new proposed features. It validates that 2-3NTs are important features for the prediction of siRNA activity.

Figure 1.

Figure 1

Comparison between model 1 and model 2. Observed siRNA activities of the Huesken_test are plotted against predicted siRNA activities by model 1 (a) and model 2 (b).

3.2. Feature Selection Result

The optimal feature set is obtained by our proposed BSFS algorithm. The details of this algorithm are shown in Section 2.5.

Table 3 shows the threshold “k” and the prediction accuracy “PCC” of our model with the top k features for all steps. The results show that, when k = 57, the PCC of our model reaches a maximum of 0.722. Thus, we choose k = 57 as the threshold of the feature selection algorithm.

Table 3.

The performance of our model with the top k features.

Number of features (k) Pearson Correlation Coefficient (PCC)
1 230 0.705
2 230/2 = 115 0.713
3 115/2 = 57 0.722
4 57/2 = 28 0.712
5 28 + (57 − 28)/2 = 42 0.720
6 42 + (57 − 42)/2 = 49 0.721
7 49 + (57 − 49)/2 = 53 0.721
8 53 + (57 − 53)/2 = 55 0.719
9 55 + (57 − 55)/2 = 56 0.721

As shown in Figure 2, 57 features are selected by the BSFS algorithm and ranked in descending order according to z-score. The higher the z-score, the stronger the predictive ability of the feature. There are ten features proposed by our paper in the selective feature set, including the trinucleotides at positions 1, 2, 7, 18, and 19 and the dinucleotides at positions 1, 2, 8, and 19. Significantly, Takahashi noted the terminal bps of RNA (positions 19–21) provide Argonaute protein binding sites [26]. Our results show that “CUG” occurred most often at this position in potent siRNAs. The Argonaute protein is the endonuclease of RNA-induced silencing complexes (RISC) and cleaves the target mRNA whose sequence is complementary to the guide strand of siRNA [26]. We consider that, because the trinucleotide at position 19 is the binding site of the Argonaute protein, it will influence siRNA activity. However, further experiments are needed to validate if the Argonaute protein prefers to bind to potent siRNAs with specific trinucleotides at position 19.

Figure 2.

Figure 2

The 57 features selected by the BSFS method.

Some other features previously proven to be associated with silencing efficacy are selected, including the nucleotides at positions 1, 2, 7 and 19; thermodynamic stability ΔGduplex and ΔΔG; and U%, GGG%, C%, G%, CC%, GG%, GGC%, UGA%, CG%, GCC%, UC%, ACU%, UUC%, AA%, UU%, CGG%, AUG%, AG%, and AGA% of siRNA; AAU%, UUG%, GGG%, AAA%, ACA%, GU%, GCA%, CGU%, GCU%, CU%, GC%, CCG%, AGU%, CGA%, UA%, AU%, UAU%, UAA%, CUC%, GCG%, CUU%, AUU%, and CAU% of mRNA. Graphical boxplots are shown in Figure 3 to display the spread of potent and nonpotent siRNAs for the top 15 features.

Figure 3.

Figure 3

Boxplots of the top 15 features. For each plot, the left side represents potent siRNAs, and the right side represents nonpotent siRNAs.

3.3. Comparison of Algorithms

After finding the optimal feature set, the final model, siRNApred, was created. The parameters N and Mtry are the number of decision trees to be grown in the forest and the number of variables to split at each node, respectively. The default N and Mtry are 500 and D/3. D is the number of features. To find the optimal parameters, we used a grid search method with the step size of 100 and 1. The final results are N = 1000 and Mtry = 24. The PCC between the observed and predicted siRNA activities of our model with these parameters is 0.722, which is a 1.7% improvement compared to the model with default parameters. However, the results are not sensitive to Mtry over the range 24–30 according to our experimental results.

To test the performance of siRNApred, we compared our model with the most state-of-the-art methods for siRNA activity prediction recently reported in the literature. Two experiments were carried out in the same conditions and the comparative evaluation is as follows.

First, our method was compared with Biopredsi [20], i-score [23], ThermoComposition-21 [21], and DSIR [22]. All the algorithms were trained using Huesken_train and tested on Huesken_test. Table 4 shows that the PCC between observed and predicted siRNA activities of our model tested on Huesken_test is 0.722, which is 9.39%, 10.39%, 9.56%, and 7.76% higher than the other four algorithms.

Table 4.

PCC between observed and predicted siRNA activities for five algorithms.

Method PCC (r)
Biopredsi 0.660
i-score 0.654
ThermoComposition-21 0.659
DSIR 0.670
siRNApred 0.722

In addition, the ROC curves combining both sensitivity and specificity of the five methods are plotted (Figure 4). For ROC analysis, siRNAs that produce at least 70% target gene knockdown were accepted as active siRNAs, and those below 70% were considered inactive siRNA. We calculated an AUC of 0.898 for our model, which is better than those obtained from Biopredsi, i-score, ThermoComposition-21, and DSIR.

Figure 4.

Figure 4

ROC curves of the five algorithms.

In siRNA design, more inactive siRNAs predicted as active siRNAs will increase the experimental cost, so siRNA design tools are expected to be capable of rejecting as many false positives as possible and retain the maximum number of true positives. Consequently, we should focus on the area that has higher specificity and compare the sensitivities among different algorithms in this area. Figure 4 shows that in the higher specificity area, siRNApred outperforms all other algorithms. Table 5 shows two group sensitivities of all the algorithms. When the specificity of all algorithms is 96.5%, the sensitivity of our method is 51.9%. The value is higher than Biopredsi, i-score, ThermoComposition-21, and DSIR, which is 16.3%, 24.4%, 28.9%, and 20%, respectively. Our model also performs best when the specificity of all the algorithms is 99.1%. The results demonstrate that our method had more advantages than the other four algorithms for siRNA design.

Table 5.

The five algorithms' sensitivities in the high specificity area.

Method Sensitivity
(96.5% specificity)
Sensitivity
(99.1% specificity)
siRNApred 51.9% 29.6%
Biopredsi 16.3% 8.1%
i-score 24.4% 6.7%
ThermoComposition-21 28.9% 18.5%
DSIR 20.0% 10.4%

A second experiment was conducted to compare our model with the other nine models, including the first-generation siRNA design algorithms Reynolds [11], Ui-Tei [14], Amarzguioui [12], Katoh [9], Hsieh [33], and Takasaki [34] and the second-generation algorithms Biopredsi, i-score, ThermoComposition-21, and DSIR. All the algorithms were trained on Huesken_train and tested on the three independent datasets of Vickers, Reynolds, and Harborth.

Figure 5 shows that siRNApred achieves the highest PCC compared to all nine models on all three independent testing datasets and obtained a higher AUC except when tested on Vickers' dataset. Otherwise, siRNApred produces more stable results across each of the independent siRNA datasets. In addition, the results show that both the PCC and AUC of the first-generation siRNA design algorithms are lower than the second-generation algorithms.

Figure 5.

Figure 5

Comparisons of ten algorithms using the three independent datasets of Vickers, Reynolds, and Harborth.

It was found that siRNApred is more stable and effective than other models in the two experiments. The reason may be that our model takes account into the influence of di- and trinucleotides and removes several redundant features. The comparison results demonstrated that prediction accuracy can be improved significantly when considering the 2-3NTs of siRNA guide strand.

4. Conclusions

Activity prediction of siRNA is a critical step for the successful implementation of RNAi. In this study, we introduced 2-3NTs as our new features. A new mixed 230-dimensional feature set was formed by combining 191 traditional features and our 39 proposed features. Since there were many potential features, the BSFS method based on RF-variable importance was proposed to select the optimal feature set. A total of 57 features were selected as input vectors of the RF model to predict siRNA activity, and nine of our proposed features were included. Significantly, the trinucleotide motif at position 19 was included in the selected feature set, which is the binding site of the Argonaute protein. We found that “CUG” occurred most often at position 19 of potent siRNAs. Further experiments are needed to validate if the Argonaute protein prefers to bind to potent siRNAs possessing a specific trinucleotide at position 19. Finally, we describe a highly accurate and reliable tool called “siRNApred.” It can design effective siRNAs for an input mRNA using an optimal feature set. The experimental comparative evaluation on commonly used datasets showed that siRNApred produced better results than first-generation and second-generation siRNA design methods. Consequently, we consider siRNApred a worthy tool for efficient siRNA design.

Acknowledgments

The authors would like to acknowledge the support of the National Natural Science Foundation of China (NSFC) under Grant no. 61471181, Natural Science Foundation of Jilin Province under Grant nos. 20140101194JC and 20150101056JC, the Fundamental Research Funds for the Central Universities under Grant no. 2412016KJ033, and the open project program of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, under Grant no. 93K172016K04.

Competing Interests

The authors declare that they have no competing interests.

References

  • 1.Timmons L., Fire A. Specific interference by ingested dsRNA. Nature. 1998;395(6705):p. 854. doi: 10.1038/27579. [DOI] [PubMed] [Google Scholar]
  • 2.Montgomery M. K., Xu S., Fire A. RNA as a target of double-stranded RNA-mediated genetic interference in Caenorhabditis elegans. Proceedings of the National Academy of Sciences of the United States of America. 1998;95(26):15502–15507. doi: 10.1073/pnas.95.26.15502. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Elbashir S. M., Harborth J., Lendeckel W., Yalcin A., Weber K., Tuschl T. Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature. 2001;411(6836):494–498. doi: 10.1038/35078107. [DOI] [PubMed] [Google Scholar]
  • 4.Novina C. D., Sharp P. A. The RNAi revolution. Nature. 2004;430(6996):161–164. doi: 10.1038/430161a. [DOI] [PubMed] [Google Scholar]
  • 5.Aagaard L., Rossi J. J. RNAi therapeutics: principles, prospects and challenges. Advanced Drug Delivery Reviews. 2007;59(2-3):75–86. doi: 10.1016/j.addr.2007.03.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.McMillen C. M., Beezhold D. H., Blachere F. M., Othumpangat S., Kashon M. L., Noti J. D. Inhibition of influenza A virus matrix and nonstructural gene expression using RNA interference. Virology. 2016;497:171–184. doi: 10.1016/j.virol.2016.07.019. [DOI] [PubMed] [Google Scholar]
  • 7.Wang F., Sun Y., Ruan J., et al. Using small RNA deep sequencing data to detect human viruses. BioMed Research International. 2016;2016:9. doi: 10.1155/2016/2596782.2596782 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Wang T., Shigdar S., Shamaileh H. A., et al. Challenges and opportunities for siRNA-based cancer treatment. Cancer Letters. 2017;387(28):77–83. doi: 10.1016/j.canlet.2016.03.045. [DOI] [PubMed] [Google Scholar]
  • 9.Katoh T., Suzuki T. Specific residues at every third position of siRNA shape its efficient RNAi activity. Nucleic Acids Research. 2007;35(4, article no. e27) doi: 10.1093/nar/gkl1120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Elbashir S. M., Harborth J., Weber K., Tuschl T. Analysis of gene function in somatic mammalian cells using small interfering RNAs. Methods. 2002;26(2):199–213. doi: 10.1016/S1046-2023(02)00023-3. [DOI] [PubMed] [Google Scholar]
  • 11.Reynolds A., Leake D., Boese Q., Scaringe S., Marshall W. S., Khvorova A. Rational siRNA design for RNA interference. Nature Biotechnology. 2004;22(3):326–330. doi: 10.1038/nbt936. [DOI] [PubMed] [Google Scholar]
  • 12.Amarzguioui M., Prydz H. An algorithm for selection of functional siRNA sequences. Biochemical and Biophysical Research Communications. 2004;316(4):1050–1058. doi: 10.1016/j.bbrc.2004.02.157. [DOI] [PubMed] [Google Scholar]
  • 13.Khvorova A., Reynolds A., Jayasena S. D. Functional siRNAs and miRNAs exhibit strand bias. Cell. 2003;115(2):209–216. doi: 10.1016/S0092-8674(03)00801-8. [DOI] [PubMed] [Google Scholar]
  • 14.Ui-Tei K., Naito Y., Takahashi F., et al. Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference. Nucleic Acids Research. 2004;32(3):936–948. doi: 10.1093/nar/gkh247. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Schubert S., Grünweller A., Erdmann V. A., Kurreck J. Local RNA target structure influences siRNA efficacy: systematic analysis of intentionally designed binding regions. Journal of Molecular Biology. 2005;348(4):883–893. doi: 10.1016/j.jmb.2005.03.011. [DOI] [PubMed] [Google Scholar]
  • 16.Luo K. Q., Chang D. C. The gene-silencing efficiency of siRNA is strongly dependent on the local structure of mRNA at the targeted region. Biochemical and Biophysical Research Communications. 2004;318(1):303–310. doi: 10.1016/j.bbrc.2004.04.027. [DOI] [PubMed] [Google Scholar]
  • 17.Yiu S. M., Wong P. W. H., Lam T. W., et al. Filtering of ineffective siRNAs and improved siRNA design tool. Bioinformatics. 2005;21(2):144–151. doi: 10.1093/bioinformatics/bth498. [DOI] [PubMed] [Google Scholar]
  • 18.Ren Y., Gong W., Xu Q., et al. siRecords: an extensive database of mammalian siRNAs with efficacy ratings. Bioinformatics. 2006;22(8):1027–1028. doi: 10.1093/bioinformatics/btl026. [DOI] [PubMed] [Google Scholar]
  • 19.Sætrom P., Snøve O., Jr. A comparison of siRNA efficacy predictors. Biochemical and Biophysical Research Communications. 2004;321(1):247–253. doi: 10.1016/j.bbrc.2004.06.116. [DOI] [PubMed] [Google Scholar]
  • 20.Huesken D., Lange J., Mickanin C., et al. Design of a genome-wide siRNA library using an artificial neural network. Nature Biotechnology. 2005;23(8):995–1001. doi: 10.1038/nbt1118. [DOI] [PubMed] [Google Scholar]
  • 21.Shabalina S. A., Spiridonov A. N., Ogurtsov A. Y. Computational models with thermodynamic and composition features improve siRNA design. BMC Bioinformatics. 2006;7, article no. 65 doi: 10.1186/1471-2105-7-65. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Vert J.-P., Foveau N., Lajaunie C., Vandenbrouck Y. An accurate and interpretable model for siRNA efficacy prediction. BMC Bioinformatics. 2006;7, article no. 520 doi: 10.1186/1471-2105-7-520. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Ichihara M., Murakumo Y., Masuda A., et al. Thermodynamic instability of siRNA duplex is a prerequisite for dependable prediction of siRNA activities. Nucleic Acids Research. 2007;35(18, article no. e123) doi: 10.1093/nar/gkm699. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Matveeva O., Nechipurenko Y., Rossi L., et al. Comparison of approaches for rational siRNA design leading to a new efficient and transparent method. Nucleic Acids Research. 2007;35(8, article no. e63) doi: 10.1093/nar/gkm088. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Thang B. N., Ho T. B., Kanda T. A semi-supervised tensor regression model for siRNA efficacy prediction. BMC Bioinformatics. 2015;16(1, article no. 80) doi: 10.1186/s12859-015-0495-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Takahashi T., Zenno S., Ishibashi O., Takizawa T., Saigo K., Ui-Tei K. Interactions between the non-seed region of siRNA and RNA-binding RLC/RISC proteins, Ago and TRBP, in mammalian cells. Nucleic Acids Research. 2014;42(8):5256–5269. doi: 10.1093/nar/gku153. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Harborth J., Elbashir S. M., Vandenburgh K., et al. Sequence, chemical, and structural variation of small interfering RNAs and short hairpin RNAs and the effect on mammalian gene silencing. Antisense and Nucleic Acid Drug Development. 2003;13(2):83–105. doi: 10.1089/108729003321629638. [DOI] [PubMed] [Google Scholar]
  • 28.Vickers T. A., Koo S., Bennett C. F., Crooke S. T., Dean N. M., Baker B. F. Efficient reduction of target RNAs by small interfering RNA and RNase H-dependent antisense agents. A comparative analysis. Journal of Biological Chemistry. 2003;278(9):7108–7118. doi: 10.1074/jbc.m210326200. [DOI] [PubMed] [Google Scholar]
  • 29.Xia T., SantaLucia J., Jr., Burkard M. E., et al. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry. 1998;37(42):14719–14735. doi: 10.1021/bi9809425. [DOI] [PubMed] [Google Scholar]
  • 30.Teramoto R., Aoki M., Kimura T., Kanaoka M. Prediction of siRNA functionality using generalized string kernel and support vector machine. FEBS Letters. 2005;579(13):2878–2882. doi: 10.1016/j.febslet.2005.04.045. [DOI] [PubMed] [Google Scholar]
  • 31.Liu Y., Chang Y., Zhang C., et al. Influence of mRNA features on siRNA interference efficacy. Journal of Bioinformatics and Computational Biology. 2013;11(3) doi: 10.1142/S0219720013410047.1341004 [DOI] [PubMed] [Google Scholar]
  • 32.Breiman L. Random forests. Machine Learning. 2001;45(1):5–32. doi: 10.1023/A:1010933404324. [DOI] [Google Scholar]
  • 33.Hsieh A. C., Bo R., Manola J., et al. A library of siRNA duplexes targeting the phosphoinositide 3-kinase pathway: determinants of gene silencing for use in cell-based screens. Nucleic Acids Research. 2004;32(3):893–901. doi: 10.1093/nar/gkh238. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Takasaki S., Kotani S., Konagaya A. An effective method for selecting siRNA target sequences in mammalian cells. Cell Cycle. 2004;3(6):790–795. [PubMed] [Google Scholar]

Articles from Computational and Mathematical Methods in Medicine are provided here courtesy of Wiley

RESOURCES