Table 1. Comparison of different methods for RNA secondary structure prediction.
| Method | Architecture: # free tied parameters (6 bps) | Architecture: # free tied parameters (16 bps) | Scoring scheme | Parameterization | Training datasets | Folding method | Best F (%), TestSetA | Best F (%), TestSetB |
|---|---|---|---|---|---|---|---|---|
| g6 | 11 | 21 | probabilistic | ML | TrainSetA+2*TrainSetB | c-MEA | 49.1 | 47.5 |
| ●basic grammar | 532 | 572 | probabilistic | ML | TrainSetA+2*TrainSetB | c-MEA | 56.9 | 56.5 |
| ⋄CONTRAfold v2.02 | ~300 | - | weights | CML | S-Processed-TRA | c-MEA | 57.2 | 57.9 |
| ●CONTRAfoldG | 1,278 | 5,448 | probabilistic | ML | TrainSetA+2*TrainSetB | c-MEA | 58.3 | 58.6 |
| ⋄UNAFold-3.8 | ~3,500 | - | thermodynamic | fit to exp. data | - | CYK | 51.0 | 51.3 |
| ⋄Simfold BL* | ~as above | - | weights | CML | S-Processed-TRA | CYK | 56.5 | 55.3 |
| ⋄RNAstructure v5.2 | ~12,700 | - | thermodynamic | fit to exp. data | - | GCE | 53.5 | 53.8 |
| ⋄ViennaRNA v1.8.4 | ~as above | - | thermodynamic | fit to exp. data | - | GCE | 53.7 | 54.3 |
| ●ViennaRNAG | 14,307 | 90,497 | probabilistic | ML | TrainSetA+2*TrainSetB | c-MEA | 60.2 | 59.4 |
| ●ViennaRNAGz_bulge2_ld_mdangle | 14,557 | 91,997 | probabilistic | ML | TrainSetA+2*TrainSetB | c-MEA | 60.5 | 59.5 |
| ⋄ContextFold v1.00 | 205,000 | - | weights | online CML | S-Full | CYK | 64.4 | 49.0 |
Models. Models marked with a “⋄” are versioned stand-alone packages. Models marked with a “●” are CFGs (with alternative scoring schemes) introduced in reference 39. In particular, ViennaRNAG is a CFG that, when parameterized with thermodynamic scores, reproduces the ViennaRNA v1.8.4 method, and CONTRAfoldG is another CFG that, when parameterized with particular scores, reproduces CONTRAfold v2.02. Here, we present the results of probabilistic parameterizations for those grammars.

Parameters. Methods are ordered by increasing number of parameters. Here we report the effective free parameters after tying. (The number of parameters for some of the native thermodynamic methods is only approximate and corresponds to two different versions of the nearest-neighbor model.)

Test sets. TestSetA is a well-curated collection of sequences from about 10 bona fide RNA structures. TestSetB includes a collection of about 22 different RNA structures obtained from Rfam v10.0. TestSetA and TestSetB are structurally dissimilar, and they have been defined in reference 39.

Performance accuracy. We use F (the harmonic mean of sensitivity and positive predictive value), such that an F of 100% would mean perfect prediction. Performance accuracy is calculated for the entire test set of sequences (instead of averaging the accuracy of each individual sequence). This “total” measure tends to be smaller than the value obtained by averaging over sequences because it corrects for the (usually abundant) short sequences in the test sets, for which prediction is much easier than for longer sequences. For methods that use an MEA algorithm with a tunable parameter (both c-MEA, ref. 31, and GCE, ref. 36), this table reports the “best F” in the ROC curve between sensitivity and positive predictive value (see ref. 39 for more details).

Training sets. Provenance of training sets is as follows: TrainSetA+2*TrainSetB (ref. 39), S-Processed-TRA (ref. 33), S-Full (ref. 34).
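
The difference between the “total” F used in this table and an F averaged over individual sequences can be illustrated with a minimal sketch. The function names (`f_measure`, `total_f`, `mean_f`) and the toy base-pair sets below are assumptions made up for illustration; they are not taken from the benchmark or from any of the packages in the table. Structures are represented as sets of (i, j) base pairs, and a predicted pair counts as correct only if it matches a true pair exactly.

```python
from typing import List, Set, Tuple

Pair = Tuple[int, int]                       # a base pair (i, j)
Example = Tuple[Set[Pair], Set[Pair]]        # (true pairs, predicted pairs)


def f_measure(tp: int, n_true: int, n_pred: int) -> float:
    """F in percent from pair counts.

    With sensitivity = tp / n_true and PPV = tp / n_pred, the harmonic
    mean 2*sen*ppv / (sen + ppv) simplifies to 2*tp / (n_true + n_pred).
    """
    if tp == 0:
        return 0.0
    return 100.0 * 2 * tp / (n_true + n_pred)


def total_f(dataset: List[Example]) -> float:
    """Pool the counts over the whole test set, then compute one F."""
    tp = n_true = n_pred = 0
    for true_pairs, pred_pairs in dataset:
        tp += len(true_pairs & pred_pairs)
        n_true += len(true_pairs)
        n_pred += len(pred_pairs)
    return f_measure(tp, n_true, n_pred)


def mean_f(dataset: List[Example]) -> float:
    """Compute F per sequence, then average over sequences."""
    scores = [f_measure(len(t & p), len(t), len(p)) for t, p in dataset]
    return sum(scores) / len(scores)


# Toy data (hypothetical): a short hairpin predicted perfectly, and a
# longer structure where only half of the predicted pairs are correct.
true_short = {(1, 10), (2, 9)}
pred_short = {(1, 10), (2, 9)}
true_long = {(i, 40 - i) for i in range(1, 9)}
pred_long = {(i, 40 - i) for i in range(1, 5)} | {(i, 30 - i) for i in range(10, 14)}

dataset = [(true_short, pred_short), (true_long, pred_long)]

print(f"total F = {total_f(dataset):.1f}%")
print(f"mean F  = {mean_f(dataset):.1f}%")
```

In this toy case the pooled F is lower than the per-sequence average, because the easy short sequence contributes few base pairs to the pooled counts but carries the same weight as the long one in the average.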