Table 1.
Category | Manual [b] rms | Manual [b] n | Automated [c] rms | Automated [c] n | Automated-Plus [d] rms | Automated-Plus [d] sdev | Automated-Plus [d] n |
---|---|---|---|---|---|---|---|
Hydrogen | |||||||
Cross-validated | 0.12 | 12,131 | 0.12 | 11,953 | 0.13 | 1.33 | 18,774 |
Canonical [e] | 0.05 | 2007 | 0.05 | 2061 | 0.06 | 1.28 | 3020 |
Non-canonical [f] | 0.05 | 2059 | 0.06 | 1966 | 0.07 | 1.29 | 2903 |
Other [g] | 0.09 | 8065 | 0.10 | 7926 | 0.11 | 1.35 | 12,851 |
All | 0.08 | 12,131 | 0.08 | 11,953 | 0.10 | 1.33 | 18,774 |
Carbon | |||||||
Cross-validated | 0.80 | 5554 | 0.80 | 5559 | 0.83 | 28.44 | 9642 |
Canonical [e] | 0.41 | 1040 | 0.41 | 1072 | 0.46 | 28.04 | 1630 |
Non-canonical [f] | 0.42 | 949 | 0.42 | 916 | 0.47 | 28.34 | 1526 |
Other [g] | 0.79 | 3565 | 0.81 | 3571 | 0.85 | 28.57 | 6486 |
All | 0.68 | 5554 | 0.69 | 5559 | 0.75 | 28.44 | 9642 |
Output from the support vector regression (SVR) analysis. The SVR is performed separately for each atom type; this table presents the values aggregated across all the hydrogen and carbon atoms used. The columns labeled rms give the root-mean-square deviation, that is, the square root of the mean of the squared deviations between predicted and experimental values, for all the data in the corresponding category. The rms values in the cross-validated rows are the output of the SVR program when performing a tenfold stratified cross-validation and are based on the data values in all categories. The other rms values are calculated on the indicated subset of data values. The columns labeled n give the number of data values used in the specified category. The column labeled sdev is the standard deviation of all the experimental hydrogen or carbon shifts in the corresponding category; it is reported only for the Automated-Plus section because this measure of dispersion is very similar for all three groups.
[b] Manual refers to analysis done using the attribute templates created by manual analysis and the shift datasets used in our previous analysis (Barton et al. 2013).
[c] Automated refers to analysis done using the mostly automated attribute generation described in this paper, applied to the same set of datasets as in our previous analysis.
[d] Automated-Plus refers to analysis done using the automated analysis described here and the new, larger set of datasets.
[e] Canonical bases are the central base in a 5-base stretch in which all 5 bases are in GC or AU base pairs and no other attributes, such as being in a triplet, a kissing interaction, or a pseudoknot, are present.
[f] Non-canonical bases are defined in the same way as canonical bases, except that the first and/or fifth base may be in a GU wobble base pair, mismatched, unpaired (e.g. in a loop), or not present (e.g. at the 5′ or 3′ terminus).
[g] Other bases are all bases that are in neither the canonical nor the non-canonical category.
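
The category definitions and the per-category rms, sdev, and n columns can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the code used in this work: the `Position` record, the pairing labels, and the function names are hypothetical; `None` stands for a base that is not present; the "no other attributes" condition is assumed to apply to every present base in the 5-base window; and sdev is taken as the population standard deviation, which the table notes do not specify. The shift values in the example are made up.

```python
import math
import statistics
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Sequence

WATSON_CRICK = "wc"  # a GC or AU base pair
# States allowed at the first/fifth position of a non-canonical window.
ALLOWED_END_STATES = {WATSON_CRICK, "gu_wobble", "mismatch", "unpaired"}


@dataclass(frozen=True)
class Position:
    """Hypothetical per-base record; not the paper's attribute format."""
    pairing: str                    # "wc", "gu_wobble", "mismatch", "unpaired"
    extra_attributes: bool = False  # triplet, kissing interaction, pseudoknot


def classify_central_base(window: Sequence[Optional[Position]]) -> str:
    """Classify the central base of a 5-base window (None = base not present)."""
    if len(window) != 5:
        raise ValueError("window must contain exactly 5 positions")
    # Assumption: no present base in the window may carry extra attributes.
    if any(p is not None and p.extra_attributes for p in window):
        return "other"
    # The three inner bases must be plain GC/AU pairs in both categories.
    if any(p is None or p.pairing != WATSON_CRICK for p in window[1:4]):
        return "other"
    ends = (window[0], window[4])
    if all(p is not None and p.pairing == WATSON_CRICK for p in ends):
        return "canonical"
    if all(p is None or p.pairing in ALLOWED_END_STATES for p in ends):
        return "non-canonical"
    return "other"


def rms_deviation(predicted, experimental):
    """Square root of the mean squared deviation between paired values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def summarize_by_category(records):
    """records: iterable of (category, predicted_shift, experimental_shift)."""
    grouped = defaultdict(lambda: ([], []))
    for category, pred, exp in records:
        grouped[category][0].append(pred)
        grouped[category][1].append(exp)
    return {
        cat: {
            "rms": rms_deviation(pred, exp),
            "sdev": statistics.pstdev(exp),  # population sdev (assumption)
            "n": len(exp),
        }
        for cat, (pred, exp) in grouped.items()
    }


# Example windows and made-up shift values.
wc = Position(WATSON_CRICK)
helix = [wc, wc, wc, wc, wc]
terminal = [None, wc, wc, wc, wc]
loop = [wc, wc, Position("unpaired"), wc, wc]

records = [
    (classify_central_base(helix), 7.9, 8.0),
    (classify_central_base(helix), 5.6, 5.5),
    (classify_central_base(terminal), 4.4, 4.5),
    (classify_central_base(loop), 8.3, 8.0),
]
summary = summarize_by_category(records)
for category, stats in summary.items():
    print(category, stats)
```

Returning "other" as soon as any inner position fails the GC/AU test keeps the three categories mutually exclusive, so the per-category n values sum to the n of the "All" row, as they do in the table.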