Table 1. Benchmark dataset statistics and training alignment information content.
Family | N_train | N_test | Consensus length | 1° info (bits) | 2° info (bits) | 3° info (bits) |
---|---|---|---|---|---|---|
tRNA | 1357 | 30 | 72 | 40.3 | 26.7 | 1.2 |
Twister | 1005 | 40 | 65 | 57.5 | 4.5 | 1.3 |
SAM | 192 | 8 | 108 | 121.8 | 11.1 | 0.0 |
Hand-curated MSAs are split into training and test sets following [45]. For each training MSA, the information content of the primary sequence (in bits) is calculated as in [39], while the information in secondary structure (nested base pairs) and tertiary structure (all other disjoint pairwise interactions between sites) is estimated using mutual information [6]. Each family’s consensus structure is inferred by running CaCoFold and R-scape on the training alignment [41, 42]. Although R-scape identifies no tertiary structure in the SAM riboswitch training alignment (hence the 0.0 bits of 3° information), a four-base-pair pseudoknot has been observed experimentally [46]. The missed pseudoknot is specific to our SAM training alignment: R-scape does predict it when analyzing the RF00162 seed alignment.
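To make the two kinds of quantities in Table 1 concrete, the following is a minimal sketch of per-column information content and pairwise mutual information on a toy alignment. It is not the procedure of [39] or [6]: it assumes a uniform background over A, C, G, U, ignores gaps, and applies no small-sample correction. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGU"  # assumption: RNA alphabet, uniform background


def column_information(column, alphabet=ALPHABET):
    """Information content of one column in bits: log2(|alphabet|) - entropy."""
    counts = Counter(c for c in column if c in alphabet)  # gaps are skipped
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(len(alphabet)) - entropy


def mutual_information(col_i, col_j, alphabet=ALPHABET):
    """Mutual information in bits between two alignment columns."""
    pairs = [(a, b) for a, b in zip(col_i, col_j)
             if a in alphabet and b in alphabet]
    total = len(pairs)
    if total == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), n in joint.items():
        p_ab = n / total
        # p_ab / (p_a * p_b) simplifies to n * total / (left[a] * right[b])
        mi += p_ab * math.log2(n * total / (left[a] * right[b]))
    return mi


# Toy alignment (hypothetical): columns 0 and 2 are invariant,
# columns 1 and 3 covary as Watson-Crick pairs.
msa = ["GCAG", "GGAC", "GAAU", "GUAA"]
columns = ["".join(seq[k] for seq in msa) for k in range(len(msa[0]))]

primary_bits = sum(column_information(col) for col in columns)
paired_bits = mutual_information(columns[1], columns[3])
```

In this toy case the invariant columns carry all of the primary-sequence information, while the covarying pair carries none at the single-column level and shows up only in the mutual information term. This mirrors how a family such as tRNA can have substantial 2° information alongside its 1° information.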