2019 Mar 23;35(20):3944–3952. doi: 10.1093/bioinformatics/btz198

Table 1.

Suffix array peak positions with ${\hat{y}}_{i} \geq \theta$

[Table body supplied as an image in the original: bioinformatics_35_20_3944_f5.jpg]

Note: Illustration of the motif selection process (Section 2.1) applied to simulated data (using kernel half-width $\kappa = 4$). All positions for which the sequence-smoothed score ${\hat{y}}_{i} \geq \theta = 1$ are shown; the table is sorted in descending order of the estimated motif length ${\hat{k}}_{i}$. Columns give the values of key variables for the suffix associated with the corresponding peak: ($i$) the suffix array index $i$, giving the position of the suffix in the lexicographically sorted list of all suffixes; ($s_{i}$) the suffix array value $s_{i}$, giving the spatial position of the suffix in the concatenated sequence $x$; (${\hat{y}}_{i}$) the kernel-smoothed score ${\hat{y}}_{i}$ (Equation 4); (${\hat{k}}_{i}$) the estimated length ${\hat{k}}_{i}$ (Equation 7) of the conserved $\lfloor {\hat{k}}_{i} \rceil$-mer prefix of the suffixes within the smoothing window centered on suffix array index $i$; ($x[s_{i}, s_{i} + \lfloor {\hat{k}}_{i} \rceil)$) the corresponding conserved $\lfloor {\hat{k}}_{i} \rceil$-mer $x[s_{i}, s_{i} + \lfloor {\hat{k}}_{i} \rceil)$ (Equation 9); ($b_{i}$) the input sequence $b_{i}$ (Equation 3) from which the suffix is derived; ($\omega_{i}$) the spatial position $\omega_{i}$ at which the suffix is found within sequence $b_{i}$; and ($g_{i}$) the Gini impurity $g_{i}$ (Equation S4) for the smoothing window centered at $i$. Each of these peaks corresponds to a suffix starting within the first three characters of an instance of the embedded motif CATACTGAGA. Gold highlighting indicates peaks starting at the first character of the embedded motif, silver at the second and bronze at the third.
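The relationship between the index columns described in the note can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: the function names (`concatenate`, `build_suffix_array`, `locate`, `gini_impurity`, `window_gini`), the `$` separator, and the three example sequences are all hypothetical, and the Gini impurity is computed in its standard unweighted form (one minus the sum of squared label frequencies) over a window of suffix array indices, which may differ in detail from Equation S4. The kernel-smoothed score (Equation 4) and the length estimate (Equation 7) are not reproduced here.

```python
from bisect import bisect_right
from collections import Counter


def concatenate(seqs, sep="$"):
    """Join sequences into one string x and record where each one starts.

    Assumes the separator does not occur in any input sequence.
    """
    starts, pos = [], 0
    for seq in seqs:
        starts.append(pos)
        pos += len(seq) + len(sep)
    return sep.join(seqs) + sep, starts


def build_suffix_array(x):
    """Naive suffix array: start positions sorted by the suffix they begin.

    Quadratic-ish cost; acceptable for a toy example only.
    """
    return sorted(range(len(x)), key=lambda s: x[s:])


def locate(s, starts):
    """Map position s in x to (sequence index b, offset omega within it)."""
    b = bisect_right(starts, s) - 1
    return b, s - starts[b]


def gini_impurity(labels):
    """Standard Gini impurity: 1 minus the sum of squared label frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def window_gini(sa, starts, i, kappa):
    """Gini impurity of source-sequence labels in the window [i - kappa, i + kappa]."""
    lo, hi = max(0, i - kappa), min(len(sa), i + kappa + 1)
    return gini_impurity([locate(sa[j], starts)[0] for j in range(lo, hi)])


# Hypothetical toy input: three sequences, each containing the motif once.
seqs = ["GGCATACTGAGATT", "ACATACTGAGAC", "TTTCATACTGAGA"]
motif = "CATACTGAGA"

x, starts = concatenate(seqs)
sa = build_suffix_array(x)

# Suffix array indices i whose suffix begins with the motif.
hits = [i for i, s in enumerate(sa) if x.startswith(motif, s)]

for i in hits:
    b, omega = locate(sa[i], starts)
    g = window_gini(sa, starts, i, kappa=1)
    print(f"i={i} s_i={sa[i]} b_i={b} omega_i={omega} g_i={g:.3f}")
```

Suffixes that share a prefix occupy consecutive suffix array indices, so a window centered on one motif occurrence tends to contain the other occurrences as well. A high impurity over such a window then indicates that the shared prefix is drawn from several different input sequences rather than repeated within one.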