2012 Mar 13;40(13):5832–5847. doi: 10.1093/nar/gks206

Table 2.

DRIMUST – variable gapped motif search algorithm

The algorithm is schematically described in Figure 1.

Input:
- A ranked list of sequences S₁, … , S_N
- Parameters: a, the length of the first half site; b, the length of the second half site; and the maximum gap length.
- P-value threshold for reporting (τ)
Output:
- A list of sequence motifs of the form H₁-N^Λ-H₂, where Λ is a set of gaps. These reported motifs are rank imbalanced in S₁, … , S_N at an mHG significance level better than τ.
- /* The above motif representation is interpreted as follows. A motif is viewed as a set of strings: here, all strings that start with H₁, continue with a wildcard gap of any of the lengths in the set of gaps Λ, and end with H₂. For example, the motif GCC-N^{1,5}-ATG represents the strings GCCNATG and GCCN⁵ATG */
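
The motif representation above can be illustrated with a short sketch. The helper names `expand_motif` and `motif_occurs` are hypothetical and not part of DRIMUST; the wildcard is written as `N` when expanding, as in the example above, and matched as any single character when searching.

```python
import re

def expand_motif(h1, gaps, h2):
    """Return the strings represented by h1-N^gaps-h2, writing wildcards as 'N'."""
    return [h1 + "N" * gap + h2 for gap in sorted(gaps)]

def motif_occurs(h1, gaps, h2, sequence):
    """Return True if some string of the motif h1-N^gaps-h2 occurs in sequence."""
    for gap in gaps:
        # Each wildcard position matches exactly one arbitrary character.
        pattern = re.escape(h1) + "." * gap + re.escape(h2)
        if re.search(pattern, sequence):
            return True
    return False

print(expand_motif("GCC", {1, 5}, "ATG"))  # ['GCCNATG', 'GCCNNNNNATG']
```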

Preprocessing:
- Construct a generalized suffix tree for S₁, … , S_N such that:
  - All suffixes of all sequences S₁, … , S_N are represented by paths from the root to leaves in the tree.
  - Each leaf contains information about the occurrences of the corresponding suffix w in S₁, … , S_N. This information is represented as a list m₁(w), … , m_{N(w)}(w), where the m_i(w) are the indices, amongst 1, … , N, of the sequences in which w occurs.
- /* The construction is implemented using Ukkonen's algorithm (41) */
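
The kind of occurrence information kept in the tree can be sketched with a plain dictionary. This is a naive stand-in for illustration only, not a suffix tree and not Ukkonen's algorithm; the function name `build_occurrence_index` and the restriction to substrings of one fixed length are assumptions made for the sketch.

```python
from collections import defaultdict

def build_occurrence_index(sequences, length):
    """Map each substring of the given length to the sorted 1-based
    indices of the sequences in which it occurs."""
    index = defaultdict(set)
    for i, seq in enumerate(sequences, start=1):
        for start in range(len(seq) - length + 1):
            index[seq[start:start + length]].add(i)
    return {word: sorted(ids) for word, ids in index.items()}

index = build_occurrence_index(["ACGT", "CGTA", "TTTT"], 2)
print(index["CG"])  # [1, 2]
```

A real generalized suffix tree answers the same question for every substring length at once and in linear space, which is why the algorithm relies on it rather than on a per-length table like this one.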

Algorithm:
- Traverse the tree to find paths of length a, and for each path P do:
  - Compute the set of all strings σ₁(P), … , σ_{N(P)}(P) of length (maximum gap + b) that start at the position where P ends, in every sequence among S₁, … , S_N in which P occurs. This step is implemented by traversing the subtree rooted at P. /* The strings σ_i(P) are typically of length (maximum gap + b). When P occurs close to the end of S_i, a shorter string is taken into the above set */
  - Construct a generalized suffix tree T(P) for σ₁(P), … , σ_{N(P)}(P) such that:
    
    □ All suffixes of all sequences σ₁(P), … , σ_{N(P)}(P) are represented by paths from the root to leaves in the tree.
    
    □ Each leaf contains information about the occurrences of the corresponding suffix u in σ₁(P), … , σ_{N(P)}(P) as well as the positions of these occurrences. This information is represented as a list of pairs:
    
    (m₁(u), t₁(u)), … , (m_{N(u)}(u), t_{N(u)}(u)), where the m_i(u) are the indices of the sequences in which u occurs, and each value t_i(u) represents the starting position of u within σ_{m_i(u)}.
  - Traverse T(P) at depth b. For each such path Q, calculate the enrichment of all possible motifs of the form P-N^Λ-Q, where Λ is any subset of the allowed gap lengths (up to the maximum gap), using the following process: /* This step uses the suffix tree information to avoid searching over all possible instantiations of Λ, leading to improved efficiency of the algorithm */
    
    □ Use the values t_i stored in the leaves of the subtree rooted below Q to infer the set of all gaps λ_i for which a string P-N^λ_i-Q occurs.
    
    /* These gaps are obtained from the starting positions t_i(Qα), where α ranges over all substrings for which Qα is a suffix in T(P) */
    
    □ For every subset Λ of the gaps inferred above do: /* this is efficient when the set of inferred gaps is small */
    
    □ Infer an ordered list of all indices in the original list S₁, … , S_N at which a string of the form P-N^λ-Q, where λ∈Λ, occurs.
    
    □ Use this list to compute the mHG score for P-N^Λ-Q.
    
    □ Report P-N^Λ-Q if its mHG significance level is better than τ. /* (20) */
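
The inner loop of the algorithm can be sketched for a single pair of half sites as follows. This is a minimal illustration under stated assumptions, not the DRIMUST implementation: it scans the sequences directly instead of using suffix trees, it assumes gaps range over 0, … , max_gap, it takes the mHG score to be the minimum over all list cutoffs n of the hypergeometric tail probability, and it omits the P-value correction needed for comparison against τ. All function names are hypothetical.

```python
from itertools import combinations
from math import comb

def observed_gaps(p, q, sequences, max_gap):
    """Map each gap in 0..max_gap to the set of 1-based indices of
    sequences containing p, then that many wildcards, then q."""
    hits = {}
    for i, seq in enumerate(sequences, start=1):
        for start in range(len(seq) - len(p) + 1):
            if not seq.startswith(p, start):
                continue
            after_p = start + len(p)
            for gap in range(max_gap + 1):
                if seq.startswith(q, after_p + gap):
                    hits.setdefault(gap, set()).add(i)
    return hits

def hypergeometric_tail(N, B, n, b):
    """P(X >= b) when drawing n of N items, B of which are successes."""
    favorable = sum(comb(n, i) * comb(N - n, B - i)
                    for i in range(b, min(n, B) + 1))
    return favorable / comb(N, B)

def mhg_score(occurrence_indices, N):
    """Minimum hypergeometric tail over all cutoffs n = 1..N of the ranked list."""
    present = set(occurrence_indices)
    B = len(present)
    best, b = 1.0, 0
    for n in range(1, N + 1):
        if n in present:
            b += 1
        best = min(best, hypergeometric_tail(N, B, n, b))
    return best

def best_gapped_motif(p, q, sequences, max_gap):
    """Score every non-empty subset of the observed gaps; return (score, gaps)."""
    hits = observed_gaps(p, q, sequences, max_gap)
    N = len(sequences)
    results = []
    for size in range(1, len(hits) + 1):
        for subset in combinations(sorted(hits), size):
            indices = set().union(*(hits[g] for g in subset))
            results.append((mhg_score(indices, N), subset))
    return min(results) if results else None

ranked = ["GCCAATG", "GCCAAAAAATG", "TTTTTTT", "CCCCCCC"]
print(best_gapped_motif("GCC", "ATG", ranked, 5))
```

Only gaps that actually occur are enumerated, which mirrors the idea in the table of using the stored positions to avoid trying every possible Λ. The number of subsets still grows exponentially with the number of observed gaps, so the sketch is only practical when that number is small.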