2012 Mar 13;40(13):5832–5847. doi: 10.1093/nar/gks206

Table 2.

DRIMUST – variable gapped motif search algorithm

The algorithm is schematically described in Figure 1.
  • Input:
    • A ranked list of sequences S1, … , SN
    • Parameters a, b and the maximum gap, where a represents the length of the first half site and b represents the length of the second half site.
    • P-value threshold for reporting (τ)
  • Output:
    • A list of sequence motifs of the form H1-NΛ-H2, where Λ is a set of gaps. These reported motifs are rank imbalanced in S1, … , SN at an mHG significance level better than τ.
    • /* The motif representation above is interpreted as follows. A motif is viewed as a set of strings: all strings that start with H1, continue with a wildcard gap whose length is any member of the gap set Λ, and end with H2. For example, the motif GCC-N1,5-ATG represents the strings GCCNATG and GCCN5ATG (that is, GCCNNNNNATG), where N stands for any nucleotide */
  • Preprocessing:
    • Construct a generalized suffix tree for S1, … , SN such that:
      • All suffixes of all sequences S1, … , SN are represented by paths from the root to leaves in the tree.
      • Each leaf contains information about the occurrences of the corresponding suffix w in S1, … , SN. This information is represented as a list m1(w), … , mN(w)(w), where mi(w) are the indices, amongst 1, … , N, of the sequences at which w occurs.
    • /* The construction is implemented using Ukkonen's algorithm (41) */
  • Algorithm:
    • Traverse the tree to find paths of length a, and for each path P do:
      • Compute the set of all strings σ1(P), … , σN(P)(P) that start at the position where P ends, in every sequence among S1, … , SN in which P occurs. Each string has length equal to the maximum gap plus b, so that it can contain the second half site at any allowed gap. This step is implemented by traversing the subtree rooted at P. /* When P occurs close to the end of Si, a shorter string is taken into the above set */
      • Construct a generalized suffix tree T(P) for σ1(P), … , σN(P)(P) such that:
        • □ All suffixes of all sequences σ1(P), … , σN(P)(P) are represented by paths from the root to leaves in the tree.
        • □ Each leaf contains information about the occurrences of the corresponding suffix u in σ1(P), … , σN(P)(P) as well as the positions of these occurrences. This information is represented as a list of pairs:
          (m1(u), t1(u)), … , (mN(u)(u), tN(u)(u))
          where mi(u) are the indices of the sequences among σ1(P), … , σN(P)(P) in which u occurs, and each value ti(u) represents the starting position of u within σmi(u).
      • Traverse T(P) to find paths of length b. For each such path Q, calculate the enrichment of all possible motifs of the form P-NΛ-Q, where Λ is any subset of the allowed gap lengths, using the following process: /* This step uses the suffix tree information to avoid searching over all possible instantiations of Λ, whose number grows exponentially with the maximum gap, leading to improved efficiency of the algorithm */
        • □ Use the values ti that are in the leaves of the subtree rooted below Q to infer the set of observed gaps, that is, all gaps λi for which a string P-Nλi-Q occurs.
        • /* These gaps are obtained from the values ti(Qα), where α ranges over all substrings for which Qα is a suffix in T(P) */
        • □ For every subset Λ of the observed gaps do: /* this is efficient when the number of observed gaps is small */
          •  □ Infer an ordered list of all indices in the original list S1, … , SN at which a string of the form P-Nλ-Q, where λ∈Λ, occurs.
          •  □ Use this list to compute the mHG score for P-NΛ-Q:
            [equation: mHG score of P-NΛ-Q, graphic gks206um2.jpg]
          •  □ Report P-NΛ-Q if its mHG significance level is better than τ. /* see (20) */
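
The steps in the table can be illustrated with a minimal Python sketch. It is not the DRIMUST implementation and rests on several assumptions: plain string scanning replaces both suffix trees, gaps are assumed to range from 0 to the maximum gap inclusive, the score uses the standard minimum-hypergeometric definition (the minimum, over all prefixes of the ranked list, of the hypergeometric tail probability), and the raw score is compared with τ directly rather than through the significance assessment referenced in (20). The names `hg_tail`, `mhg_score`, `observed_gaps` and `report_motifs` are hypothetical.

```python
from itertools import combinations
from math import comb


def hg_tail(N, B, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, B, n)."""
    top = sum(comb(B, i) * comb(N - B, n - i) for i in range(k, min(n, B) + 1))
    return top / comb(N, n)


def mhg_score(occ):
    """Minimum over all list prefixes of the hypergeometric tail (standard mHG)."""
    N, B = len(occ), sum(occ)
    hits, best = 0, 1.0
    for n, flag in enumerate(occ, start=1):
        hits += flag
        best = min(best, hg_tail(N, B, n, hits))
    return best


def observed_gaps(seqs, P, Q, max_gap):
    """Map each gap g to the indices of sequences that contain P, g wildcards, Q."""
    gaps = {}
    for idx, s in enumerate(seqs):
        start = s.find(P)
        while start != -1:
            # Text following this occurrence of P: at most max_gap + len(Q) characters.
            tail = s[start + len(P): start + len(P) + max_gap + len(Q)]
            for g in range(max_gap + 1):  # assumption: gaps 0..max_gap
                if tail.startswith(Q, g):
                    gaps.setdefault(g, set()).add(idx)
            start = s.find(P, start + 1)
    return gaps


def report_motifs(seqs, P, Q, max_gap, tau):
    """Score every non-empty subset of the observed gaps; keep scores below tau."""
    gaps = observed_gaps(seqs, P, Q, max_gap)
    found = []
    for r in range(1, len(gaps) + 1):
        for lam in combinations(sorted(gaps), r):
            # Sequences in which P-N(g)-Q occurs for at least one g in lam.
            hit = set().union(*(gaps[g] for g in lam))
            occ = [1 if i in hit else 0 for i in range(len(seqs))]
            score = mhg_score(occ)
            if score < tau:  # simplification: raw score instead of a corrected p-value
                found.append((f"{P}-N{','.join(map(str, lam))}-{Q}", score))
    return sorted(found, key=lambda item: item[1])


# Toy ranked list (made up for illustration): the motif occurs only near the top.
ranked = [
    "TTGCCAATGTT",    # GCC, gap 1, ATG
    "AGCCTTTTTATGA",  # GCC, gap 5, ATG
    "GCCCATGAAAA",    # GCC, gap 1, ATG
    "ACGTACGTACGT",
    "TTTTTTTTTTTT",
    "CAGTCAGTCAGT",
    "GGGGAAAACCCC",
    "ATATATATATAT",
]
hits = report_motifs(ranked, "GCC", "ATG", max_gap=5, tau=0.05)
print(hits)
```

On this toy list only the combined gap set {1, 5} passes the threshold, because neither gap alone covers all three top-ranked sequences. Enumerating subsets of the observed gaps, rather than of every gap up to the maximum, is what keeps the inner loop small when few distinct gaps occur.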