This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
A list of sequence motifs of lengths between k1 and k2 that are rank imbalanced in the ranked list of input sequences at an mHG significance level better than τ.
Preprocessing:
Construct a generalized suffix tree for the input sequences such that:
All suffixes of all sequences are represented by paths from the root to leaves in the tree.
Each leaf contains information about the occurrences of the corresponding suffix w in the input sequences. This information is represented as a list of values mi(w), which are the indices (ranks), amongst the input sequences, of the sequences in which w occurs.
/* The construction is implemented using Ukkonen's algorithm (41) */
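The preprocessing step can be illustrated with a much simpler structure than a suffix tree. The sketch below is not Ukkonen's algorithm and is not the authors' implementation: it is a naive dictionary index that stores, for every substring of length k1 to k2, the sorted ranks of the sequences containing it. It uses far more time and memory than a generalized suffix tree, but it answers the same query. The function name `build_occurrence_index` and the toy sequences are hypothetical.

```python
from collections import defaultdict

def build_occurrence_index(sequences, k1, k2):
    """Naive stand-in for the generalized suffix tree (assumption, not Ukkonen).

    `sequences` is the ranked input list; ranks are 1-based.
    Returns a dict mapping each substring of length k1..k2 to the sorted
    list of ranks of the sequences in which it occurs.
    """
    index = defaultdict(set)
    for rank, seq in enumerate(sequences, start=1):
        for k in range(k1, k2 + 1):
            for start in range(len(seq) - k + 1):
                # A set ensures each sequence is counted once per motif.
                index[seq[start:start + k]].add(rank)
    return {motif: sorted(ranks) for motif, ranks in index.items()}

# Toy input: three short ranked sequences.
example = build_occurrence_index(["ACGT", "CGTA", "TTTT"], 2, 3)
```

Here `example["CG"]` lists the ranks of the sequences that contain `CG`, which plays the role of the per-leaf occurrence lists described above.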
Algorithm:
for k = k1 to k2 do:
Traverse the tree to find paths of length k, and for each path P calculate P's enrichment using the following process:
Get the ordered list C of indices (ranks) of sequences containing P, extracted from the leaves of the subtree rooted below P. /* The sequences containing P are given by the union of the lists of all leaves of that subtree, as P is a prefix of all the suffixes represented by these leaves. For example, assuming P appears in S8, S14, S31 and S36, then C = {8,14,31,36} */
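Calculate the mHG (minimum hypergeometric) score for P from the list C. /* Following the example above and assuming we have 100 sequences in the input: in this case the minimum is attained at i = 4, that is, at the fourth occurrence of P (rank 36). */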
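The scoring step for the example can be sketched as follows. This assumes that the mHG score is the minimum, over rank cutoffs, of the hypergeometric tail probability of seeing at least the observed number of occurrences above the cutoff. It computes the raw score only and does not include any correction needed to turn it into a significance level comparable with τ. The function names `hgt` and `mhg_score` are hypothetical.

```python
from math import comb

def hgt(N, B, n, b):
    """Hypergeometric tail P(X >= b): N sequences in total, B of them
    contain the motif, and the top n ranked sequences are inspected."""
    upper = min(n, B)
    favourable = sum(comb(B, j) * comb(N - B, n - j) for j in range(b, upper + 1))
    return favourable / comb(N, n)

def mhg_score(ranks, N):
    """Return (score, i): the minimum tail probability and the 1-based
    index i of the occurrence at which it is attained.

    Only cutoffs at occurrence ranks are tried: for a fixed count b the
    tail probability grows with n, so the minimum lies at one of them.
    """
    ranks = sorted(ranks)
    B = len(ranks)
    return min((hgt(N, B, n, i), i) for i, n in enumerate(ranks, start=1))

# Example from the text: P occurs in S8, S14, S31 and S36 out of 100 sequences.
score, i = mhg_score([8, 14, 31, 36], 100)
print(f"mHG score = {score:.4f}, attained at i = {i}")
```

For this input the minimum is reached at the fourth occurrence, with a score of roughly 0.015, which matches the cutoff i = 4 stated in the example.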