PIDA: A new algorithm for pattern identification

C Putonti; BM Pettitt; JG Reid; Y Fofanov

Author manuscript; available in PMC: 2009 Oct 14.

Published in final edited form as: Online J Bioinform. 2007 Jan 1;8(1):30–40.


PMCID: PMC2761635 NIHMSID: NIHMS77407 PMID: 19834570

Abstract

Algorithms for motif identification in sequence space have predominantly been focused on recognizing patterns of a fixed length containing regions of perfect conservation with possible regions of unconstrained sequence. Such motifs can be found in everything from proteins with distinct active sites to non-coding RNAs with specific structural elements that are necessary to maintain functionality. In the event that an insertion/deletion has occurred within an unconstrained portion of the pattern, it is possible that the pattern retains its functionality. In such a case the length of the pattern is now variable, and the pattern may be overlooked when utilizing existing motif detection methods. The Pattern Island Detection Algorithm (PIDA) presented here has been developed to recognize patterns that have occurrences of varying length within sequences over an alphabet of any size. PIDA works by identifying all regions of perfect conservation (for lengths longer than a user-specified threshold), and then builds those conservation “islands” into fixed-length patterns. Next, the algorithm modifies these fixed-length patterns by identifying additional (and different) islands that can be incorporated into each pattern through insertions/deletions within the “water” separating the islands. To provide some benchmarks for this analysis, PIDA was used to search for patterns within randomly generated sequences as well as sequences known to contain conserved patterns. For each of the patterns found, the statistical significance is calculated based upon the pattern’s likelihood to appear by chance, thus providing a means to determine those patterns which are likely to have a functional role. The PIDA approach to motif finding is designed to perform best when searching for patterns of variable length, although it is also able to identify patterns of a fixed length. PIDA has been created to be as generally applicable as possible since there are a variety of sequence problems of this type.
The algorithm was implemented in C++ and is freely available upon request from the authors.

Keywords: pattern discovery, motif conservation, variable length patterns

INTRODUCTION

There are a variety of different biomolecules which have long-range intramolecular correlations that are essential to their function. For example, proteins with separate (but related) active sites, non-coding RNAs that require specific structural elements for functionality, and DNA carrying sequence signals read by transcription factors all include related information separated by regions which serve no related function (other than to connect the correlated, but disparate, elements). In principle, finding such elements in sequence space (whether that be DNA or protein sequence space) is the same problem, namely one of finding patterns defined by regions of strong sequence conservation separated by regions which have evolved under the absence of or reduced selective pressure.

Many different algorithms and applications have been developed to search for pattern (motif) conservation in order to predict conservation of function within RNA (see for example, Pavesi et al., 2004; Anwar et al., 2006; Liu et al., 2005; Yao et al., 2006; and Ji et al., 2004), DNA (see for example, Tompa et al., 2005; and Bailey and Elkan, 1994) and amino acid sequences (see for example, Bailey and Elkan, 1994; and Neuwald et al., 1995; and Xing et al., 2004). The predominant application for motif discovery, however, has been identification of protein binding sites in the DNA sequence. For binding site recognition alone, a variety of different approaches have been taken, including alignment algorithms (see for example, McGuire et al., 2000; McCue et al. 2001; and Roth et al., 1998), weighted matrices (see for example, Hertz and Stormo, 1999; Li et al., 2002; Mwangi and Siggia, 2003; Quandt et al., 1995; and Lenhard and Wasserman, 2002), Markov chain models (see for example, Sinha and Tompa, 2000; and Yada et al., 1998), and syntax trees (see for example, Marson and Sagot, 2000; and Vanet et al., 2000), and many tools are publicly available, including MEME (Bailey and Elkan, 1994), Gibbs sampling (Neuwald et al., 1995), Consensus (Hertz and Stormo, 1999), Block Maker (Henikoff et al., 1995) and TEIRESIAS (Rigoutsos and Floratos, 1998). Regardless of which method is employed or the application to which the algorithm is being applied, each is developed with a specific definition of a pattern in mind. The pattern can be considered as one word encompassing the conserved subsequence(s) or a collection of words in which each word is a conserved subsequence. The flexibility permitted within these patterns, namely the flexibility in non-conserved regions or gaps, varies amongst the algorithms. The majority of the aforementioned algorithms, however, do not consider flexible gaps as it is computationally more expensive than only considering fixed-length, rigid patterns.

Herein the Pattern Island Detection Algorithm (PIDA) is presented, a new method to identify patterns within sequences containing “islands” or subsequences of perfect sequence conservation separated by “water” or intervening regions of arbitrarily long unmatched sequence. PIDA works by comparing at least two sequences of any alphabet and identifying patterns which contain islands of matching sequence separated by arbitrary amounts of unmatched sequence, or water. The identification of these patterns is a two-step process. First, PIDA creates the set of initial patterns containing all islands shared by the sequences being compared. Next, PIDA uses this initial set to generate the set of derivative patterns with the aid of a “floating” operation (the operation which allows for flexibility in the number of gaps between islands). PIDA executes the search for patterns based upon parameter values specified by the user. Unique to PIDA is the ability to search for and identify patterns which may differ in length between their occurrences in sequences. In other words, PIDA is able to accommodate flexible gaps and still recognize patterns having the same conserved islands even if they vary in their relative distance from each other (size of water) from one sequence to another, accommodating insertions/deletions within the unconserved water regions of the pattern. (Two patterns containing the same set of islands are said to be the same pattern if the length of the water between the islands is within a user-defined range.) While PIDA is effective at finding fixed-length (and/or contiguous) patterns, algorithms designed specifically for that problem will almost certainly outperform PIDA given that it is their raison d’être.

In order to provide benchmarks for PIDA’s performance, pattern searches were conducted for both randomly generated sequences and sequences known to contain conserved patterns. The significance of the appearance of patterns within the sequences can be assessed based upon the likelihood of the pattern to appear by chance.

METHODS

Figure 1 illustrates the two-step process PIDA follows to identify patterns. While PIDA can search for patterns in sequences of letters from any alphabet, the DNA four-character alphabet will be used as an example. In the comparison of the two sequences in Figure 1(a), two islands are identified: AGCCGA (I₁) and TGGGC (I₂), separated by nine bases. Thus, the set of initial patterns includes a pattern that contains islands I₁ and I₂. Additionally, the set of initial patterns contains a pattern consisting of a single island I₃. This island is separated from I₂ by five bases in the first sequence and seven bases in the second. By floating I₃ two bases in the first sequence, a pattern containing three islands is identifiable in both sequences as shown in Figure 1(b). Likewise, a pattern containing these three islands can be identified by removing two of the bases from the water between I₂ and I₃ in the second sequence of Figure 1. This new pattern, which includes islands I₁, I₂, and I₃, is contained within the set of derivative patterns.

Figure 1. Example of a pattern with DNA sequence matched regions (islands) in green and intervening unmatched sequence (water) in blue.

PIDA executes the search for patterns based upon parameter values specified by the user: the maximum number of mutations or insertions/deletions allowed (f for floating), the minimum length of an island (l) and the minimum number of islands needed to make a pattern (s). Parameter l, the minimum length of an island, is strictly enforced when creating both the set of initial patterns and the set of derivative patterns. The minimum number of islands s, however, is not enforced when creating the set of initial patterns as smaller patterns having less than s islands may be part of a larger pattern as a result of floating. Thus, it is not until the set of derivative patterns is formed that those patterns containing too few islands are removed from the set.
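The roles of the three user parameters can be summarized in a short sketch. The class and function names below (Params, keep_island, final_patterns) are hypothetical and are not taken from the PIDA implementation; patterns are represented simply as lists of islands, which is an assumption made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    f: int  # maximum number of insertions/deletions allowed (floating)
    l: int  # minimum length of an island
    s: int  # minimum number of islands needed to make a pattern


def keep_island(length: int, p: Params) -> bool:
    """l is enforced for both the initial and the derivative set."""
    return length >= p.l


def final_patterns(derivative_patterns, p: Params):
    """s is enforced only once the derivative set has been formed."""
    return [pat for pat in derivative_patterns if len(pat) >= p.s]


if __name__ == "__main__":
    p = Params(f=1, l=4, s=2)
    single = [(0, 2, 6)]               # one island: kept in the initial set
    double = [(0, 2, 6), (15, 17, 5)]  # two islands
    print(final_patterns([single, double], p))
```

The single-island pattern survives the initial stage because it might still be joined to another pattern by floating; it is discarded only at the end if it remains below s islands.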

The Island Data Structure

A pattern is dynamically created as an array of instances of the Island object, a new data structure designed to store information about the conserved region. A bit-wise representation is used to represent the conserved sequence of length n or n-mer rather than the actual characters from the sequence in an effort to minimize memory usage. Another benefit of utilizing the bit-wise representation is that comparisons between patterns can be achieved by using bit-shifting operations. For each island three positions are stored: the starting point of the island in the first sequence (ap), the starting point of the island in the second sequence (bp), and the relative position of the island or the distance of this particular island from the starting point of the first island (rp). Positions ap and bp are essential for creating the set of derived patterns. Storing the relative position eliminates the need to store any information about the water separating the islands; if the island is the first island in the pattern the relative position is zero. The class Pattern was created to associate all of the Island objects belonging to the same pattern as well as the number of islands within the pattern.
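A minimal sketch of such a data structure is given below, written in Python rather than the C++ of the actual implementation. The two-bits-per-base encoding, the explicit length field, and the helper names (encode, prefix, make_pattern) are assumptions made for illustration; the text above specifies only that a bit-wise representation and the positions ap, bp and rp are stored.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed 2-bit code for the DNA alphabet (not specified in the text).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(nmer: str) -> int:
    """Pack an n-mer into an integer, two bits per base."""
    bits = 0
    for base in nmer:
        bits = (bits << 2) | CODE[base]
    return bits


def prefix(bits: int, n: int, k: int) -> int:
    """First k bases of an encoded n-mer, obtained by bit-shifting."""
    return bits >> (2 * (n - k))


@dataclass
class Island:
    bits: int    # bit-wise representation of the conserved n-mer
    length: int  # n (stored here because leading A's encode as zero bits)
    ap: int      # starting point in the first sequence
    bp: int      # starting point in the second sequence
    rp: int      # distance from the start of the first island


@dataclass
class Pattern:
    islands: List[Island] = field(default_factory=list)

    @property
    def count(self) -> int:
        return len(self.islands)


def make_pattern(a: str, hits: List[Tuple[int, int, int]]) -> Pattern:
    """Build a Pattern from (ap, bp, length) triples sorted by ap."""
    first_ap = hits[0][0]
    return Pattern([Island(encode(a[ap:ap + n]), n, ap, bp, ap - first_ap)
                    for ap, bp, n in hits])


if __name__ == "__main__":
    # Islands of Figure 1 (AGCCGA and TGGGC, nine bases apart);
    # the water characters and the bp values are made up.
    a = "AGCCGA" + "T" * 9 + "TGGGC"
    pat = make_pattern(a, [(0, 2, 6), (15, 17, 5)])
    print(pat.count, [isl.rp for isl in pat.islands])
```

Because rp is measured from the first island, the nine bases of water between the two islands never need to be stored: the second island simply has rp = 6 + 9 = 15.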

The Set of Initial Patterns

To build the collection of patterns conserved between two sequence sets, PIDA starts by creating a set of initial patterns (S_i). The algorithm first identifies the patterns shared by the sequences through base-by-base comparisons. To illustrate this process, consider two sequences, A (of size m_a) and B (of size m_b). To acquire the elements of S_i, PIDA must perform m_a+m_b comparisons of overlapping sequences, each of which may identify a significant pattern. This process is accomplished by “shifting” B along A to find the correlations. At each shift, PIDA searches for the longest pattern present within the two sequences. Figure 2 illustrates the case when three islands, I₁, I₂ and I₃, are located in the sequence overlap area (l_q). In this example, PIDA would identify the pattern containing I₁, I₂ and I₃ and would not consider its subpatterns (I₁; I₂; I₃; I₁ and I₂; I₁ and I₃; and I₂ and I₃) as patterns. Thus, only the larger pattern of I₁, I₂ and I₃ is stored within the set of initial patterns.

Figure 2. Sequence B has “shifted” q positions relative to its initial alignment. The black bars labelled I₁, I₂, and I₃ denote three islands shared by the two sequences.
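The shifting procedure can be sketched as follows. This is an illustrative Python version under simplifying assumptions: sequences are compared character by character rather than through the bit-wise representation, islands are reported as (ap, bp, length) triples, and the function name initial_patterns is hypothetical. The example sequences reuse the two islands of Figure 1 with made-up water.

```python
def initial_patterns(a: str, b: str, l: int) -> dict:
    """Map each shift (ap - bp) to the list of islands found at that shift.

    All islands of length >= l found in one overlap are kept together,
    so only the largest pattern per shift is stored.
    """
    patterns = {}
    for shift in range(-(len(b) - 1), len(a)):
        start = max(0, shift)               # first overlapping index in a
        end = min(len(a), len(b) + shift)   # one past the last overlapping index
        islands, run = [], 0
        for i in range(start, end + 1):
            if i < end and a[i] == b[i - shift]:
                run += 1
            else:
                if run >= l:                # close an island that is long enough
                    ap = i - run
                    islands.append((ap, ap - shift, run))
                run = 0
        if islands:
            patterns[shift] = islands
    return patterns


if __name__ == "__main__":
    a = "AGCCGA" + "T" * 9 + "TGGGC"
    b = "CC" + "AGCCGA" + "ACACACACA" + "TGGGC"
    print(initial_patterns(a, b, l=4))
    # {-2: [(0, 2, 6), (15, 17, 5)]}
```

The outer loop visits each of the m_a + m_b − 1 possible overlaps once, and each overlap is scanned linearly, which is consistent with the O(m_a*m_b) run-time estimate given in the Performance Analysis section.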

Creating the Set of Derivative Patterns

The floating operation iterates through the set of initial patterns searching for opportunities to identify new or more complex patterns to populate the set of derivative patterns, S_d. This operation requires three pieces of information to be temporarily stored for each Island instance: the “shift” in which the initial pattern was found (shift), ap, and bp. Moreover, the parameter f is used to set an upper bound on the number of insertions or deletions allowed in a single pattern. Figure 3 shows an example of a pattern that could be included in the set of derivative patterns. Just as was the case when creating the set of initial patterns, only the longest pattern, in the case of Figure 3 the pattern of I₁, I₂ and I₃, is stored within the set of derivative patterns.

Figure 3. A pattern that could be included in the set of derived patterns. The black bars labelled I₁, I₂, I₃ denote three islands shared by the two sequences. Island I₃ would not be associated with I₁ and I₂ in the set of initial patterns.

In Figure 3, an insertion may have occurred in the region of A between I₂ and I₃. Likewise, a deletion may have occurred in the same region of B. The floating operation will account for this insertion/deletion by creating the more complex pattern containing the three islands. In addition to generating more complex patterns, floating can also “elongate” existing islands (make a larger island out of two smaller islands). Figure 4 further illustrates the creation of derivative patterns by the floating operation. In Figure 4(a), the island AC can be floated to create a new pattern with 3 islands. In Figure 4(b), by contrast, floating the island CG does not yield a new pattern for the set of derived patterns, because this island conflicts with an existing island in sequence B.

Figure 4. An example of how the floating operation (a) finds a new valid pattern and (b) does not find a new valid pattern. In (b), the island of Pattern[i+2] is actually a part of Pattern[i]’s island in Sequence B.
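A heavily simplified sketch of the floating idea is shown below. It assumes that the initial patterns are stored per shift as lists of (ap, bp, length) triples, and it only joins whole patterns whose shifts differ by at most f; the elongation of islands and the removal of subpatterns described above are not modelled. The function name float_patterns and the length of I₃ in the example are hypothetical.

```python
def float_patterns(initial: dict, f: int) -> list:
    """Join pairs of initial patterns whose shifts differ by at most f.

    `initial` maps a shift (ap - bp) to a list of (ap, bp, length) islands.
    A join is accepted only if the second pattern starts after the first
    one ends in BOTH sequences, so islands never overlap (cf. Figure 4(b)).
    """
    derived = []
    shifts = sorted(initial)
    for s1 in shifts:
        for s2 in shifts:
            if s1 == s2 or abs(s1 - s2) > f:
                continue
            left, right = initial[s1], initial[s2]
            last_ap, last_bp, last_n = left[-1]
            first_ap, first_bp, _ = right[0]
            if first_ap >= last_ap + last_n and first_bp >= last_bp + last_n:
                derived.append(left + right)
    return derived


if __name__ == "__main__":
    # Figure 1: I3 lies five bases after I2 in the first sequence and
    # seven bases after it in the second, so the shifts differ by two.
    initial = {
        -2: [(0, 2, 6), (15, 17, 5)],  # I1 and I2
        -4: [(25, 29, 4)],             # I3 (length 4 is made up)
    }
    print(float_patterns(initial, f=2))
    # [[(0, 2, 6), (15, 17, 5), (25, 29, 4)]]
    print(float_patterns(initial, f=1))
    # []
```

Only the stored positions and shifts are compared here; no base-to-base comparison is repeated, which mirrors the statement that floating works through manipulation of the data structure.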

Searching for Patterns in Multiple Sequences

Thus far, pattern discovery has been confined to the case in which just two sequences are considered. When looking at more than two sequences, pair-wise comparisons are performed with the introduction of another user parameter, t, representing the minimum number of sequences for which the pattern must be present in order to be reported to the user. In the simplest case, one can require patterns to be present in every sequence considered (the “always present” approach). From the comparison of the first two sequences the maximum number of patterns which can be present in all of the sequences is identified and successive comparisons with the remaining sequences will only be performed in order to determine if the pattern shared between the first two sequences is present in all other sequences. Patterns occurring in all but (at least) one sequence will not be identified by this approach. Rather such patterns can be identified by lowering the minimum threshold t (the “frequently present” approach). In this approach, the set of initial patterns and the set of derivative patterns must be created for each of the sequences under consideration.
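The effect of the threshold t can be illustrated with a small sketch. It does not reproduce PIDA's pair-wise discovery; instead it assumes that a candidate pattern (its islands and the allowed water range between consecutive islands) is already known, and checks its presence in each sequence with a regular expression. The function names and the example sequences are hypothetical.

```python
import re


def pattern_regex(islands, gaps):
    """Islands joined by water of variable length.

    islands: list of island strings; gaps: list of (min, max) water
    lengths, one per pair of consecutive islands.
    """
    parts = [re.escape(islands[0])]
    for island, (lo, hi) in zip(islands[1:], gaps):
        parts.append(".{%d,%d}" % (lo, hi))
        parts.append(re.escape(island))
    return re.compile("".join(parts))


def sequences_with_pattern(sequences, islands, gaps):
    """Indices of the sequences in which the pattern occurs."""
    rx = pattern_regex(islands, gaps)
    return [i for i, seq in enumerate(sequences) if rx.search(seq)]


def is_reported(sequences, islands, gaps, t):
    """A pattern is reported only if it occurs in at least t sequences."""
    return len(sequences_with_pattern(sequences, islands, gaps)) >= t


if __name__ == "__main__":
    seqs = [
        "AGCCGA" + "T" * 9 + "TGGGC",       # nine bases of water
        "CCAGCCGA" + "ACACACA" + "TGGGC",   # seven bases of water
        "GGGGGGGGGGGG",                     # pattern absent
    ]
    islands, gaps = ["AGCCGA", "TGGGC"], [(7, 11)]
    print(is_reported(seqs, islands, gaps, t=3))  # always present: False
    print(is_reported(seqs, islands, gaps, t=2))  # frequently present: True
```

Setting t equal to the number of sequences corresponds to the always present approach; any smaller value corresponds to the frequently present approach.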

Probability of Finding the Pattern by Chance

For any given pattern, one can determine the probability that the same pattern will occur in a sequence by chance. In doing so, it is possible to assess the novelty of the pattern’s appearance in the set of sequences. A pattern present in α sequences can be thought of as a set of N n-mers separated by a variable region. Figure 5 illustrates such a pattern for just one of the α sequences in which the pattern is contained; there are three (N=3) islands (indicated as black boxes) of length n₁,n₂ and n₃ within the sequence and the distance between I₁ and I₂ is equal to d₁ and the distance between I₂ and I₃ is equal to d₂. It is important to note that for this approximation, conservation between the sequences within the d₁ and d₂ regions may occur as long as the length of the conservation is less than the minimum length l stipulated by the user.

Figure 5. Visualization of the parameters used to describe a pattern for calculating the probability that the same pattern will occur by chance in α sequences.

Thus, the probability that a pattern is to occur in α sequences (of an alphabet of size A) can be calculated as:

P = \prod_{i = 1}^{α} (L_{i} * (\frac{1}{A^{\sum_{j = 1}^{N} n_{j}}} * \prod_{k = 1}^{N - 1} (d_{k}^{max} - d_{k}^{min})))

where L_i is the length of the i^th sequence in which the pattern is contained, N is the number of islands and n_j is the length of the j^th island. The parameters $d_{k}^{min}$ and $d_{k}^{max}$ represent the minimum and maximum distance allowed between two islands separated by water k.

The calculation of P assumes that every character in the alphabet is equally likely to occur. For genomic sequences, it is known that the presence/absence of particular n-mers is in fact not independent, as particular n-mers are highly repetitive (e.g. ALU sequences, micro-satellites, etc.), while others are rarely present (e.g. sequences recognized by restriction enzymes, etc.). Despite this, operating under the assumption of independence provides an overestimation of the frequency of an n-mer’s presence by chance. It must be noted that this is not a calculation of the p-value; rather, it is a quantification of the probability that a pattern will appear in a randomly generated sequence of the same length, where patterns having a smaller value of P are less likely to appear by chance. Moreover, P can be greater than 1 when L ≥ Aⁿ.
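The expression for P can be evaluated directly, as in the sketch below. The function name and the numerical inputs (four sequences of 350 characters, two islands of length four, and a water range of 8 to 12) are assumptions chosen for illustration. The code follows the formula as written, so a fixed-length gap (d^max = d^min) contributes a factor of zero.

```python
from math import prod


def chance_probability(seq_lengths, island_lengths, gap_ranges, alphabet_size):
    """Evaluate P for a pattern, following the formula term by term.

    seq_lengths:    L_i for each of the alpha sequences
    island_lengths: n_j for each of the N islands
    gap_ranges:     (d_min, d_max) for each of the N - 1 water regions
    alphabet_size:  A
    """
    total_n = sum(island_lengths)
    gap_factor = prod(d_max - d_min for d_min, d_max in gap_ranges)
    p = 1.0
    for length in seq_lengths:
        p *= length * gap_factor / alphabet_size ** total_n
    return p


if __name__ == "__main__":
    p = chance_probability([350] * 4, [4, 4], [(8, 12)], alphabet_size=4)
    print(f"{p:.2e}")  # about 2.08e-07

    # A single short island in a long sequence: L >= A^n, so P exceeds 1.
    print(chance_probability([1000], [4], [], alphabet_size=4))
```

The second call shows why P is not a true probability: with L = 1000 and A⁴ = 256, the value is 1000/256 ≈ 3.9.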

RESULTS AND DISCUSSION

Performance Analysis

The run-time estimate for building the set of initial patterns common between two sequences is O(m_a*m_b) where m_a and m_b are the lengths of the first and the second sequence. For sequences of approximately the same length, this can be simplified as O(m²). When searching for patterns that appear in more than two sequences with the always present approach, only one full pair-wise sequence comparison is necessary; the patterns identified for the first two sequences are then compared to all of the other sequences. Therefore, building the set of initial patterns following the always present approach has the run-time estimation of O((α−1)*m²) where α is the number of sequences. In the more likely case for pattern discovery in which the pattern is not expected to occur in all of the sequences in a group, comparisons between each pair of sequences must be carried out, resulting in a run-time estimate of

O (\frac{α (α - 1)}{2} * m^{2}) .

A new data structure was designed in order to minimize the amount of memory needed to store each pattern (see the Methods section). For randomly generated sequences as well as the upstream sequences from Escherichia coli K12 [NCBI: NC_000913] and Bacillus subtilis subsp. subtilis str. 168 [NCBI: NC_000964], the initial set of patterns of islands that are at least four nucleotides long (l=4) was created. As the parameter s is not considered when creating the set of initial patterns, the patterns can contain as few as one individual island under the condition that its length is greater than or equal to the user parameter l. Likewise, the set of initial patterns can contain patterns having more islands than the minimum parameter s. The run-times and the average number of unique patterns within the set of initial patterns for the three sets of sequences are shown in Figure 6. The time and memory usage is modest for all sequences of length up to 1000 bp.

Figure 6. Run-time performance and memory usage (on a 3GHz Pentium 4 with 1GB of RAM) for creating the set of initial patterns where l = 4 using two sequences from the upstream regions of two functionally related genes in E. coli, two from B. subtilis, and two randomly generated sequences.

The set of derivative patterns can be much larger than the set of initial patterns. The upper bound of the number of potential derivative patterns is the number of patterns in the initial set times the number of insertions/deletions allowed in a pattern (f) times four. The coefficient of four represents the fact that both base insertions and base deletions are considered in both sequences. Since the largest number of patterns that may be contained in the set of initial patterns is m, the derivative set may contain up to 4mf patterns. For each of the possible m patterns in the set of initial patterns, 2f comparisons must be made. These comparisons, however, are not base-to-base comparisons; rather they are implemented through a manipulation of the data structure (see the Methods section). The run-time operation count for creating the set of derivative patterns from the initial set of patterns common between two sequences is better estimated as O(m*2f) ≈ O(mf). When following the always present approach for more than two sequences, this estimate can be formalized as O(αmf), which is simpler than the estimation for the frequently present search:

O (\frac{α (α - 1)}{2} * m f) .

In Figure 7 one can see that the time and memory usage is acceptable for all sequences of length up to 1000 bp when the floating operation is implemented. It is important to note that during the creation of the set of derivative patterns, the parameter s is enforced. Therefore, the patterns reported in Figure 7 which are identified under the conditions of s = 2 and l = 4 include patterns in which every island is at least 4 nucleotides long and each pattern includes 2 or more islands. While PIDA analysis results from two sequences may identify up to m+4mf patterns, the results in Figure 7 suggest that significantly fewer patterns are found, even in sequences from the same organism.

Figure 7. Run-time performance and memory usage for creating only the set of derivative patterns using the same sequences as Figure 6 for s = 2, l = 4 and f = 1.

The two-step process of generating both the set of initial and derivative patterns yields a run-time estimate of O((α−1)*m²+ αmf) when following the always present approach and

O (\frac{α (α - 1)}{2} * m^{2} + \frac{α (α - 1)}{2} * m f)

when following the frequently present approach.
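To give a feel for how the two estimates diverge as the number of sequences grows, the operation counts can be evaluated for concrete values. The sketch below simply plugs numbers into the two expressions above, ignoring constant factors; the function names and the chosen values (α = 10, m = 1000, f = 3) are illustrative assumptions.

```python
def always_present_cost(alpha: int, m: int, f: int) -> int:
    """(alpha - 1) * m^2 + alpha * m * f"""
    return (alpha - 1) * m ** 2 + alpha * m * f


def frequently_present_cost(alpha: int, m: int, f: int) -> int:
    """alpha * (alpha - 1) / 2 * (m^2 + m * f)"""
    pairs = alpha * (alpha - 1) // 2  # number of sequence pairs
    return pairs * m ** 2 + pairs * m * f


if __name__ == "__main__":
    alpha, m, f = 10, 1000, 3
    print(always_present_cost(alpha, m, f))      # 9030000
    print(frequently_present_cost(alpha, m, f))  # 45135000
```

In both cases the m² term from building the initial patterns dominates; the frequently present approach is more expensive by roughly a factor of α/2 because every pair of sequences must be compared.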

Preliminary Experimental Results

To compare the results obtained by PIDA with those of existing methods, 15 transcription factor binding patterns for E. coli and the respective set of promoter sequences were used. These 15 tests were selected because both the methods of Robison et al. (1998) and Li et al. (2002) used the same set of sequences to identify the candidate binding site. As a result of these tests, with l equal to 4, s equal to 2 and f equal to 3, PIDA was able to identify all of the referenced patterns. Next, the 325 nt upstream sequences from the start of translation for the remaining ≈2500 annotated genes (excluding those annotated as hypothetical genes) in E. coli were collected from NCBI. The 15 patterns which were recognized by PIDA, Robison et al. (1998) and Li et al. (2002) were then compared to all of the ≈2500 upstream sequences in hopes of identifying new instances of the pattern. For the binding site within the LexA group, having the pattern CTGTNNNNNNNNACAG (Robison et al., 1998), three additional occurrences of this pattern were identified, in the upstream regions of PolB (DNA polymerase II), PurT (phosphoribosylglycinamide formyltransferase 2), and RecA (recombinase A). The inclusion of RecA is not surprising as RecA and LexA both regulate the SOS response (Thliveris and Mount, 1992). The probability that this pattern would occur in the 350 nt sequences of LexA and the three additional occurrences is 2*10⁻⁷. When the parameters were altered such that the minimum length of the island need only be 3, a very similar pattern CTGTNNNNNNNNNCAG was observed in the upstream region of the sequences for MalZ (maltodextrin glucosidase), NapH (quinol dehydrogenase membrane component), PtsN (sugar-specific enzyme IIA component of PTS), and TldD (predicted peptidase). The functional significance of identifying these factors is unknown at this time. Comparison of each binding pattern with the ≈2500 sequences took ≈1 minute per pattern.
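A consensus string such as the LexA pattern maps directly onto the island/water description used throughout this paper. The sketch below splits a consensus into its islands together with their relative positions (rp); the function name parse_consensus and the use of N as the only wildcard character are assumptions made for illustration.

```python
import re


def parse_consensus(consensus: str, wildcard: str = "N"):
    """Split a consensus string into (island, rp) pairs.

    Runs of the wildcard character are treated as water; every other
    run of characters is an island whose rp is its offset from the
    start of the consensus.
    """
    island_run = re.compile("[^%s]+" % re.escape(wildcard))
    return [(m.group(), m.start()) for m in island_run.finditer(consensus)]


if __name__ == "__main__":
    lexa = "CTGT" + "N" * 8 + "ACAG"  # CTGTNNNNNNNNACAG
    print(parse_consensus(lexa))
    # [('CTGT', 0), ('ACAG', 12)]
```

For the LexA consensus this gives two islands of length four separated by eight bases of water, which satisfies the parameter values l = 4 and s = 2 used in the test described above.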

CONCLUSIONS

The PIDA approach to motif finding is designed to tackle a large class of problems which involve strongly conserved islands separated by arbitrary amounts of unconstrained sequence. PIDA is, for example, well-suited for the identification of amino acid patterns which correspond to high-contact-order domains in certain proteins. The constituent amino acids in these domains are packed close together in 3-dimensional space when the protein is folded, but they are not necessarily co-located within the amino acid sequence that defines the protein. Thus, in proteins with low sequence similarity, PIDA is capable of identifying these domains as they are characterized by regions of conservation separated by stretches of unrelated amino acid sequence. Simple alphabets like H and P (hydrophobic and hydrophilic) may be used versus the full 20 letter code. Analysis of such proteins using the implementation of this algorithm with variable alphabet sizes between two and 20 has yielded promising proof-of-principle results (Mannige, 2004).

PIDA has been designed to be as generally applicable as possible since there are a variety of sequence problems of this type, from transcription factor binding sites in DNA, to structural motifs in non-coding RNA, to high-contact-order domains in certain proteins. In particular, PIDA has no inherent restrictions on pattern parameters in terms of the size of islands, the number of islands, or the amount of water between islands. However, this implementation does include the facility to incorporate parameter restrictions so that the code can be tuned to best address the specific problem facing the user.

Furthermore, to maintain broad applicability, PIDA is designed to work on any user-specified alphabet. This provides more than just a facility for attacking both nucleic acid and amino acid motif finding problems. Choosing the appropriate alphabet is an essential aspect of the science of motif finding. Consider again the example of high-contact-order domains in certain proteins. Looking for protein motifs based on amino acid characteristics (e.g., using an alphabet which identifies amino acids by their physical properties such as charge or polarity) may yield patterns which would be invisible to a search using a larger alphabet due to mutations which preserve function without preserving amino acid identity.

Acknowledgments

C.P.’s and J.G.R.’s work was supported by training fellowships from the Keck Center for Interdisciplinary Bioscience Training of the Gulf Coast Consortia (NLM Grant No. 5T15LM07093). B.M.P. and J.G.R. acknowledge support from NIH and the Robert A. Welch Foundation. Ranjan Mannige is thanked for many interesting conversations about PIDA.


References

Anwar M, Nguyen T, Turcotte M. Identification of consensus RNA secondary structures using suffix arrays. BMC Bioinformatics. 2006;7:244. doi: 10.1186/1471-2105-7-244.
Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the International Conference on Intelligent Systems for Molecular Biology. 1994;2:28–36.
Henikoff S, Henikoff JG, Alford WJ, Pietrokovski S. Automated construction and graphical presentation of protein blocks from unaligned sequences. Gene. 1995;163:GC17–GC26. doi: 10.1016/0378-1119(95)00486-p.
Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999;15:563–577. doi: 10.1093/bioinformatics/15.7.563.
Ji Y, Xu X, Stormo GD. A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned structures. Bioinformatics. 2004;20:1591–1602. doi: 10.1093/bioinformatics/bth131.
Lenhard B, Wasserman WW. TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics. 2002;18:1135–6. doi: 10.1093/bioinformatics/18.8.1135.
Li H, Rhodius V, Gross C, Siggia S. Identification of the binding sites of regulatory proteins in bacterial genomes. Proceedings of the National Academy of Science USA. 2002;99:11772–7. doi: 10.1073/pnas.112341999.
Liu J, Wang JT, Hu J, Tian B. A method for aligning RNA secondary structures and its application to RNA motif detection. BMC Bioinformatics. 2005;6:89. doi: 10.1186/1471-2105-6-89.
Mannige RV. P-PIDA: A Long-range Pattern Detection Algorithm for Proteins. University of Houston Bachelor’s Degree Thesis. 2004;2004:1–44.
Marson L, Sagot MF. Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. Journal of Computational Biology. 2000;7:345–362. doi: 10.1089/106652700750050826.
McCue L, Thompson W, Carmack C, Ryan MP, Liu JS, Derbyshire V, Lawrence CE. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Research. 2001;29:774–82. doi: 10.1093/nar/29.3.774.
McGuire AM, Hughes JD, Church GM. Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Research. 2000;10:744–57. doi: 10.1101/gr.10.6.744.
Mwangi M, Siggia E. Genome wide identification of regulatory motifs in Bacillus subtilis. BMC Bioinformatics. 2003;4:18. doi: 10.1186/1471-2105-4-18.
Neuwald AF, Liu JS, Lawrence CE. Gibbs motif sampling: Detection of bacterial outer membrane protein repeats. Protein Science. 1995;4:1618–1632. doi: 10.1002/pro.5560040820.
Pavesi G, Mauri G, Stefani M, Pesole G. RNAProfile: an algorithm for finding conserved secondary structure motifs in unaligned RNA sequences. Nucleic Acids Research. 2004;32:3258–3269. doi: 10.1093/nar/gkh650.
Quandt K, Frech K, Karas H, Wingender E, Werner T. MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Research. 1995;23:4878–84. doi: 10.1093/nar/23.23.4878.
Rigoutsos I, Floratos A. Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics. 1998;14:55–67. doi: 10.1093/bioinformatics/14.1.55. [DOI] [PubMed] [Google Scholar]
Robison K, Manson McGuire A, Church M. A Comprehensive Library of DNA-binding Site Matrices for 55 Proteins Applied to the Complete Escherichia coli K-12 Genome. Journal of Molecular Biology. 1998;284:241–254. doi: 10.1006/jmbi.1998.2160. [DOI] [PubMed] [Google Scholar]
Roth FP, Hughes JD, Estep PW, Church GM. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnology. 1998;10:939–45. doi: 10.1038/nbt1098-939. [DOI] [PubMed] [Google Scholar]
Sinha S, Tompa M. A statistical method for finding transcription factor binding sites. Proceedings of the International Conference on Intelligent Systems for Molecular Biology. 2000;8:344–54. [PubMed] [Google Scholar]
Thliveris AT, Mount DW. Genetic identification of the DNA binding domain of Escherichia coli LexA protein. Proceedings of the National Academy of Sciences USA. 1992;89:4500–4504. doi: 10.1073/pnas.89.10.4500.
Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z. Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology. 2005;23:137–144. doi: 10.1038/nbt1053.
Vanet A, Marsan L, Labigne A, Sagot M. Inferring regulatory elements from a whole genome. An analysis of Helicobacter pylori sigma(80) family of promoter signals. Journal of Molecular Biology. 2000;297:335–353. doi: 10.1006/jmbi.2000.3576.
Xing EP, Wu W, Jordan MI, Karp RM. Logos: a modular Bayesian model for de novo motif detection. Journal of Bioinformatics and Computational Biology. 2004;2:127–154. doi: 10.1142/s0219720004000508.
Yada T, Totoki Y, Ishikawa M, Asai K, Nakai K. Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences. Bioinformatics. 1998;14:317–325. doi: 10.1093/bioinformatics/14.4.317.
Yao Z, Weinberg Z, Ruzzo WL. CMfinder - a covariance model based RNA motif finding algorithm. Bioinformatics. 2006;22:445–452. doi: 10.1093/bioinformatics/btk008.
