PLoS One. 2013 Dec 4;8(12):e80660. doi: 10.1371/journal.pone.0080660

PMS: A Panoptic Motif Search Tool

Hieu Dinh 1, Sanguthevar Rajasekaran 1,*
Editor: Gajendra P S Raghava 2
PMCID: PMC3851466  PMID: 24324619

Abstract

Background

Identification of DNA/Protein motifs is a crucial problem for biologists. Computational techniques could be of great help in this identification. In this direction, many computational models for motifs have been proposed in the literature.

Methods

One such important model is the $(l, d)$-motif model. In this paper we describe a motif search web tool that predominantly employs this motif model. This web tool exploits state-of-the-art algorithms for solving the $(l, d)$-motif search problem.

Results

The online tool has been helping scientists identify many unknown motifs. Many of our predictions have been successfully verified as well. We hope that this paper will expose this crucial tool to many more scientists.

Availability and requirements

Project name: PMS - Panoptic Motif Search Tool. Project home page: http://pms.engr.uconn.edu or http://motifsearch.com. Licence: PMS tools will be readily available, without restrictions, to any scientist wishing to use them for non-commercial purposes. The online tool is freely available without login.

Introduction

Motif search is an important problem in biology, and computational techniques could greatly help in solving it. A number of computational motif search tools can be found in the literature; see, e.g., PRATT [1], MEME [2], DILIMOT [3], SLiMDisc [4], SLiMFinder [5] and FIRE-pro [6].

Each of the above tools is based on a specific model of motif search. An important model for motifs is the $(l, d)$-motif search model. A simple version of this model can be stated as follows. We are given $n$ input sequences $S_1, S_2, \ldots, S_n$, each of length $m$. Also given are two integers $l$ and $d$. The problem is to find a motif $M$ that is present in the $n$ input sequences. It is known that $M$ is of length $l$ and that it occurs in each of the $n$ input sequences within a Hamming distance of $d$.

This model has been shown to yield better sensitivities than other models when tested on known biological data (see e.g., [7]). The problem of $(l, d)$-motif search is intractable [8]. Numerous algorithms have been proposed for solving the $(l, d)$-motif search problem. Examples are RISO and RISOTTO [9]. But RISO and RISOTTO are downloadable programs and there are no corresponding web systems. In this paper we describe a web system for motif search that uses the $(l, d)$-motif model. Our web system has the following features: 1) We employ several state-of-the-art algorithms for $(l, d)$-motif search. We can identify longer motifs than RISO and RISOTTO. RISO can only identify motifs of length up to 14. PMS can identify motifs of length up to 23; 2) Both DNA and protein motifs are supported; 3) We support quorum motif search. In this case the motif(s) need not occur in all the input sequences. Quorum motif search is significantly more difficult than the regular version [10]; 4) Dyad motifs are also found. In particular, the dyad motif under consideration could consist of two segments separated by a gap; 5) We employ a scoring mechanism for the putative motifs found; 6) The user interface for PMS is very friendly; and 7) In PMS, user emails are optional.

To the best of our knowledge, there is no other comprehensive motif search system, based on the $(l, d)$-motif model, comparable to ours.

Results

The PMS Webserver

The PMS server is freely available at http://pms.engr.uconn.edu or at http://motifsearch.com. The website is open to any user, and login is not required. However, a user with a login account can view and retrieve his or her submission history. Also, a submission associated with a registered user is kept in the system indefinitely unless the user deletes it. A submission from a user without a login account is stored for one month and then removed automatically.

The purpose of the motif search tool is to help biologists identify novel motifs that may be present in input DNA and/or Protein sequences. Simple and user-friendly input forms will allow users to submit queries easily and quickly. Informative output and visualizations will permit users to analyze the results carefully. These features of the website are described in more detail in the following sections.

Input Sequences and Parameters

The input data can be either DNA or protein sequences. The length of each sequence is required to be between 15 and 1000 characters. The number of input sequences is required to be between 5 and 500. The input sequences should be provided in FASTA, the well-known text-based format.
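The limits above can be checked before submission. The following is a minimal sketch, not the server's own validation code: the function names `parse_fasta` and `check_limits` are hypothetical, and only the numeric limits (15 to 1000 per sequence, 5 to 500 sequences) come from the text.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_limits(records, min_len=15, max_len=1000, min_seqs=5, max_seqs=500):
    """Return a list of problems; an empty list means the input fits the limits."""
    problems = []
    if not min_seqs <= len(records) <= max_seqs:
        problems.append(
            f"expected {min_seqs}-{max_seqs} sequences, got {len(records)}"
        )
    for header, seq in records:
        if not min_len <= len(seq) <= max_len:
            problems.append(
                f"{header}: length {len(seq)} outside {min_len}-{max_len}"
            )
    return problems
```

Calling `check_limits(parse_fasta(text))` on the contents of a FASTA file returns an empty list when the dataset satisfies the stated limits.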

For each input dataset, a set of parameters will be chosen by the user. These parameters are shown in Figure 1. The first parameter is called “quorum percent” which is the minimum percentage of the input sequences that contain motifs. Quorum percent is set to 75% by default.

Figure 1. Parameters for DNA sequences.

The set of required parameters for DNA sequences. The first parameter is “quorum percent” which is the minimum percentage of the input sequences containing motifs. The second parameter allows users to choose the structure of motifs.

The second parameter allows users to choose the structure of motifs. Currently, the tool considers two structures, namely, monads and dyads. A monad is a contiguous string and a dyad consists of two segments separated by a gap. A monad is assumed by default. For monads, the users will choose the motif length. By default, the motif length is chosen to be “Any” which means that the tool will search for motifs of lengths between 10 and 25. If information about the motif length is known, we recommend that it be used to reduce the processing time. For dyads, users should choose the length of the first segment or box, the length of the second box and the length of the gap between the two boxes. If the lengths are chosen to be “Any”, processing will proceed similarly to that for monads.

The third parameter applies only to DNA sequences and gives users the option of also considering the reverse complement sequences. If the input DNA sequences have the same orientation, the third parameter should be chosen to be “No”. Otherwise, we recommend that it be chosen to be “Yes”.
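For readers unfamiliar with the term, the reverse complement of a DNA sequence can be computed as in the short sketch below. How the PMS server itself incorporates the reverse strand is not described here, so the helper `with_reverse_complements` is only an illustrative assumption of how both strands might be made available to a search.

```python
# Translation table mapping each base to its Watson-Crick complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_reverse_complements(seqs):
    """Pair each sequence with its reverse complement (hypothetical helper)."""
    return [(s, reverse_complement(s)) for s in seqs]
```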

Submitting Jobs

After entering the sequences and relevant parameters, the user clicks on the “Submit” button on the submission form. If the data entered are valid, the submission will enter the processing queue. Once the processing is over, a results page will be displayed. Information about the submission will appear on top of the results page as shown in Figure 2. Users can update contact email or change the parameters by clicking on either the “Update” button or the “Change parameters” button, respectively.

Figure 2. Query information.

Information about submission. Users can click on the “Update” button or the “Change parameters” button to update the contact email or change parameters.

After submission, the submission status can be one of the following: in processing queue, being processed, or processed. If the submission has not been processed yet, the bottom of the results page will appear as shown in Figure 3. Users can click on the “Refresh” button to update the processing status. Users can either wait for their submission to be processed or bookmark the results page and return to it later. If the contact email is provided, the system will send a notification email when the submission is processed. The notification email will include the URL for the results page.

Figure 3. Result not available.

An example of the results page when the submission has not been processed yet. Users can click on the “Refresh” button to update the processing status.

The processing time of any submission varies from a few minutes to a few hours, depending on the data, the parameters, and the workload of the server. If users feel that the tool is taking too much time to process, we recommend that they provide a contact email. Providing an email has a number of benefits. The first benefit is that users will receive email notifications when query processing is complete. The second benefit is that their submissions will be stored in the system for as long as they want. The third and perhaps the most important benefit is that they can retrieve their submission histories (as discussed in the Submissions History section).

Output

Once the submission is processed, the bottom of the results page will appear as shown in Figure 4. Identified candidate motif(s) will appear on the left and the input sequences will appear on the right. If no motifs are found, we recommend reducing the value of the quorum percent.

Figure 4. Result available - DNA sequences.

An example of the results page when the submission is processed. The locations of the second motif are marked on the DNA input sequences.

The candidate motif(s) found are ranked according to their scores. The score of a candidate motif is the logarithm of the probability that the motif occurs by random chance. The smaller the score, the more biologically significant the motif is. For more details on the scoring scheme, the readers are referred to [10]. For each candidate motif, users can click on the “View motif locations” button corresponding to the motif in order to view its locations, i.e., its instances, in the input sequences. The locations of the motif instances will be highlighted in the input sequences as shown in Figure 4. The probability weight matrix of the motif is directly calculated through its motif instances and will appear above the input sequences. The probability of a DNA character at each column in the probability weight matrix is its frequency when its motif instances are aligned. When a motif is chosen, users can click on the “Save motif locations in text” button to save its locations in a text file.
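The column-frequency construction of the probability weight matrix described above can be sketched as follows. This is an illustration rather than the tool's code: the function name and the dictionary-per-column layout are assumptions, and the scoring function itself (detailed in [10]) is not reproduced.

```python
from collections import Counter


def probability_weight_matrix(instances, alphabet="ACGT"):
    """Return one dict per column giving the frequency of each character.

    `instances` are the aligned motif instances, all of the same length.
    """
    if not instances:
        raise ValueError("need at least one motif instance")
    length = len(instances[0])
    if any(len(x) != length for x in instances):
        raise ValueError("all instances must have the same length")
    matrix = []
    for column in zip(*instances):  # iterate over aligned columns
        counts = Counter(column)
        matrix.append({c: counts[c] / len(instances) for c in alphabet})
    return matrix
```

For the four instances `ACG`, `ACT`, `AAG` and `TCG`, the first column holds `A` with frequency 0.75 and `T` with frequency 0.25.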

For input protein sequences, the results page, shown in Figure 5, is similar to that for DNA sequences, except that the probability weight matrix is not shown because it would be large for protein sequences.

Figure 5. Result available - Protein sequences.

An example of the results page when the submission is processed. The locations of the second motif are marked on the protein input sequences.

Submissions History

The website allows users to easily manage their submission(s) history. To start the submissions history feature, click on the link “Submission history” on the left menu of the website. To view submissions history, enter the contact email and password on the submissions history form. If the password has not been set by a user yet, (s)he can go to the reset password form and enter the contact email. An email will be sent to the contact email including a URL that allows the user to reset the password.

The list of submissions will be shown as in Figure 6. Users can sort their submissions based on query ID, submission time, or processing status. If the users want to view a particular submission, they can click on the link “View detail” of the corresponding submission.

Figure 6. Submission history.

An example of the submissions history. Users can sort their submissions based on query ID, submission time, or processing status. Users can view a particular submission in detail by clicking on the link “View detail” of the corresponding submission.

Feedback

The website supports an extensive feedback section. Users can easily submit feedback, comments, and questions using the feedback form. Feedback and comments will help us improve the website. To access the feedback form, click on the link “Feedback” on the left menu of the website.

Discussion

In this paper we have described a new web tool for motif search called PMS. This tool is based on the $(l, d)$-motif search model. This is a comprehensive web tool offering many crucial features and we are not aware of any other computational motif search tool comparable to ours. In the future we plan to support additional features. For example, we will identify candidate motifs with more than two segments (separated by gaps). Another important feature will be to score the candidate motifs based on publicly available experimental data. User feedback will also be taken into account in enhancing the features of our web tool PMS. We also plan to incorporate other motif models in the future. In addition we plan to work on finding longer motifs.

Materials and Methods

Our online motif search tool is built on state-of-the-art algorithms for the best-known motif model: $(l, d)$-motif search, also called Planted Motif Search (PMS). The PMS model has been shown to be very effective in identifying motifs (see e.g., [7]). The PMS Problem is defined as follows.

  • Definition 0.1 PMS Problem: given $n$ sequences and integer parameters $l$, $d$ and $q$, find all strings $M$ of length $l$ such that $M$ appears in at least $q$ out of the $n$ given sequences within $d$ mutations. Each such string $M$ is a putative motif. Any $l$-mer (i.e., a substring of length $l$) $x$ in any input string such that the Hamming distance between $x$ and $M$ is at most $d$ is known as an instance of the motif $M$.
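To make the definition concrete, the sketch below tests whether a given string satisfies it by direct scanning. The parameter names are chosen here for illustration: `d` is the mutation bound and `q` the minimum number of sequences in which the string must appear. The function names are hypothetical and this is a brute-force check, not one of the PMS algorithms.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif, seq, d):
    """True if some substring of seq with len(motif) is within distance d."""
    l = len(motif)
    return any(
        hamming(motif, seq[i:i + l]) <= d for i in range(len(seq) - l + 1)
    )


def is_motif(motif, seqs, d, q):
    """True if motif has an instance in at least q of the sequences."""
    return sum(occurs_within(motif, s, d) for s in seqs) >= q
```

With the sequences `AAACGTAA`, `TTACCTTT` and `GGGGGGGG`, the string `ACGT` passes the check for `d = 1, q = 2` (an exact instance in the first sequence, `ACCT` in the second) but not for `q = 3`.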

The PMS Algorithms

In our web tool, we have used a combination of the current best PMS algorithms proposed in [10], [11], and [12].

We now summarize some of the techniques used in these algorithms.

Let $d_H(x, y)$ stand for the Hamming distance between two strings $x$ and $y$ of the same length. Let $S_1, S_2, \ldots, S_n$ be the given set of input sequences each of length $m$. For simplicity, consider the version where $q = n$. The PMS0 algorithm works as follows [13]: Consider $S_1$. Let $x$ be an $l$-mer of $S_1$. Define the $d$-neighborhood $B_d(x)$ of $x$ to be the collection of all the $l$-mers $y$ such that $d_H(x, y) \le d$. If $x$ is an instance of an $(l, d)$-motif $M$, then, clearly $M$ will be in $B_d(x)$. However, we do not know which $l$-mers of $S_1$ are instances of the motif we are looking for. Thus, PMS0 constructs $B_d(x)$ for every $l$-mer $x$ in $S_1$. It then performs a union $C$ of all of these $d$-neighborhoods. $C$ contains all the $(l, d)$-motifs. For each $l$-mer $y$ in $C$, the algorithm checks if $y$ is an $(l, d)$-motif or not in an obvious manner. Note that for a given $l$-mer $y$, we check if it is an $(l, d)$-motif or not in $O(mnl)$ time. A variation of this algorithm is called PMS1 and is described below [13]:

Algorithm PMS1

  1. Compute $C_i$ for each input sequence $S_i$, $1 \le i \le n$. Here $C_i = \bigcup_{x \in_l S_i} B_d(x)$. In other words, $C_i$ is nothing but the union of $d$-neighborhoods of all the $l$-mers in $S_i$, $1 \le i \le n$. The notation $x \in_l S_i$ indicates that the $l$-mer $x$ is a substring in $S_i$.

  2. The $(l, d)$-motifs are now computed as $\bigcap_{i=1}^{n} C_i$.
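The two steps of PMS1 can be sketched directly with Python sets. This is a didactic version under stated assumptions: neighborhoods are enumerated explicitly rather than with the integer sorting described in [13], the function names are hypothetical, and the default alphabet is DNA.

```python
from itertools import combinations, product


def neighborhood(x, d, alphabet="ACGT"):
    """All strings of length len(x) within Hamming distance d of x."""
    result = set()
    for k in range(d + 1):
        # Choose exactly k positions to change, then every substitution.
        for positions in combinations(range(len(x)), k):
            choices = [[c for c in alphabet if c != x[p]] for p in positions]
            for letters in product(*choices):
                y = list(x)
                for p, c in zip(positions, letters):
                    y[p] = c
                result.add("".join(y))
    return result


def pms1(seqs, l, d, alphabet="ACGT"):
    """Step 1: union of neighborhoods per sequence. Step 2: intersect them."""
    candidate_sets = []
    for s in seqs:
        c_i = set()
        for i in range(len(s) - l + 1):
            c_i |= neighborhood(s[i:i + l], d, alphabet)
        candidate_sets.append(c_i)
    return set.intersection(*candidate_sets) if candidate_sets else set()
```

The candidate sets grow quickly with the mutation bound, so this sketch is only practical for small values of `l` and `d`.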

Algorithm PMS5 can be thought of as an extension of PMS0 [11]. If $T$ is a collection of strings, let $M_{l,d}(T)$ denote the $(l, d)$-motifs present in $T$. If the input sequences are $S_1, S_2, \ldots, S_n$, let $S'_k = \{S_{2k}, S_{2k+1}\}$ and let $p = (n-1)/2$. The idea of PMS5 is to compute the $(l, d)$-motifs of $S$ as $\bigcup_{x \in_l S_1} \bigcap_{k=1}^{p} M_{l,d}(x, S'_k)$.

In order to compute $M_{l,d}(x, S'_k)$ for any $l$-mer $x$, the algorithm uses a subroutine to compute the common $d$-neighborhood of three $l$-mers. Specifically, let $x, y, z$ be any three $l$-mers. We use $B_d(x, y, z)$ to denote the common $d$-neighborhood of $x, y$, and $z$. In other words, $B_d(x, y, z)$ is nothing but the set of all $l$-mers that are at a distance of no more than $d$ from each of the three $l$-mers $x, y$ and $z$.

To compute $B_d(x, y, z)$, PMS5 represents $B_d(x)$ as a tree $\mathcal{T}_d(x)$. Each node in this tree is an $l$-mer in $B_d(x)$. The root of $\mathcal{T}_d(x)$ is the $l$-mer $x$. The depth of $\mathcal{T}_d(x)$ is $d$. $\mathcal{T}_d(x)$ is traversed in a depth-first manner. Let $t$ be any node in this tree. During the traversal, $t$ will be output if $t$ is in $B_d(x, y, z)$. While visiting any node $t$, we check if there is a descendant $t'$ of $t$ such that $t'$ is in $B_d(x, y, z)$. The subtree rooted at $t$ will be pruned if there is no such descendant. The problem of checking if $t$ has any descendant that is in $B_d(x, y, z)$ is formulated as an integer linear program (ILP) on ten variables. This ILP is solved in $O(1)$ time.

Any algorithm for solving the PMS problem when $q < n$ is typically named with a prefix of ‘q’. One of the first algorithms to address this version of the PMS problem was qPMSPrune [12]. Algorithm qPMSPrune is based on the following observation: If $M$ is any $(l, d, q)$-motif of the input strings $S_1, S_2, \ldots, S_n$, then there exists an $i$ (with $1 \le i \le n - q + 1$) and an $l$-mer $x \in_l S_i$ such that $M$ is in $B_d(x)$ and $M$ is an $(l, d, q-1)$-motif of the input strings excluding $S_i$. The algorithm runs through every possible value of $i$, $1 \le i \le n - q + 1$. For a given value of $i$, it considers every $l$-mer $x$ of $S_i$. Specifically, it constructs $B_d(x)$ and identifies elements of $B_d(x)$ that are $(l, d, q-1)$ motifs (with respect to input strings other than $S_i$). $B_d(x)$ is represented as a tree with $x$ as the root. This tree is traversed in a depth first manner and some pruning conditions are used to prune subtrees that do not have any motifs.
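The observation behind qPMSPrune can be illustrated without the tree traversal or the pruning conditions, which are what make the real algorithm fast. The sketch below is an unpruned, brute-force rendering of that observation only. All function names are hypothetical, and `d` and `q` denote the mutation bound and the quorum as in the PMS definition.

```python
from itertools import combinations, product


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def neighborhood(x, d, alphabet="ACGT"):
    """All strings of length len(x) within Hamming distance d of x."""
    result = set()
    for k in range(d + 1):
        for positions in combinations(range(len(x)), k):
            choices = [[c for c in alphabet if c != x[p]] for p in positions]
            for letters in product(*choices):
                y = list(x)
                for p, c in zip(positions, letters):
                    y[p] = c
                result.add("".join(y))
    return result


def count_occurrences(y, seqs, d):
    """Number of sequences containing a substring within distance d of y."""
    l = len(y)
    return sum(
        any(hamming(y, s[i:i + l]) <= d for i in range(len(s) - l + 1))
        for s in seqs
    )


def quorum_motifs_naive(seqs, l, d, q, alphabet="ACGT"):
    """Unpruned search: any quorum motif must lie in the neighborhood of
    some substring of one of the first n - q + 1 sequences."""
    n = len(seqs)
    motifs = set()
    for i in range(n - q + 1):
        others = seqs[:i] + seqs[i + 1:]
        for start in range(len(seqs[i]) - l + 1):
            x = seqs[i][start:start + l]
            for y in neighborhood(x, d, alphabet):
                if y not in motifs and count_occurrences(y, others, d) >= q - 1:
                    motifs.add(y)
    return motifs
```

Restricting the outer loop to the first `n - q + 1` sequences is safe because a string that occurs in at least `q` sequences cannot avoid all of them.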

Algorithm qPMS7 of [10] extends the observation of qPMSPrune as follows: If $M$ is any $(l, d, q)$-motif of the input strings $S_1, S_2, \ldots, S_n$, then there exist $1 \le i \ne j \le n$ and $l$-mer $x \in_l S_i$ and $l$-mer $y \in_l S_j$ such that $M$ is in $B_d(x) \cap B_d(y)$ and $M$ is an $(l, d, q-2)$-motif of the input strings excluding $S_i$ and $S_j$. qPMS7 considers every possible pair $(i, j)$, $1 \le i, j \le n - q + 2$ and $i \ne j$. For a given pair $(i, j)$, every possible pair of $l$-mers $(x, y)$ is considered (where $x$ is from $S_i$ and $y$ is from $S_j$). For a given $x$ and $y$, the algorithm finds all the elements of $B_d(x) \cap B_d(y)$ that are $(l, d, q-2)$ motifs (with respect to input strings other than $S_i$ and $S_j$). $B_d(x) \cap B_d(y)$ is explored by traversing an acyclic graph, denoted as $G_d(x, y)$. $G_d(x, y)$ is traversed in a depth first manner. Here again effective pruning conditions are used to prune subgraphs of $G_d(x, y)$.

For more details about the PMS algorithms, the readers are referred to the respective papers.

An Experimental Validation of PMS Algorithms

Planted motif search is just one computational model for motifs. An important question is how effective this model is in identifying motifs from real biological data. In fact the same question is relevant for any (computational or other) motif model. In [14], Tompa, et al. have evaluated the performance of 13 different motif finding programs: AlignACE, ANN-Spec, Consensus, GLAM, The Improbizer, MEME, MITRA, MotifSampler, Oligo/dyad-analysis, QuickScore, SeSiMCMC, Weeder and YMF. These programs were evaluated on several biological datasets (for which the motifs were known via experimental techniques) based on many different performance measures. Two of the performance measures employed were sensitivity and specificity. Sensitivity represents the fraction of sites that were correctly predicted and specificity represents the fraction of non-sites that were correctly predicted as non-sites.
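As a small worked illustration of these two measures, the sketch below computes them from sets of sequence positions, using the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). Treating individual positions as the unit of evaluation, and the function name itself, are assumptions made for the example; the exact protocol is given in [14].

```python
def site_level_metrics(predicted, known, total_positions):
    """Return (sensitivity, specificity) for predicted vs. known site positions.

    Assumes at least one known site and at least one non-site position,
    so that neither denominator is zero.
    """
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)   # sites correctly predicted
    fp = len(predicted - known)   # non-sites predicted as sites
    fn = len(known - predicted)   # sites that were missed
    tn = total_positions - tp - fp - fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity
```

For example, with 20 positions in total, known sites at positions 3 to 6 and predictions at positions 1 to 4, two of the four sites are recovered and 14 of the 16 non-sites are left unpredicted.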

In [7], Sharma, et al. have evaluated the performance of PMS algorithms. In particular, they have employed the same 56 datasets that were used by Tompa, et al. [14]. As a result, Sharma, et al. have compared the PMS algorithms with the thirteen programs evaluated in [14]. Several versions of the PMS algorithms have been tested. One of these versions, namely, PMS SumMinD yields an average sensitivity of 28.8% and a specificity of 91.63% on all the 56 datasets. In comparison, the best of the 13 algorithms tested by Tompa, et al. [14], ANN-Spec, has an average sensitivity of 8.7% and a specificity of 98.22%.

Our Motif Search Framework

In addition to the PMS algorithms, we deploy a motif search framework that uses the PMS algorithms as underlying routines. The motif search framework basically works as follows. The user inputs a set of sequences that contain motifs of interest. The framework runs a PMS algorithm (qPMS7 as of now) with different triples of the parameters $(l, d, q)$ and collects all of the output motifs. These motifs are called candidate motifs. Then, it uses a score function that ranks the candidate motifs. The score function measures the significance of a candidate motif based on the probability that it occurs by random chance. Finally, the tool outputs the top 100 motifs with the highest scores. The score of a candidate motif will be high if the probability that it occurs by random chance is low.

Since the run time of PMS algorithms is exponentially dependent on the parameter $d$, i.e., the maximum number of mutations allowed, we let the user set the parameter indirectly through the computational preferences “Quick Search” or “Full Search”. If the “Quick Search” option is chosen, then the parameter $d$ is set to a ‘low’ value (3, specifically). Conversely, if the “Full Search” option is chosen, then the parameter $d$ is set to a higher value (7, specifically).

Identifying Motif Instances in the Input Sequences

Once a motif is found, its instances in the input sequences will be located as follows. For each input sequence, the location of the motif instance is the position at which the motif matches best, i.e., with the fewest mismatches. This location can be found easily by scanning through the entire input sequence.
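The scan described above can be sketched as a single pass over one input sequence. The function name is hypothetical, and the tie-breaking rule (the leftmost best position wins) is an assumption, since the text does not say how ties are resolved.

```python
def best_match(motif, seq):
    """Return (position, mismatches) of the best-matching substring of seq,
    or None if seq is shorter than the motif. Ties go to the leftmost hit."""
    l = len(motif)
    if len(seq) < l:
        return None
    best_pos, best_dist = 0, l + 1
    for i in range(len(seq) - l + 1):
        dist = sum(a != b for a, b in zip(motif, seq[i:i + l]))
        if dist < best_dist:
            best_pos, best_dist = i, dist
    return best_pos, best_dist
```

For the motif `ACGT` and the sequence `TTACCTTT`, the scan reports position 2 (0-based) with one mismatch, corresponding to the instance `ACCT`.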

Techniques to Identify Dyad Motifs

Eskin and Pevzner have presented an algorithm for finding dyad motifs [15]. This algorithm works as follows. Let the input sequences be $S_1, S_2, \ldots, S_n$ and let the length of each sequence be $m$. A dyad is characterized with the parameters $(l_1, l_2, s_1, s_2, d, q)$. Here $l_1$ is the length of the first segment, $l_2$ is the length of the second segment, the length of the gap between the two segments can be in the range $[s_1, s_2]$, and the dyad occurs in at least $q$ out of the $n$ sequences with a Hamming distance of at most $d$. For each input sequence $S_i$, the algorithm generates all the relevant $l$-mers (where $l = l_1 + l_2$). Any such $l$-mer will be such that its prefix of length $l_1$ will be an $l_1$-mer in some input sequence $S_i$, its suffix of length $l_2$ will be an $l_2$-mer in the same sequence $S_i$, the prefix occurs to the left of the suffix, and the length of the gap between the prefix and the suffix is in the range $[s_1, s_2]$. Note that there are $O(mn(s_2 - s_1 + 1))$ such $l$-mers. Let $C$ be this collection of $l$-mers. After having generated these $l$-mers, they use the mismatch tree data structure to identify the $l$-mers that correspond to valid dyads. In particular, any $l$-mer will be output as a dyad if there is a $d$-neighbor of this $l$-mer that occurs in at least $q$ of the input sequences.

We speed up the above algorithm by exploiting the PMS1 algorithm. The improvement works as follows. We generate the $l$-mers for each sequence as in the algorithm of [15]. There are $O(m(s_2 - s_1 + 1))$ $l$-mers for each sequence. Let $C_i$ be the collection of $l$-mers from sequence $S_i$, for $1 \le i \le n$. For each $l$-mer of $C_i$ generate its $d$-neighborhood (i.e., $l$-mers that are within a Hamming distance of $d$ from the $l$-mer), for $1 \le i \le n$. Let $C'_i$ be the collection of $d$-neighbors of all the $l$-mers of $C_i$, for $1 \le i \le n$. We can output $d$-neighbors that are in at least $q$ of these collections. One way of finding such $l$-mers will be with the help of hashing. Another way is to make use of integer sorting. For example, we can sort each $C'_i$ (for $1 \le i \le n$), merge these sorted lists, and go through the merged list to count the number of sequences each such $d$-neighbor occurs in.
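The hashing variant of this idea can be sketched with a counter keyed by candidate string. This is an illustration under assumptions, not the deployed implementation: the names `dyad_lmers`, `find_dyads`, `l1`, `l2`, `s1` and `s2` are chosen for the example (segment lengths and the inclusive gap range), `d` and `q` are the mutation bound and quorum, and the output does not record which gap length produced each candidate.

```python
from collections import Counter
from itertools import combinations, product


def neighborhood(x, d, alphabet="ACGT"):
    """All strings of length len(x) within Hamming distance d of x."""
    result = set()
    for k in range(d + 1):
        for positions in combinations(range(len(x)), k):
            choices = [[c for c in alphabet if c != x[p]] for p in positions]
            for letters in product(*choices):
                y = list(x)
                for p, c in zip(positions, letters):
                    y[p] = c
                result.add("".join(y))
    return result


def dyad_lmers(seq, l1, l2, s1, s2):
    """Concatenate each first segment with every second segment that starts
    after a gap of s1..s2 characters in the same sequence."""
    out = set()
    for i in range(len(seq) - l1 + 1):
        for gap in range(s1, s2 + 1):
            j = i + l1 + gap
            if j + l2 > len(seq):
                break  # larger gaps run past the end as well
            out.add(seq[i:i + l1] + seq[j:j + l2])
    return out


def find_dyads(seqs, l1, l2, s1, s2, d, q, alphabet="ACGT"):
    """Count, per candidate, the number of sequences whose neighbor
    collection contains it; keep candidates seen in at least q sequences."""
    counts = Counter()
    for s in seqs:
        neighbors = set()
        for x in dyad_lmers(s, l1, l2, s1, s2):
            neighbors |= neighborhood(x, d, alphabet)
        counts.update(neighbors)  # a set, so each sequence counts once
    return {y for y, c in counts.items() if c >= q}
```

Building one set of neighbors per sequence before updating the counter ensures that a candidate is counted at most once per sequence, which is what the quorum condition requires.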

Funding Statement

This work has been supported in part by the following grants: NSF 0829916 and NIH R01-LM010101. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Jonassen I, Collins J, Higgins D (1995) Finding flexible patterns in unaligned protein sequences. Protein Science 4: 1587–1595.
  • 2. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, et al. (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Research 37: W202–W208.
  • 3. Neduva V, Russell R (2006) DILIMOT: discovery of linear motifs in proteins. Nucleic Acids Research 34: W350–W355.
  • 4. Davey NE, Shields DC, Edwards RJ (2006) SLiMDisc: short, linear motif discovery, correcting for common evolutionary descent. Nucleic Acids Research 34: 3546–3554.
  • 5. Edwards RJ, Davey NE, Shields DC (2007) SLiMFinder: A probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins. PLoS ONE 2: e967.
  • 6. Lieber DS, Elemento O, Tavazoie S (2010) Large-scale discovery and characterization of protein regulatory motifs in eukaryotes. PLoS ONE 5: e14444.
  • 7. Sharma D, Rajasekaran S, Dinh H (2011) An experimental comparison of PMSPrune and other algorithms for motif search. CoRR abs/1108.5217.
  • 8. Rajasekaran S (2009) Computational techniques for motif search. Frontiers in Bioscience 14: 5052–5065.
  • 9. Pisanti N, Carvalho AM, Marsan L, Sagot MF (2006) RISOTTO: Fast extraction of motifs with mismatches. Proceedings of the 7th Latin American Theoretical Informatics Symposium: 757–768.
  • 10. Dinh H, Rajasekaran S, Davila J (2012) qPMS7: A fast algorithm for finding (l, d)-motifs in DNA and protein sequences. PLoS ONE 7(7): e41425.
  • 11. Dinh H, Rajasekaran S, Kundeti V (2011) PMS5: an efficient exact algorithm for the (l, d)-motif finding problem. BMC Bioinformatics 12(410).
  • 12. Davila J, Balla S, Rajasekaran S (2007) Fast and practical algorithms for planted (l, d) motif search. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 544–552.
  • 13. Rajasekaran S, Balla S, Huang CH (2005) Exact algorithms for planted motif challenge problems. Journal of Computational Biology 12(8): 1117–1128.
  • 14. Tompa M, Li N, Bailey TL, Church GM, Moor BD, et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 23(1): 137–144.
  • 15. Eskin E, Pevzner P (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics S1: 354–363.
