For each motif that it discovers in the training set,
MEME prints the following information:
-
Summary Line
This line gives the width (`width'),
number of occurrences in the training set (`sites'), log likelihood
ratio (`llr') and E-value of the motif.
Each motif describes a pattern of a fixed width--no gaps are allowed in
MEME motifs.
MEME numbers the motifs consecutively from one as it finds them.
MEME usually finds the most statistically significant (low E-value)
motifs first.
The statistical significance of a motif is based on its log likelihood ratio,
its width and number of occurrences, the background letter frequencies
(given in the command line summary), and
the size of the training set. The E-value is an
estimate of the expected number of motifs with the given log likelihood
ratio (or higher), and with the same width and number of occurrences,
that one would find in a similarly sized set of random sequences.
(In random sequences each position is independent with letters chosen
according to the background letter frequencies.) The log likelihood
ratio is the logarithm of the ratio of two probabilities: the probability
of the motif occurrences given the motif model (the likelihood given the
motif) and their probability given the background model (the likelihood
given the null model). (Normally the background model is a 0-order Markov model
using the background letter frequencies, but higher order Markov models
may be specified via the -bfile option to MEME.)
Clicking on the buttons to the left of the motif summary line
takes you to the previous motif (P) or next motif (N).
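The log likelihood ratio described above can be illustrated with a minimal sketch. It assumes a 0-order background model, natural logarithms, and strictly positive motif probabilities; the motif, the sites and the function name `log_likelihood_ratio` are hypothetical and are not taken from MEME itself.

```python
import math


def log_likelihood_ratio(sites, pspm, background):
    """Sum ln(p / f) over every letter of every site.

    pspm is a list of columns; each column maps a letter to its
    probability at that motif position (assumed > 0 here).
    background maps a letter to its 0-order background frequency.
    """
    llr = 0.0
    for site in sites:
        for column, letter in zip(pspm, site):
            llr += math.log(column[letter] / background[letter])
    return llr


# Hypothetical width-2 DNA motif with three identical sites.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pspm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
sites = ["AT", "AT", "AT"]
llr = log_likelihood_ratio(sites, pspm, background)
```

The sketch only shows how the ratio is formed; the E-value calculation, which also depends on the width, the number of occurrences and the training set size, is not reproduced here.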
-
Simplified Position-Specific Probability Matrix
MEME motifs are represented by position-specific probability matrices
that specify the probability of each possible letter appearing at each
possible position in an occurrence of the motif. In order to make it easier
to see which letters are most likely in each of the columns of the
motif, the simplified motif shows the letter probabilities multiplied by 10
and rounded to the nearest integer ("a" means 10). Zeros are replaced by ":"
(the colon) for readability.
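As a rough illustration of the encoding just described, the following sketch converts a probability matrix into simplified rows. The row-per-letter layout, the round-half-up rule and the names `simplify` and `simplified_rows` are assumptions made for the example.

```python
def simplify(probability):
    """Encode one probability as a single character."""
    digit = int(probability * 10 + 0.5)  # round half up (an assumption)
    if digit == 0:
        return ":"  # zeros are shown as a colon
    if digit == 10:
        return "a"  # "a" stands for 10
    return str(digit)


def simplified_rows(pspm, alphabet):
    """Return one string per letter, with one character per motif column."""
    return {
        letter: "".join(simplify(column[letter]) for column in pspm)
        for letter in alphabet
    }


# Hypothetical two-column DNA motif.
example = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.32, "C": 0.0, "G": 0.08, "T": 0.6},
]
rows = simplified_rows(example, "ACGT")
```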
-
Information Content Diagram
The information content diagram provides
an idea of which positions in the motif are most highly conserved.
Each column (position) in a motif can be characterized by the amount of
information it contains (measured in bits). Highly conserved positions
in the motif have high information; positions where all letters are equally
likely have low information. (The information content is relative to
the background letter frequencies which are given in the
command line summary section.)
The diagram is printed so that each column lines up with the same column in
the simplified position-specific probability matrix above it.
Columns in the information content diagram are colored according to the
majority category of the letters occurring in that column of the alignment.
If no letter category has frequency above 0.5, the column in the diagram
is colored black. For DNA sequences, the letter categories contain one letter
each. For proteins, the categories are based on the biochemical properties
of the various amino acids. The categories and their colors are:
NUCLEIC ACIDS | COLOR |
A | RED |
C | BLUE |
G | ORANGE |
T | GREEN |

AMINO ACIDS | COLOR | PROPERTIES |
A, C, F, I, L, V, W and M | BLUE | Most hydrophobic [Kyte and Doolittle, 1982] |
NQST | GREEN | Polar, non-charged, non-aliphatic residues |
DE | MAGENTA | Acidic |
KR | RED | Positively charged |
H | PINK | |
G | ORANGE | |
P | YELLOW | |
Y | TURQUOISE | |
J. Kyte and R. Doolittle, 1982.
"A Simple Method for Displaying the Hydropathic Character of a Protein",
J. Mol. Biol. 157, 105-132.
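The coloring rule can be sketched as follows. The category tables are copied from the lists above; the function name `column_color` and the representation of a column as a letter-to-frequency dictionary are assumptions made for illustration.

```python
# Letter categories and colors, taken from the tables above.
DNA_CATEGORIES = {"RED": "A", "BLUE": "C", "ORANGE": "G", "GREEN": "T"}
PROTEIN_CATEGORIES = {
    "BLUE": "ACFILVWM",
    "GREEN": "NQST",
    "MAGENTA": "DE",
    "RED": "KR",
    "PINK": "H",
    "ORANGE": "G",
    "YELLOW": "P",
    "TURQUOISE": "Y",
}


def column_color(column, categories):
    """Return the color of the category whose letters exceed 0.5 in total.

    column maps each letter to its frequency in one alignment column.
    If no category has frequency above 0.5, the column is black.
    """
    for color, letters in categories.items():
        if sum(column.get(letter, 0.0) for letter in letters) > 0.5:
            return color
    return "BLACK"
```

For example, a protein column with A at 0.3, L at 0.3 and K at 0.4 would be colored BLUE, because the hydrophobic category has a combined frequency of 0.6.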
Summing the information content for each position in the motif gives
the total information content of the motif (shown in parentheses to the
left of the diagram). The total information content (in bits) is
approximately equal to the log likelihood ratio divided by the product of
the number of occurrences and ln(2), that is, llr / (sites * ln(2)).
The total information content gives a measure of
the usefulness of the motif for database searches.
For a motif to be useful for database searches, it must as a rule contain at
least log_2(N) bits of information
where N is the number of sequences in the database being searched.
For example, to effectively search a database containing 100,000 sequences
for occurrences of a single motif, the motif should have an IC of at
least 16.6 bits. Motifs with lower information content are still useful when a
family of sequences shares more than one motif since they can be combined
in multiple motif searches (using MAST).
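The quantities in this section can be sketched in a few lines. The sketch assumes that the information content of a column is the relative entropy (in bits) of the column with respect to the background frequencies, which is one common reading of "relative to the background"; the function names are hypothetical.

```python
import math


def column_information(column, background):
    """Relative entropy of one motif column, in bits (assumed definition)."""
    return sum(
        p * math.log2(p / background[letter])
        for letter, p in column.items()
        if p > 0  # letters with zero probability contribute nothing
    )


def total_information(pspm, background):
    """Sum of the per-column information content."""
    return sum(column_information(column, background) for column in pspm)


def information_from_llr(llr, sites):
    """Approximate total information (bits) from llr and site count."""
    return llr / (sites * math.log(2))


def min_bits_for_database(n_sequences):
    """Rule-of-thumb minimum information for searching n_sequences."""
    return math.log2(n_sequences)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
```

With a uniform DNA background, a fully conserved column carries 2 bits and a column where all letters are equally likely carries 0 bits, which matches the qualitative description above.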
-
Multilevel Consensus Sequence
The multilevel consensus sequence corresponding to the motif is an aid in
remembering and understanding the motif. It is calculated from the motif
position-specific probability matrix as follows.
Separately for each column of the motif,
the letters in the alphabet are sorted in decreasing order by the probability
with which they are expected to occur in that position of motif occurrences.
The sorted letters are then printed vertically with the most probable letter
on top. Only letters with probabilities of 0.2 or higher at that position in
the motif are printed. As an example, the multilevel consensus sequence of
motif 1 in the sample output is:
Multilevel TTATGTGAACGACGTCACACT
consensus  AA  T A G A GA     AA
sequence           T C TT     T
This multilevel consensus sequence says several things about the motif.
First, the most likely form of the motif
can be read from the top line as
TTATGTGAACGACGTCACACT.
Second, only the letter A has probability 0.2 or higher in
position 3 of the motif, both T and A have probability
0.2 or higher in position 1, and so on.
Third, a rough approximation of the motif can be made by converting the
multilevel consensus sequence into a
regular expression for the motif:
[TA][TA]AT[GT][T][GA]A[AGT]C[GAC]A[CGT][GAT]TCACA[CAT][TA]
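The procedure above can be sketched as follows. This is an illustrative reading of the description, not MEME's own code: the function names are hypothetical, ties are broken by the order of letters in the column, and every column is assumed to have at least one letter with probability 0.2 or higher (always true for a 4-letter alphabet).

```python
def multilevel_consensus(pspm, threshold=0.2):
    """For each column, list letters with p >= threshold, most probable first."""
    levels = []
    for column in pspm:
        ranked = sorted(column.items(), key=lambda item: -item[1])
        levels.append([letter for letter, p in ranked if p >= threshold])
    return levels


def consensus_lines(levels):
    """Print-ready lines: line 1 holds the top letters, line 2 the next, etc."""
    depth = max(len(level) for level in levels)
    return [
        "".join(level[i] if i < len(level) else " " for level in levels).rstrip()
        for i in range(depth)
    ]


def to_regex(levels):
    """Single letters stay bare; alternatives go in square brackets."""
    return "".join(
        level[0] if len(level) == 1 else "[" + "".join(level) + "]"
        for level in levels
    )


# Hypothetical three-column DNA motif.
example = [
    {"A": 0.3, "C": 0.0, "G": 0.0, "T": 0.7},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.1, "C": 0.1, "G": 0.5, "T": 0.3},
]
levels = multilevel_consensus(example)
```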
-
Occurrences of the Motif
MEME displays the occurrences (sites) of the motif in the training set.
The sites are shown aligned with each other, and the ten sequence
positions preceding and following each site are also shown.
Each site is identified by the name of the sequence where it occurs,
the strand (if both strands of DNA sequences are being used), and the
position in the sequence where the site begins. When the DNA strand
is specified, `+' means the sequence in the training set,
and `-' means the reverse complement of the training set sequence.
(For `-' strands, the `start' position is actually the position on the
positive strand where the site ends.)
The sites are listed in order of increasing p-value, so the most
statistically significant sites come first.
The p-value of a site is computed from the match
score of the site with the position specific scoring matrix
for the motif. The p-value gives the probability of a random string
(generated from the background letter frequencies) having the same match
score or higher. (This is referred to as the position p-value
by the MAST algorithm.)
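One standard way to obtain such a p-value is to build the exact distribution of match scores for a random string by dynamic programming over the columns of the scoring matrix. The sketch below takes that approach under a 0-order background model with integer scores; it is an illustration only, not the routine used by MEME or MAST, and all names are hypothetical.

```python
from collections import defaultdict


def score_distribution(pssm, background):
    """Exact distribution of the match score of a random motif-width string."""
    distribution = {0: 1.0}
    for row in pssm:  # one row per motif position
        updated = defaultdict(float)
        for total, probability in distribution.items():
            for letter, score in row.items():
                updated[total + score] += probability * background[letter]
        distribution = dict(updated)
    return distribution


def site_score(site, pssm):
    """Match score of a site: sum of the matrix entries for its letters."""
    return sum(row[letter] for row, letter in zip(pssm, site))


def site_p_value(site, pssm, background):
    """Probability that a random string scores at least as high as the site."""
    observed = site_score(site, pssm)
    distribution = score_distribution(pssm, background)
    return sum(p for total, p in distribution.items() if total >= observed)


# Hypothetical width-2 scoring matrix that favors "AA".
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = [{"A": 100, "C": -100, "G": -100, "T": -100}] * 2
```

Because the scores are integers, the number of distinct totals stays small, so the table remains manageable even for wide motifs.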
-
Block Diagrams of Motif Occurrences
The occurrences of the motif in the training set sequences are
shown with MAST-style block diagrams. One diagram is printed for each
sequence showing all the occurrences of the motif in that sequence.
The sequences are sorted by the lowest p-value among all
occurrences of the motif in a given sequence.
(The p-value of an occurrence is the probability of a single
random subsequence the length of the motif,
generated according to the 0-order background model, having a score
at least as high as the score of the occurrence.)
When the DNA strand is specified, `+' means the motif appears from left to
right on the sequence, and `-' means the motif appears from right to left
on the complementary strand.
A sequence position scale is shown at the end of each table of block
diagrams. Very long sequences are shown with thick lines connecting the
motifs and are not drawn to scale.
-
Motif in BLOCKS format or FASTA format
For use with
BLOCKS tools,
MEME prints the occurrences of the motif in BLOCKS format.
You can convert these blocks to
PSSMs (position-specific scoring matrices), LOGOS (color representations
of the motifs) or phylogeny trees, or search them against a database of other
blocks, by pasting everything from the "BL" line to the "//" line (inclusive)
into the
Multiple Alignment Processor.
If you include the -print_fasta switch on the command line, MEME prints
the motif sites in FASTA format instead of BLOCKS format.
-
Position-Specific Scoring Matrix
The position-specific scoring matrix corresponding to the motif is printed
for use by database search programs such as MAST. This matrix is a
log-odds matrix calculated
by taking 100 times the log (base 2) of the ratio p/f at each position in
the motif where p is the probability of a particular letter at that
position in the motif, and f is the background frequency of the
letter (given in the command line summary section).
This is the same matrix that is used above in computing the p-values
of the occurrences of the motif in the Occurrences of the Motif and
Block Diagrams of Motif Occurrences sections.
The scoring matrix is printed "sideways"--columns
correspond to the letters in the alphabet (in the same order as shown in
the simplified motif) and rows correspond to the positions of the motif,
position one first. The scoring matrix is preceded by a line starting with
"log-odds matrix:" and containing the length of the alphabet, width
of the motif, number of characters in the training set, the scoring
threshold (obsolete) and the motif E-value.
Note: The probability p used to compute the PSSM
is not exactly the same as the corresponding value in the
Position Specific Probability Matrix (PSPM).
The values of p used to compute the PSSM take
into account the motif prior, whereas the values in the PSPM are just
the observed frequencies of letters in the motif sites.
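The log-odds calculation can be sketched as below. The probabilities passed in are assumed to be the prior-adjusted values of p described in the note (and therefore strictly positive); rounding each entry to the nearest integer and the function name `log_odds_matrix` are assumptions made for the example.

```python
import math


def log_odds_matrix(probabilities, background):
    """Return 100 * log2(p / f) for every letter at every motif position.

    probabilities is a list of columns mapping letter -> p (must be > 0);
    background maps letter -> f. Entries are rounded to integers here.
    """
    return [
        {
            letter: round(100 * math.log2(p / background[letter]))
            for letter, p in column.items()
        }
        for column in probabilities
    ]


# Hypothetical single-column example with a uniform DNA background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
column = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
matrix = log_odds_matrix([column], background)
```

A letter that is twice as frequent in the motif as in the background scores 100, a letter at background frequency scores 0, and a letter at half the background frequency scores -100.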
-
Position-Specific Probability Matrix
The motif itself is a position-specific probability matrix giving,
for each position in the pattern, the observed frequency
("probability") of each possible letter.
The probability matrix is printed "sideways"--columns
correspond to the letters in the alphabet (in the same order as shown in
the simplified motif) and rows correspond to the positions of the motif,
position one first.
The motif is preceded by a line starting with
"letter-probability matrix:" and containing the length of the alphabet, width
of the motif, number of occurrences of the motif, and the E-value of the
motif.
Note: Earlier versions
of MEME gave the posterior probabilities--the probability after applying
a prior on letter frequencies--rather than the observed frequencies.
These versions of MEME also gave the number of possible
positions for the motif rather than the actual number of occurrences.
The output from these earlier versions of MEME can be distinguished
by "n=" rather than "nsites=" in the line preceding the matrix.
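A small parser for the line preceding the matrix might look like the sketch below. Only the "letter-probability matrix:" prefix and the "nsites=" and "n=" keys are stated in the text above; the other key names in the sample line ("alength=", "w=", "E=") and the sample values are assumptions, as are the function names.

```python
import re

PREFIX = "letter-probability matrix:"


def parse_matrix_header(line):
    """Split the header line into a dictionary of key -> string value."""
    line = line.strip()
    if not line.startswith(PREFIX):
        raise ValueError("not a letter-probability matrix header")
    # Each field is written as "key= value"; keep values as strings.
    return dict(re.findall(r"(\w+)=\s*(\S+)", line[len(PREFIX):]))


def is_earlier_format(fields):
    """Earlier MEME versions wrote "n=" instead of "nsites="."""
    return "n" in fields and "nsites" not in fields


# Hypothetical header line; key names other than nsites/n are assumed.
sample = "letter-probability matrix: alength= 4 w= 21 nsites= 17 E= 4.1e-009"
fields = parse_matrix_header(sample)
```

Values are kept as strings so that the caller can decide how to convert them, for example `int(fields["nsites"])` or `float(fields["E"])`.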
-
Regular Expression
This is the multilevel consensus expressed as
a regular expression for convenience. Regular expressions can
be used for searching against sequences (using, for example,
PatMatch),
but the search accuracy will usually be better with the PSSM (using,
for example,
MAST).
MEME regular expressions are interpreted as follows:
single letters match that letter; groups of letters in square brackets
match any of the letters in the group.
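Because this syntax uses only literal letters and bracketed groups, it can be passed unchanged to most regular expression engines. The sketch below uses Python's standard `re` module with the regular expression of motif 1 from the sample output; the test sequence and the function name `find_matches` are hypothetical, and only the given strand is searched.

```python
import re

# Regular expression for motif 1, as shown in the sample output.
MOTIF_REGEX = "[TA][TA]AT[GT][T][GA]A[AGT]C[GAC]A[CGT][GAT]TCACA[CAT][TA]"


def find_matches(pattern, sequence):
    """Return (1-based start, matched text) for every match.

    The pattern is wrapped in a lookahead so that overlapping
    matches are reported as well.
    """
    lookahead = re.compile("(?=(" + pattern + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# Hypothetical sequence containing the top-line consensus of motif 1.
sequence = "CCC" + "TTATGTGAACGACGTCACACT" + "GG"
matches = find_matches(MOTIF_REGEX, sequence)
```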
-
Motif Summary Tiling
The motif summary tiling is done using the same algorithm as used
by MAST.
The motif occurrences shown in the motif summary
may not be exactly the same as those reported in each motif section,
because only occurrences with a position p-value of 0.0001 or less that
don't overlap other, more significant motif occurrences are shown.
The format of the machine readable motif-summary is:
[sequence_name combined_p-value number_of_motif_occurrences [motif_number start_of_motif position_p-value]+]+
See the documentation for
MAST output for the definition of position and
combined p-values.
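The machine readable format given above can be read with a simple token-based parser. The sketch assumes that fields are separated by whitespace and that sequence names contain no whitespace; the sample text and the function name `parse_motif_summary` are hypothetical.

```python
def parse_motif_summary(text):
    """Parse records of the form:

    sequence_name combined_p-value number_of_motif_occurrences
    followed by (motif_number start_of_motif position_p-value) per occurrence.
    """
    tokens = text.split()
    records = []
    i = 0
    while i < len(tokens):
        name = tokens[i]
        combined_p_value = float(tokens[i + 1])
        count = int(tokens[i + 2])
        i += 3
        occurrences = []
        for _ in range(count):
            occurrences.append(
                {
                    "motif": int(tokens[i]),
                    "start": int(tokens[i + 1]),
                    "p_value": float(tokens[i + 2]),
                }
            )
            i += 3
        records.append(
            {
                "sequence": name,
                "combined_p_value": combined_p_value,
                "occurrences": occurrences,
            }
        )
    return records


# Hypothetical summary with two sequences.
sample = "seq1 1.2e-05 2 1 15 3.4e-07 2 60 1.0e-05 seq2 4.0e-03 1 1 8 2.2e-05"
records = parse_motif_summary(sample)
```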