Zhang et al. 10.1073/pnas.0409240102.

Supporting Information

Files in this Data Supplement:

Supporting Text
Supporting Table 2

Supporting Text

To test our method, we randomly generate K patterns of various lengths and insert mutated copies of each pattern into N random sequences of length L. The number of copies inserted in each sequence follows a Poisson distribution. To test the performance of our method for various sequence sizes, pattern lengths, and mutation levels, we generate three patterns of different lengths: m₁ = 63 bps, m₂ = 42 bps, and m₃ = 21 bps. Each sequence contains on average one copy of each pattern. Table 2 shows the results.
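The simulation setup described above can be sketched as follows. This is a minimal illustration and not the authors' code. The text does not specify the mutation model, so the sketch assumes substitution-only mutation in which each base changes with probability 1 - identity; it also assumes copies are spliced into the background sequence rather than overwriting it. The function names (random_dna, mutate, poisson, simulate) are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"

def random_dna(length, rng):
    """Uniform random DNA string of the given length."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def mutate(pattern, identity, rng):
    """Assumed model: substitute each base with probability 1 - identity."""
    out = []
    for base in pattern:
        if rng.random() < 1.0 - identity:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)

def poisson(lam, rng):
    """Draw a Poisson-distributed count with mean lam (Knuth's method)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate(n_seqs, seq_len, pattern_lengths, identity, seed=0):
    """Return (patterns, sequences, copies inserted per pattern)."""
    rng = random.Random(seed)
    patterns = [random_dna(m, rng) for m in pattern_lengths]
    inserted = [0] * len(patterns)
    sequences = []
    for _ in range(n_seqs):
        background = random_dna(seq_len, rng)
        copies = []  # (insertion point in the background, mutated copy)
        for idx, pattern in enumerate(patterns):
            # On average one copy of each pattern per sequence.
            for _ in range(poisson(1.0, rng)):
                pos = rng.randrange(seq_len + 1)
                copies.append((pos, mutate(pattern, identity, rng)))
                inserted[idx] += 1
        # Splice copies in at sorted background positions so that no
        # inserted copy is split by a later insertion.
        copies.sort(key=lambda c: c[0])
        pieces, prev = [], 0
        for pos, copy in copies:
            pieces.append(background[prev:pos])
            pieces.append(copy)
            prev = pos
        pieces.append(background[prev:])
        sequences.append("".join(pieces))
    return patterns, sequences, inserted
```

For example, simulate(10, 2000, [63, 42, 21], 0.9) produces ten sequences with the three pattern lengths used in the text; the background length of 2,000 is an illustrative value only. Because the number of copies is random, the per-pattern totals vary from run to run, as they do across the rows of the table.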

We observe that most false negatives come from the short pattern of 21 bps (m₃), whose copies are hard to distinguish from random hits. When the copy number of a pattern is small and each copy is degenerate, the consensus can be shorter than the original pattern (a case marked by * in Table 2), so the pattern has to be fully recovered by more than one consensus. This is a common issue in finding degenerate conserved patterns.

One significant observation is that most consensus patterns constructed by our method are identical to the original patterns distributed among the sequences before mutation. Only a few random letters flank the ends of the consensus, owing to the loose threshold in our method. Consequently, one can use the consensus sequence output by our method as a query sequence in other programs, such as BLAST, to search for similar segments in databases. Unlike motif finding, BLAST would also allow partial matches to a motif to be found. Our experience, however, is that methods like BLAST may lose sensitivity on degenerate patterns as a trade-off for their fast search process.

Table 2. Local multiple alignment simulation

Sequences			Patterns inserted				Patterns found
N	L, K	Identity, %	Total	m₁	m₂	m₃	Total	m₁	m₂	m₃	FP	FN
10	2	90	27	10	10	7	26	10	10	6	0	1
30	2	90	77	33	19	25	76	33	19	24	0	1
10	20	90	33	10	9	14	29	10	9	10	0	4
30	20	90	82	26	24	32	80	26	24	30	0	2
10	2	80	33	8	8	17	31	8	8	15	0	2
30	2	80	92	29	28	35	87	29	28	30	0	5
10	20	80	22	6	5	11	19	5*	5*	9	0	3
30	20	80	81	38	20	23	76	38	20	18	0	5

N, no. of sequences; L, average sequence length; Identity, pairwise identity between conserved regions; FP, no. of false positives; FN, no. of false negatives. Patterns marked by * are recovered by two shorter consensuses. The scoring function is match = 1; mismatch = –1; gap open = –3; gap extension = –1.
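As an illustration of how the scoring parameters in the legend act on a pair of sequences, the following sketch computes a pairwise local alignment score with affine gap penalties (Smith-Waterman with Gotoh's recurrences). It is a standard textbook technique and is not claimed to be the multiple-alignment procedure used in the paper. It assumes that a gap of length k costs gap_open + k × gap_extend; the legend does not state whether the first gap position is also charged the extension penalty. The function name local_alignment_score is hypothetical.

```python
def local_alignment_score(a, b, match=1, mismatch=-1,
                          gap_open=-3, gap_extend=-1):
    """Best local alignment score of a and b under affine gap penalties.

    Assumed convention: a gap of length k costs gap_open + k * gap_extend.
    """
    neg_inf = float("-inf")
    m = len(b)
    prev_h = [0] * (m + 1)        # best score of an alignment ending at (i-1, j)
    prev_f = [neg_inf] * (m + 1)  # same, but ending with a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * (m + 1)
        cur_f = [neg_inf] * (m + 1)
        e = neg_inf               # best score ending with a gap in a
        for j in range(1, m + 1):
            # Open a new gap or extend an existing one.
            e = max(cur_h[j - 1] + gap_open + gap_extend, e + gap_extend)
            cur_f[j] = max(prev_h[j] + gap_open + gap_extend,
                           prev_f[j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores never drop below zero.
            cur_h[j] = max(0, prev_h[j - 1] + s, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best
```

Under this convention, two identical 8-letter sequences score 8, and bridging a 6-letter insertion between two 10-letter matching blocks scores 20 - (3 + 6) = 11, which is only slightly better than aligning either block alone (10). With penalties of this size, a gapped alignment is preferred only when both flanks are well conserved.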