Table 5.
Building blocks | N-grams | patterns | motifs | binary profiles | Top-n-gram-combine
Number of "words" | 8000 | 8000 | 3231 | 1087 | 420
The "combine" suffix denotes Top-n-grams that combine Top-1-grams and Top-2-grams. N-grams are the set of all possible subsequences of fixed length 3, so the N-gram vocabulary contains 8000 (20³) words [32]. Patterns are extracted by TEIRESIAS [38], yielding 71009 patterns in total [32]. Through χ² selection [34], 8000 of these patterns are selected as the characteristic words [32]. The MEME/MAST system [40] is used to discover motifs and search databases; in total, 3231 motifs are extracted [32]. An optimized probability threshold of 0.13 is used to convert the protein sequence frequency profiles into binary profiles, yielding 1087 words [28]. Top-1-grams and Top-2-grams have 20 and 400 words, respectively, so Top-n-gram-combine has 420 words in total (20 + 400).
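The vocabulary sizes for N-grams and Top-n-gram-combine follow directly from the 20-letter amino acid alphabet, and the arithmetic can be checked with a minimal sketch. The names `AMINO_ACIDS` and `vocabulary` are illustrative and not taken from the cited works. Treating Top-2-grams as ordered pairs of amino acids is an assumption made here because it matches the stated count of 400.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def vocabulary(k):
    """Return all possible words of length k over the amino acid alphabet."""
    return ["".join(letters) for letters in product(AMINO_ACIDS, repeat=k)]


# N-grams: all subsequences of fixed length 3, i.e. 20**3 words.
n_grams = vocabulary(3)

# Assumption: Top-1-grams and Top-2-grams are counted as all single letters
# and all ordered letter pairs, which gives 20 and 400 words respectively.
top_1_grams = vocabulary(1)
top_2_grams = vocabulary(2)
top_n_gram_combine = top_1_grams + top_2_grams

print(len(n_grams), len(top_1_grams), len(top_2_grams), len(top_n_gram_combine))
```

The counts for patterns, motifs, and binary profiles cannot be derived this way, since they depend on the data and on the selection thresholds reported in the cited works.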