2011 Oct 24;12:409. doi: 10.1186/1471-2105-12-409

Figure 1.

A toy example to illustrate the encoding schemes for protein sequences. Given a toy sequence over a two-letter alphabet, k-mer based methods, denoted by K, count the number of occurrences of each k-mer in the sequence. Here k = 2. The counting process is represented as a matrix in which the rows represent the first letter of each 2-mer and the columns represent the second letter. The dimension of the resultant vector is 2^2 = 4. If k = 3, the dimension will be 2^3 = 8. For real protein sequences, which use a 20-letter alphabet, the dimension will be 20^3 = 8,000. Segmentation based methods, denoted by P, first divide the sequence evenly into p pieces and then count the number of each letter in each piece. Here p = 2. The dimension of the resultant vector is 2 × 2 = 4. If p = 3, the dimension will be 2 × 3 = 6. For real protein sequences, the dimension will be 20 × 3 = 60. Quantile based methods, denoted by Q, record the positions of q quantiles of each letter instead of the number of each letter. Here q = 2, and the first and the median positions of each letter are recorded.
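
The three encodings described in the caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the toy sequence `ABBABA`, the use of overlapping k-mer windows, the 1-based positions, the value 0 for a letter that never occurs, and the rule used to pick the quantile positions (the first occurrence, then occurrences at evenly spaced ranks) are all assumptions made for this example.

```python
from itertools import product


def kmer_counts(seq, alphabet, k):
    """K encoding: count each k-mer over `alphabet` (overlapping windows assumed)."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)  # dimension is len(alphabet) ** k
    for i in range(len(seq) - k + 1):
        vec[index[seq[i:i + k]]] += 1
    return vec


def segment_counts(seq, alphabet, p):
    """P encoding: split `seq` into p nearly equal pieces, count letters per piece."""
    n = len(seq)
    vec = []  # dimension is len(alphabet) * p
    for j in range(p):
        piece = seq[j * n // p:(j + 1) * n // p]
        vec.extend(piece.count(a) for a in alphabet)
    return vec


def quantile_positions(seq, alphabet, q):
    """Q encoding: for each letter, record q positions among its occurrences.

    Assumed rule: rank j * m // q among the m occurrences (1-based positions),
    which gives the first and the (upper) median position when q = 2.
    A letter that never occurs is encoded as 0 (an assumption).
    """
    vec = []  # dimension is len(alphabet) * q
    for a in alphabet:
        pos = [i + 1 for i, c in enumerate(seq) if c == a]
        for j in range(q):
            vec.append(pos[j * len(pos) // q] if pos else 0)
    return vec


if __name__ == "__main__":
    toy = "ABBABA"  # hypothetical toy sequence, not the one drawn in the figure
    print(kmer_counts(toy, "AB", 2))         # order: AA, AB, BA, BB
    print(segment_counts(toy, "AB", 2))      # order: A, B in piece 1, then piece 2
    print(quantile_positions(toy, "AB", 2))  # order: first, median of A, then of B
```

With a 20-letter amino acid alphabet, the same functions return vectors of length 20^3 = 8,000 for k = 3 and 20 × 3 = 60 for p = 3, which matches the dimensions quoted in the caption.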