Table 1.
k-mer length | Window | Tokenized k-mers | Vectorization |
---|---|---|---|
3 | 3 | ATC GCG TAC GAT CCG | 0321 3412 4532 4214 |
4 | 4 | ATCG CGTA CGAT | 0123 3412 4532 |
5 | 5 | ATCGC GTACG ATCCG | 4124 5124 2134 |
4 | 2 | ATCG CGCG CGTA TACG CGAT ATCC | 2563 3124 4236 3578 2145 |
4 | 3 | ATCG GCGT TACG GATC | 4252 5134 2136 3451 2411 |
The DNA sequence ‘ATCGCGTACGATCCG’ is cut into k-mers, each mapped to a vector, for k-mer lengths (3, 4, 5, 4, 4) and windows (3, 4, 5, 2, 3), respectively.
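
The tokenization in Table 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `kmer_tokenize` is hypothetical, and it assumes that the window is the step between the start positions of consecutive k-mers and that a trailing fragment shorter than the k-mer length is dropped. Both assumptions are inferred from the Tokenized column. The vectorization step is not reproduced, since the mapping from k-mers to vectors is not specified here.

```python
def kmer_tokenize(seq, k, step):
    """Cut `seq` into k-mers of length `k`, advancing `step` characters each time.

    Assumption: a trailing fragment shorter than `k` is discarded.
    """
    if k <= 0 or step <= 0:
        raise ValueError("k and step must be positive")
    # The last valid start position is len(seq) - k.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


if __name__ == "__main__":
    seq = "ATCGCGTACGATCCG"
    # (k-mer length, window) pairs taken from Table 1.
    for k, step in [(3, 3), (4, 4), (5, 5), (4, 2), (4, 3)]:
        print(k, step, " ".join(kmer_tokenize(seq, k, step)))
```

Under these assumptions, each setting yields the k-mers listed in the Tokenized column; when the window is smaller than the length, consecutive k-mers overlap.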