TABLE 1.
Common ways of encoding DNA sequences.
Encoding method | Features |
Sequential encoding | This method encodes each base as a number. For example, change [A,T,G,C] to [0.25, 0.5, 0.75, 1.0], and any other character can be recorded as zero. |
One-hot encoding | This method is widely used in deep learning methods. For example, [A,T,G,C] will become [0,0,0,1], [0,0,1,0], [0,1,0,0], [1,0,0,0]. These coded vectors can be connected or turned into a two-dimensional array. |
K-mer encoding | First take a longer biological sequence and decompose it into k-length overlapping fragments. For example, if we use a segment of length 6, “ATGCATGCA” will become: “ATGCAT,” “TGCATG,” “GCATGC,” “CATGCA.” |