Four ways to encode the sgRNA nucleotide sequence, demonstrated on the four encircled nucleotides (CATA): (A) as a string: this is not compatible with many ML algorithms; (B) as a list of characters: here, each nucleotide is its own ‘feature’, but many ML algorithms require features to be represented as numbers; (C) as a list of numbers: here, each nucleotide has been arbitrarily assigned a value from 0 to 3. However, algorithms that treat features as ‘continuous’ will consider T (3) to be more different from A (0) than from G (2), because of the larger difference between the arbitrarily assigned values; (D) one-hot encoded: here, each nucleotide is represented as four list elements, exactly one of which is ‘hot’ (i.e. 1), depending on the nucleotide. In this example, the first element being hot, i.e. [1, 0, 0, 0], represents an A. In this representation, all nucleotides are equally different from one another.
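The four encodings can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool. The function names are hypothetical. The caption gives only A = 0, G = 2, T = 3 and A = [1, 0, 0, 0], so the mapping C = 1 and the A, C, G, T order of the one-hot positions are assumptions.

```python
# Assumed ordering: the caption fixes A=0, G=2, T=3; C=1 is inferred.
NUCLEOTIDES = "ACGT"


def as_characters(seq):
    """(B) One character per feature."""
    return list(seq)


def as_integers(seq):
    """(C) Arbitrary integer label per nucleotide (A=0, C=1, G=2, T=3)."""
    return [NUCLEOTIDES.index(n) for n in seq]


def as_one_hot(seq):
    """(D) Four elements per nucleotide, exactly one of which is 1."""
    return [[1 if n == ref else 0 for ref in NUCLEOTIDES] for n in seq]


def n_differences(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


seq = "CATA"                  # (A) the raw string
chars = as_characters(seq)    # ['C', 'A', 'T', 'A']
ints = as_integers(seq)       # [1, 0, 3, 0]
one_hot = as_one_hot(seq)     # [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0]]

# Integer labels imply an ordering that has no biological meaning:
a, g, t = as_integers("AGT")
print(abs(t - a), abs(t - g))  # T looks further from A than from G

# One-hot vectors differ in exactly two positions for any pair of
# distinct nucleotides, so no pair appears closer than another.
oh_a, oh_g, oh_t = as_one_hot("AGT")
print(n_differences(oh_t, oh_a), n_differences(oh_t, oh_g))
```

In practice the per-nucleotide vectors are often concatenated into one flat feature vector of length 4 × (sequence length) before being passed to a model; whether to flatten depends on the algorithm used.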