Table 2.
Operation | Detail | Probability |
---|---|---|
Substitute | A → C, C → G, G → T, T → A | 0.31 |
Substitute | A → T, C → T, G → A, T → C | 0.31 |
Substitute | A → G, C → A, G → C, T → G | 0.31 |
Insert | 1 N | 0.056 |
Insert | 2 N's | 0.0041 |
Insert | 3 N's | 0.0016 |
Insert | 4 N's | 0.00069 |
Insert | 5 N's | 0.00038 |
Insert | 6 N's | 0.00041 |
Insert | 7 N's | 0.00030 |
Insert | 8 N's | 0.00027 |
Insert | 9 N's | 0.00056 |
Insert | ≥ 10 N's | 0.0019 |
Insert | A | 0.001225 |
Insert | C | 0.001225 |
Insert | G | 0.001225 |
Insert | T | 0.001225 |
Delete | | 0.0049 |
The probabilities are empirically estimated by aligning Chromosome 1 fragments from GAhum against the human reference (hg18). Because the reference base is known, the 12 possible substitutions are encoded using only three symbols. Insertions of A, C, G, or T and deletions contribute an additional five symbols. Runs of Ns are encoded with a distinct symbol for each run length from 1 to 9; here we use a single symbol for runs of 10 or more Ns, which gives a lower bound on the entropy of the distribution.
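
As an illustration of the two ideas in these notes, the following Python sketch transcribes the table, shows how a substitution symbol plus the reference base recovers the read base, and computes the Shannon entropy of the symbol distribution. The symbol labels (`S1`, `N1`, `InsA`, `Del`, and so on) and the function names are hypothetical and chosen only for this example. The renormalization step is also an assumption: the rounded probabilities in the table sum to slightly more than 1.

```python
import math

# Substitution symbols transcribed from Table 2.
# For each (hypothetical) symbol label: reference base -> base seen in the read.
SUBSTITUTION = {
    "S1": {"A": "C", "C": "G", "G": "T", "T": "A"},
    "S2": {"A": "T", "C": "T", "G": "A", "T": "C"},
    "S3": {"A": "G", "C": "A", "G": "C", "T": "G"},
}

# Probabilities transcribed from Table 2 (labels are illustrative only).
PROBABILITIES = {
    "S1": 0.31, "S2": 0.31, "S3": 0.31,
    "N1": 0.056, "N2": 0.0041, "N3": 0.0016, "N4": 0.00069, "N5": 0.00038,
    "N6": 0.00041, "N7": 0.00030, "N8": 0.00027, "N9": 0.00056,
    "N10+": 0.0019,
    "InsA": 0.001225, "InsC": 0.001225, "InsG": 0.001225, "InsT": 0.001225,
    "Del": 0.0049,
}


def encode_substitution(ref_base: str, read_base: str) -> str:
    """Return the symbol whose mapping sends ref_base to read_base."""
    for symbol, mapping in SUBSTITUTION.items():
        if mapping[ref_base] == read_base:
            return symbol
    raise ValueError(f"{ref_base} -> {read_base} is not a substitution")


def decode_substitution(symbol: str, ref_base: str) -> str:
    """Recover the read base from the symbol and the known reference base."""
    return SUBSTITUTION[symbol][ref_base]


def entropy_bits(probs: dict) -> float:
    """Shannon entropy in bits per symbol.

    Assumption: values are renormalized to sum to 1, because the rounded
    table entries do not sum to exactly 1.
    """
    total = sum(probs.values())
    return -sum((p / total) * math.log2(p / total)
                for p in probs.values() if p > 0)


print(f"entropy: {entropy_bits(PROBABILITIES):.3f} bits per symbol")
```

With the rounded values from the table, the entropy comes out a little under 2 bits per symbol. Splitting the "10 or more Ns" symbol into one symbol per run length could only increase this figure, which is why merging them yields a lower bound.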