Table 2. Top ten mutations for each RC dataset according to weights derived from the initial linear SVR.
Rank | Monogram mutation | Influence | Rank in Erlangen | Erlangen mutation | Influence | Rank in Monogram |
1 | RT M184V | dec. | 19 | RT Q207E | inc. | 240 |
2 | PR K43T | dec. | 568 | PR V82A | dec. | 127 |
3 | RT A158S | dec. | 126 | RT Y181C | inc. | 150 |
4 | PR Q92R | dec. | 401 | RT T215Y | dec. | 18 |
5 | PR I64L | dec. | 886 | RT K20I | inc. | 49 |
6 | PR K55R | dec. | 602 | PR I13V | dec. | 132 |
7 | PR E34K | dec. | 483 | RT E122K | inc. | – |
8 | PR I47V | dec. | 366 | RT L74V | inc. | 141 |
9 | PR V32I | dec. | 131 | RT S162C | inc. | 255 |
10 | PR P39S | dec. | 141 | RT T39E | dec. | 267 |
Each mutation is listed with its influence on RC relative to the wild type (“dec.” for “decreasing”, “inc.” for “increasing”) and its position in the feature ranking of the other dataset. With the exception of RT A158S, PR I64L, PR P39S, RT Q207E, RT E122K, RT S162C, and RT T39E, all of these mutations are known to be associated with HIV drug resistance and/or fitness [20], [31]. In total, the two feature rankings comprise 878 mutations from the Monogram dataset and 1018 mutations from the Erlangen dataset; the difference arises mainly because the Monogram genotypes cover fewer sequence positions. The mutation RT E122K does not occur in the Monogram ranking: in the Monogram dataset, lysine (K), not the wild-type glutamic acid (E), is the consensus amino acid at position 122 of the RT sequence, so E122K was removed from the training dataset in the input coding phase. The clear dominance of RC-decreasing mutations in the Monogram dataset may be partly due to the stronger bias towards low-RC samples in this dataset (median measured RC of 38.45%, compared to 46.47% in the Erlangen dataset; see also Figure 1).
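The way such a cross-ranked table can be assembled from two sets of linear model weights is sketched below. This is a minimal illustration and not the authors' implementation: it assumes that features are ranked by the absolute value of their weight and that the sign of the weight gives the direction of influence on RC. The function names `rank_features` and `top_table` are hypothetical, and the weights are invented toy values, not results from either dataset.

```python
def rank_features(weights):
    """Map each mutation to its rank; rank 1 has the largest absolute weight.

    Assumption: ranking by |weight|. Ties are broken alphabetically.
    """
    ordered = sorted(weights, key=lambda m: (-abs(weights[m]), m))
    return {m: i + 1 for i, m in enumerate(ordered)}


def top_table(weights, other_weights, k=10):
    """Return the top-k rows: (rank, mutation, influence, rank in other dataset)."""
    ranks = rank_features(weights)
    other_ranks = rank_features(other_weights)
    rows = []
    for mutation in sorted(ranks, key=ranks.get)[:k]:
        # Assumption: a negative weight lowers predicted RC, a positive one raises it.
        influence = "dec." if weights[mutation] < 0 else "inc."
        # A mutation absent from the other ranking is shown as "–".
        rows.append((ranks[mutation], mutation, influence,
                     other_ranks.get(mutation, "–")))
    return rows


# Invented toy weights for illustration only (not values from the study).
monogram = {"RT M184V": -0.9, "PR K43T": -0.7, "RT T215Y": -0.2, "PR V82A": -0.1}
erlangen = {"RT Q207E": 0.8, "PR V82A": -0.6, "RT T215Y": -0.5,
            "RT E122K": 0.4, "RT M184V": -0.1}

for row in top_table(erlangen, monogram, k=4):
    print(row)
```

In this toy input, RT E122K is missing from the first weight set, so its cross-rank is reported as “–”, mirroring how a feature removed during input coding has no position in the other ranking.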