Table 1.
Test set | Method | Tolerant prediction accuracy | Deleterious prediction accuracy | Total prediction accuracy | Experimental prediction accuracy |
---|---|---|---|---|---|
LacI* n = 4004 | SIFT | 78% (1747/2254) | 57% (989/1750) | 68% (2736/4004) | 66% (989/1496) |
LacI* n = 4004 | BLOSUM62 | 31% (696/2254) | 84% (1475/1750) | 54% (2171/4004) | 49% (1475/3033) |
HIV-1 Protease n = 336 | Automated SIFT | 70% (78/111) | 82% (184/225) | 78% (262/336) | 85% (184/217) |
HIV-1 Protease n = 336 | SIFT without RSV, avian sequences | 68% (75/111) | 88% (197/225) | 81% (272/336) | 85% (197/233) |
HIV-1 Protease n = 336 | BLOSUM62 | 63% (70/111) | 73% (165/225) | 70% (235/336) | 80% (165/206) |
Bacteriophage T4 lysozyme n = 2015 | SIFT | 59% (817/1377) | 72% (460/638) | 63% (1277/2015) | 45% (460/1020) |
Bacteriophage T4 lysozyme n = 2015 | BLOSUM62 | 30% (406/1377) | 85% (542/638) | 47% (948/2015) | 36% (542/1513) |
The effects of 4004 substitutions were assayed for LacI (Markiewicz et al. 1994; Pace et al. 1997), 336 substitutions for HIV-1 protease (Loeb et al. 1989), and 2015 substitutions for bacteriophage T4 lysozyme (Rennell et al. 1991). These three data sets are used to test prediction performance. Tolerant prediction accuracy is the number of substitutions correctly predicted to have no effect divided by the total number of substitutions that gave a wild-type phenotype under experimental test conditions. Subtracting the numerator from the denominator gives the number of substitutions that were predicted to be deleterious but gave a wild-type phenotype under experimental conditions. Deleterious prediction accuracy is the number of substitutions correctly predicted to have an effect on the protein divided by the number of substitutions that affected the protein. Subtracting the numerator from the denominator gives the number of substitutions that were predicted to have a wild-type phenotype but gave a deleterious phenotype under experimental conditions. Total prediction accuracy is the total number of substitutions correctly predicted divided by the total number of substitutions. Experimental prediction accuracy is the number of substitutions that were experimentally shown to affect protein function divided by the number of substitutions predicted to affect function. For the biologist investigating substitutions predicted to have a deleterious effect, the experimental prediction accuracy reflects the proportion of predictions that will yield affected phenotypes experimentally.
*SIFT offers predictions for positions 5–329 of the LacI repressor because fewer than half of the sequences are represented at positions 1–4 and 330–360.
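
The four accuracy measures defined in the legend can be recomputed from the counts given in parentheses in the table. The following is a minimal sketch, not the authors' code: the names `Counts` and `accuracies` are hypothetical, and the only inputs are the numerator and denominator of the tolerant and deleterious columns. The denominator of the experimental prediction accuracy (the number of substitutions predicted to be deleterious) is derived as the correct deleterious predictions plus the tolerated substitutions that were wrongly predicted to be deleterious.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Counts for one method on one test set (hypothetical container)."""
    tolerant_correct: int     # predicted tolerated, wild-type phenotype observed
    tolerant_total: int       # all substitutions with a wild-type phenotype
    deleterious_correct: int  # predicted deleterious, affected phenotype observed
    deleterious_total: int    # all substitutions with an affected phenotype


def accuracies(c: Counts) -> dict[str, float]:
    """Return the four accuracy measures described in the table legend."""
    # Tolerated substitutions that were wrongly predicted to be deleterious.
    false_deleterious = c.tolerant_total - c.tolerant_correct
    # Everything the method predicted to be deleterious, right or wrong.
    predicted_deleterious = c.deleterious_correct + false_deleterious
    total = c.tolerant_total + c.deleterious_total
    return {
        "tolerant": c.tolerant_correct / c.tolerant_total,
        "deleterious": c.deleterious_correct / c.deleterious_total,
        "total": (c.tolerant_correct + c.deleterious_correct) / total,
        "experimental": c.deleterious_correct / predicted_deleterious,
    }


# Counts taken from the SIFT row for LacI in Table 1.
laci_sift = Counts(1747, 2254, 989, 1750)
result = accuracies(laci_sift)
for name, value in result.items():
    print(f"{name}: {value:.0%}")
```

For the LacI SIFT row, the derived denominator is 989 + (2254 - 1747) = 1496, which matches the count printed in the experimental prediction accuracy column, and the four percentages round to 78%, 57%, 68%, and 66%.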