Contacts are defined using amino acid-specific Cβ-Cβ distance cutoffs, as described in SI Table 3 of
Kamisetty et al. (2013). (A) Observed vs predicted accuracies over a large data set of proteins of known structure with deep alignments (
Supplementary file 3), subsampled to different extents (seq/√(len) = 4 (red), 8 (green), 15 (purple), 32 (cyan), and 96 (orange)). Circles represent observed contact prediction accuracies; solid lines represent a fit to a sigmoid function of the normalized coupling value, the number of sequences, the length, and the sequence separation (see
Figure 16—figure supplement 1 and Figure 16—figure supplement 2). (B) Observed vs predicted accuracies in an independent data set of variable-length alignments for 7047 PDB chains (
Supplementary file 3), using the maximum number of sequences obtained with HHblits rather than subsampling a large alignment. Circles again represent observed contact prediction accuracies; solid lines represent the predicted accuracy from the model obtained by fitting to the data in (A). The contact prediction accuracy is correctly modeled for the independent data set, justifying use of the model on the unknown cases described in this article. The equation used to calculate P(contact|sco,seq,len,sep) is