Subsequence frequency predictions. (A) Predicted subsequence frequencies for a set of seven positions known to be important for kinase activity, compared to the data set frequencies. The Potts distribution (top) models the observed distribution well, in contrast to the independent model (bottom). (B) Average correlation between observed and predicted frequencies of the top 20 subsequences, computed over large samples of subsequences of varying length, for the Potts model (blue) and the independent model (red, dotted). Circles show the means, and error bars show the range from the first to the third quartile (25–75% of sets of positions). The dashed line (black) is an estimate of the expected correlation due only to finite sampling. It is computed by comparing the subsequence frequencies of a finite synthetic MSA of size 8149 to those of a large MSA of sequences generated from a Potts model fitted to that synthetic MSA. Both the trend and the range of the expected correlations due to the finite sample size (8149) are consistent with the correlation between the observed frequencies and those predicted by the Potts model. To see this figure in color, go online.
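The comparison summarized in (B) can be illustrated with a minimal sketch. It assumes that an MSA is a list of equal-length strings, that "correlation" means the Pearson correlation (the caption does not name the coefficient), and that model frequencies are estimated from a large sample of model-generated sequences. The function names `subsequence_frequencies` and `top_k_correlation` are hypothetical and are not taken from the original work.

```python
from collections import Counter

import numpy as np


def subsequence_frequencies(msa, positions):
    """Map each subsequence (the residues at `positions`) to its frequency in `msa`."""
    counts = Counter("".join(seq[p] for p in positions) for seq in msa)
    total = len(msa)
    return {sub: n / total for sub, n in counts.items()}


def top_k_correlation(observed_msa, model_msa, positions, k=20):
    """Correlate observed and model frequencies of the k most frequent observed subsequences.

    Assumption: Pearson correlation; subsequences absent from the model
    sample are assigned frequency 0.
    """
    obs = subsequence_frequencies(observed_msa, positions)
    mod = subsequence_frequencies(model_msa, positions)
    # Rank subsequences by observed frequency and keep the top k.
    top = sorted(obs, key=obs.get, reverse=True)[:k]
    x = np.array([obs[s] for s in top])
    y = np.array([mod.get(s, 0.0) for s in top])
    return float(np.corrcoef(x, y)[0, 1])
```

Under these assumptions, the finite-sampling baseline (dashed line) would correspond to calling `top_k_correlation` with the synthetic MSA of size 8149 as `observed_msa` and a much larger sample from the refitted Potts model as `model_msa`, then averaging over many randomly chosen position sets of each length.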