2011 Jul 24;27(18):2486–2493. doi: 10.1093/bioinformatics/btr421

Table 1.

Effects of the number of sequences on prediction results

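| No. of sequences | Area under PPV–SEN curve: ensemble | Area under PPV–SEN curve: first cluster | Area under PPV–SEN curve: second cluster | Bias | SD | No. of samples: first cluster | No. of samples: second cluster | No. of samples: first+second cluster | 95% credibility limit: ensemble | 95% credibility limit: first cluster | 95% credibility limit: second cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.44 | 0.46 | 0.37 | 0.27 | 0.04 | 728.13 | 150.76 | 878.89 | 0.21 | 0.14 | 0.11 |
| 3 | 0.58 | 0.59 | 0.49 | 0.20 | 0.03 | 793.15 | 124.94 | 918.09 | 0.14 | 0.10 | 0.07 |
| 4 | 0.58 | 0.58 | 0.48 | 0.20 | 0.03 | 791.66 | 115.00 | 906.66 | 0.14 | 0.09 | 0.06 |
| 5 | 0.62 | 0.63 | 0.51 | 0.17 | 0.03 | 802.20 | 113.24 | 915.44 | 0.12 | 0.08 | 0.05 |
| 6 | 0.67 | 0.67 | 0.54 | 0.16 | 0.03 | 800.50 | 111.66 | 912.16 | 0.11 | 0.07 | 0.05 |
| 7 | 0.70 | 0.69 | 0.57 | 0.15 | 0.03 | 795.52 | 111.92 | 907.44 | 0.10 | 0.07 | 0.05 |
| 8 | 0.73 | 0.71 | 0.60 | 0.15 | 0.03 | 797.56 | 116.19 | 913.75 | 0.10 | 0.07 | 0.04 |
| 9 | 0.73 | 0.73 | 0.60 | 0.14 | 0.02 | 790.59 | 122.38 | 912.97 | 0.09 | 0.06 | 0.04 |
| 10 | 0.75 | 0.74 | 0.63 | 0.13 | 0.02 | 792.85 | 125.11 | 917.96 | 0.09 | 0.06 | 0.04 |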

For each row, we report the average area under the PPV–SEN curve for accuracy comparison, together with the bias-variance statistics and the sizes of the two largest clusters, which summarize the clustering results. To normalize the bias, SD and credibility limits with respect to sequence length, we divide them by the average sequence length for the family.
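The two calculations described in this note can be illustrated with a minimal sketch. It is not the authors' code: the function names `area_under_curve` and `normalize_by_length` are hypothetical, the numbers in the usage lines are made up for illustration, and the trapezoidal rule is an assumption, since the note does not say how the area under the PPV–SEN curve is integrated.

```python
from statistics import mean


def area_under_curve(sen, ppv):
    """Approximate the area under a PPV-SEN curve with the trapezoidal rule.

    `sen` and `ppv` are equal-length sequences of sensitivity and positive
    predictive value measured at the same thresholds. The trapezoidal rule
    is an assumption; the source does not state the integration method.
    """
    points = sorted(zip(sen, ppv))  # order the points by sensitivity
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def normalize_by_length(value, sequence_lengths):
    """Divide a statistic (bias, SD or credibility limit) by the
    average sequence length of the family."""
    return value / mean(sequence_lengths)


# Illustrative values only; they are not taken from the table.
auc = area_under_curve([0.0, 0.5, 1.0], [1.0, 0.8, 0.2])
normalized_bias = normalize_by_length(12.0, [100, 120, 140])
```

Dividing by the average length puts families with long and short sequences on a comparable scale, which is presumably why the bias, SD and credibility-limit columns are all well below 1.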