2011 Jul 24;27(18):2486–2493. doi: 10.1093/bioinformatics/btr421

Table 1.

Effects of the number of sequences on prediction results

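No. of sequences	Area under PPV–SEN curve (ensemble)	Area under PPV–SEN curve (first cluster)	Area under PPV–SEN curve (second cluster)	Bias	SD	No. of samples (first cluster)	No. of samples (second cluster)	No. of samples (first+second cluster)	95% credibility limit (ensemble)	95% credibility limit (first cluster)	95% credibility limit (second cluster)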
2	0.44	0.46	0.37	0.27	0.04	728.13	150.76	878.89	0.21	0.14	0.11
3	0.58	0.59	0.49	0.20	0.03	793.15	124.94	918.09	0.14	0.10	0.07
4	0.58	0.58	0.48	0.20	0.03	791.66	115.00	906.66	0.14	0.09	0.06
5	0.62	0.63	0.51	0.17	0.03	802.20	113.24	915.44	0.12	0.08	0.05
6	0.67	0.67	0.54	0.16	0.03	800.50	111.66	912.16	0.11	0.07	0.05
7	0.70	0.69	0.57	0.15	0.03	795.52	111.92	907.44	0.10	0.07	0.05
8	0.73	0.71	0.60	0.15	0.03	797.56	116.19	913.75	0.10	0.07	0.04
9	0.73	0.73	0.60	0.14	0.02	790.59	122.38	912.97	0.09	0.06	0.04
10	0.75	0.74	0.63	0.13	0.02	792.85	125.11	917.96	0.09	0.06	0.04

For each row, we report the average area under the PPV–SEN curve for accuracy comparison, together with the bias-variance statistics and the sizes of the two largest clusters to summarize the clustering results. To normalize the bias, SD and credibility limits with respect to sequence length, we divide them by the average sequence length of the family.
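The length normalization described in the caption can be sketched as follows. This is only an illustration under stated assumptions: the function name `normalize_by_length`, the dictionary keys, and the example lengths and raw values are all hypothetical and are not taken from the paper or its data.

```python
from statistics import mean


def normalize_by_length(raw_stats, sequence_lengths):
    """Divide each raw statistic by the family's average sequence length.

    raw_stats: mapping from statistic name to its unnormalized value
               (hypothetical names; the paper does not define an API).
    sequence_lengths: lengths of the sequences in one family.
    """
    if not sequence_lengths:
        raise ValueError("need at least one sequence length")
    avg_len = mean(sequence_lengths)
    return {name: value / avg_len for name, value in raw_stats.items()}


# Made-up example: a family of three sequences and illustrative raw values.
lengths = [90, 100, 110]
raw = {"bias": 20.0, "sd": 3.0, "credibility_limit": 14.0}
normalized = normalize_by_length(raw, lengths)
```

Dividing by the family's average length puts the statistics of short and long families on a common per-nucleotide scale, which is what allows the rows of the table to be averaged across families.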