Author manuscript; available in PMC: 2012 Aug 1.

Published in final edited form as: Proteins. 2011 May 31;79(8):2389–2402. doi: 10.1002/prot.23049

Predicting functional categories (based on the Pfam database) by sequence-based clustering in five large topologies defined by the CATH database. BLAST, PSI-BLAST and FFAS were used to perform all-to-all alignments of the sequences, and clusters were then calculated with a single-linkage algorithm using different sequence identity cutoffs. The resulting clusters were compared with Pfam families. The number of Pfam families ‘split’ between different sequence-based clusters and the number of Pfam families ‘merged’ into the same sequence-based cluster (Y-axes) are shown as functions of the sequence identity cutoff used in the calculations (X-axes). The number of ‘merged’ Pfam families decreases as the sequence identity cutoff increases, whereas the number of ‘split’ Pfam families increases. The intersection of these two curves corresponds to the most accurate prediction of Pfam families by sequence-based clusters. The Y coordinate of this point provides an assessment of the agreement between sequence-based clusters and Pfam families and allows the accuracy of different sequence clustering algorithms to be compared (see the Methods section for more details). Results obtained with sequence identity normalized by query sequence length (globally normalized sequence identity) and by alignment length (standard sequence identity) are shown as continuous and dashed curves, respectively.
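The procedure summarized in this legend can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function names and the toy data are hypothetical, and the exact counting rules are assumptions: here a family counts as ‘split’ when its members fall into more than one cluster, and as ‘merged’ when it shares a cluster with at least one other family. The paper's Methods section may define these counts differently.

```python
from collections import defaultdict


def standard_identity(n_identical, alignment_length):
    """Identity normalized by alignment length (the 'standard' variant)."""
    return n_identical / alignment_length


def global_identity(n_identical, query_length):
    """Identity normalized by query length (the 'globally normalized' variant)."""
    return n_identical / query_length


def single_linkage(n_items, pairs, cutoff):
    """Single linkage via union-find: join i and j when identity >= cutoff."""
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, identity in pairs:
        if identity >= cutoff:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n_items)]


def split_and_merged(labels, families):
    """Count split and merged families (counting rules are assumptions)."""
    family_to_clusters = defaultdict(set)
    cluster_to_families = defaultdict(set)
    for cluster, family in zip(labels, families):
        family_to_clusters[family].add(cluster)
        cluster_to_families[cluster].add(family)
    split = sum(1 for cs in family_to_clusters.values() if len(cs) > 1)
    merged = set()
    for fams in cluster_to_families.values():
        if len(fams) > 1:
            merged |= fams
    return split, len(merged)


def scan(n_items, pairs, families, cutoffs):
    """Return (cutoff, n_split, n_merged) for each identity cutoff."""
    rows = []
    for cutoff in cutoffs:
        labels = single_linkage(n_items, pairs, cutoff)
        rows.append((cutoff, *split_and_merged(labels, families)))
    return rows


# Hypothetical toy data: five sequences, two families, pairwise identities.
FAMILIES = ["A", "A", "A", "B", "B"]
PAIRS = [(0, 1, 0.9), (1, 2, 0.4), (2, 3, 0.3), (3, 4, 0.8)]

if __name__ == "__main__":
    for row in scan(len(FAMILIES), PAIRS, FAMILIES, [0.2, 0.35, 0.5, 0.95]):
        print(row)
```

On the toy data, the merged count falls and the split count rises as the cutoff increases, which is the qualitative behaviour the legend describes. The cutoff at which the two counts meet plays the role of the intersection point of the curves.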