Table 2.
Database | Alignment | One domain: A | Two domains: Sp | Two domains: Sn | Two domains: A | Three domains: Sp | Three domains: Sn | Three domains: A | Four domains: Sp | Four domains: Sn | Four domains: A |
---|---|---|---|---|---|---|---|---|---|---|---|
Homolog availability | | | | | | | | | | | |
SCOP 1.73 (30%) | PS | 97 | 88 | 60 | 55 | 95 | 61 | 59 | 90 | 63 | 59 |
SCOP 1.73 (30%) | SS | 99 | 95 | 40 | 39 | 94 | 41 | 40 | 86 | 62 | 57 |
Number of proteins in RPS | | | | | | | | | | | |
SCOP 1.65 (30%) | PS | 97 | 86 | 54 | 50 | 96 | 58 | 57 | 93 | 45 | 44 |
SCOP 1.69 (30%) | PS | 97 | 90 | 57 | 54 | 93 | 58 | 56 | 91 | 49 | 47 |
SCOP 1.73 (30%) | PS | 97 | 88 | 60 | 55 | 95 | 61 | 59 | 90 | 63 | 59 |
Maximum sequence identity in RPS | | | | | | | | | | | |
SCOP 1.69 (20%) | PS | 97 | 86 | 43 | 41 | 90 | 42 | 40 | 71 | 19 | 17 |
SCOP 1.69 (30%) | PS | 97 | 90 | 57 | 54 | 93 | 58 | 56 | 91 | 49 | 47 |
SCOP 1.69 (40%) | PS | 97 | 91 | 67 | 63 | 92 | 66 | 62 | 93 | 56 | 54 |
A, accuracy; Sp, specificity; Sn, sensitivity. Alignment: PS, profile-sequence; SS, sequence-sequence. All values are percentages. Top: The availability of homology information for query sequences is simulated by using either the query profile (profile-sequence alignment, consistent with high availability) or the query sequence itself (sequence-sequence alignment, consistent with low availability) to search for identical fragments in the RPS. For multidomain proteins, profile-sequence alignment yields on average 13% higher overall accuracy than sequence-sequence alignment. Middle: Every other version of the SCOP database, with 30% maximum sequence identity among the proteins, is used to study the effect of the number of proteins in the RPS. The larger the RPS (see Table 1 for the detailed breakdown of the number of proteins and domain compositions), the higher the average domain boundary prediction accuracy for multidomain proteins, presumably because additional structure/sequence information is uncovered as novel structures are added to the database. Bottom: Three simulations were conducted with databases that differ in the maximum sequence identity among the reference proteins, which ranges from 20% to 40%.
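The comparison between the two alignment modes can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on an assumption about the column layout: the ten value columns are read as A for one-domain proteins, followed by Sp, Sn, A for two-, three-, and four-domain proteins. The names `ACCURACY_COLUMNS` and `accuracy_gain` are hypothetical helpers, not part of the original work.

```python
# Values copied from the two "Homolog availability" rows of Table 2.
# Assumed column order: One (A), then Two, Three, Four as (Sp, Sn, A) each.
ps_row = [97, 88, 60, 55, 95, 61, 59, 90, 63, 59]  # profile-sequence
ss_row = [99, 95, 40, 39, 94, 41, 40, 86, 62, 57]  # sequence-sequence

# Zero-based positions of the accuracy column for each multidomain class
# under the assumed layout.
ACCURACY_COLUMNS = {"two": 3, "three": 6, "four": 9}


def accuracy_gain(ps, ss):
    """Return PS minus SS accuracy, in percentage points, per domain class."""
    return {name: ps[i] - ss[i] for name, i in ACCURACY_COLUMNS.items()}


gains = accuracy_gain(ps_row, ss_row)
mean_gain = sum(gains.values()) / len(gains)

print(gains)
print(round(mean_gain, 1))
```

Under this reading, the unweighted mean gain across the three multidomain classes is about 12 percentage points, close to but not identical with the 13% quoted in the caption. The caption does not state how its average was weighted, so a small difference is to be expected.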