Abstract
PASS2 is a nearly automated version of CAMPASS and contains sequence alignments of proteins grouped at the level of superfamilies. This database has been created to correspond directly to the SCOP database (1.53 release) and currently consists of 110 multi-member superfamilies and 613 superfamilies corresponding to single members. In multi-member superfamilies, protein chains with no more than 25% sequence identity have been considered for the alignment and hence the database aims to address sequence alignments which represent 26 219 protein domains under the SCOP 1.53 release. Structure-based sequence alignments have been obtained by COMPARER; the initial equivalences are provided automatically from a MALIGN alignment and subsequently augmented using STAMP4.0. The final sequence alignments have been annotated for structural features using JOY4.0. Several links are provided to other related databases and genome sequence relatives. The availability of reliable sequence alignments of distantly related proteins, despite poor sequence identity, together with the inclusion of single-member superfamilies, permits better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure–function relationships of individual superfamilies. The database can be queried by keywords and also by sequence search, interfaced by PSI-BLAST methods. Structure-annotated sequence alignments and several structural accessory files can be retrieved for all the superfamilies including the user-input sequence. The database can be accessed from http://www.ncbs.res.in/%7Efaculty/mini/campass/pass.html.
INTRODUCTION
The number of protein sequences and structures deposited in databanks (1,2) indicates an ever-increasing gap between protein sequence and structural information (3,4), which is further widened by genome sequencing projects. Homologous proteins share a high degree of sequence, structural and functional similarity (5–9). Homologous families can be easily grouped by simple sequence searches whereas superfamily members, adopting the same fold and performing similar biological roles (10–17), can often be identified by sensitive fold prediction algorithms followed by a careful alignment of sequences.
SCOP [Structural Classification of Proteins (17)] is a dictionary of protein structural entries organised at different hierarchical levels of structural and functional similarity. SCOP (1.53 release) records 26 219 protein domains which are grouped into merely 564 folds, suggesting a strong structural convergence of proteins. CAMPASS (18) forms the first version of a protein superfamily database, corresponding to 69 superfamilies, which records the alignments of proteins aligned using COMPARER (19). The availability of such alignment databases over the World Wide Web offers the possibility to study and design experiments on specific superfamilies; it also permits systematic survey and analysis of various structural properties and supports fold prediction. In addition, three-dimensional models constructed using homology modelling techniques are usually reliable where the sequence identity between the query and the structural homologues (templates) is ≥30%. Analyses of structural and sequence differences amongst known superfamily members may therefore provide useful guidelines for modelling distantly related proteins. We report the semi-automated, updated version of the superfamily alignment database, which has been designed to be in direct correspondence with the SCOP database [1.53 release (17)].
FEATURES IN PASS2
PASS2 superfamily alignments follow the SCOP definitions of superfamily classification and domain boundaries, as opposed to CAMPASS (18), where the grouping of protein domains could differ and results from DIAL (20) were employed to consider alternate domain boundaries. This choice was made to facilitate automation. Nearly 600 single-member superfamilies have been included in PASS2. The superposed set of protein co-ordinates can be viewed using molecular graphics software such as RASMOL (21). A rough structural phylogeny, corresponding to root mean square deviations (r.m.s.d.) at structurally equivalent positions amongst superfamily members, is provided. Search facilities, both by text strings and by sequence alignment using PSI-BLAST, have been introduced in PASS2 to provide user-friendly access to the database. It is also possible to obtain augmented superfamily alignments including the query sequence or homologues from genome databases. PASS2 is connected with 72 genome databases of model organisms, enabling access to homologous gene products. Structure-based sequence alignments and superposed protein structure co-ordinates corresponding to superfamily alignments are extractable as in CAMPASS (18). In addition, PASS2 offers the possibility to download accessory structural files of individual superfamily members, such as solvent accessibility, hydrogen bonding and secondary structural data. Links to other databases, such as CATH (22,23), FSSP (24), PFAM (25), PALI (26), 3Dee (27,28), PROMOTIF (29), PROCHECK (30,31) and SYSTERS (32), both to the individual entries and to the main home pages, are provided from PASS2. Tables 1 and 2 provide a list of links and useful data downloadable by the user across the World Wide Web.
Table 1. Links in superfamily alignment database: main links for each superfamily.
Superfamily | Structure-based sequence alignment of the superfamily in various formats | PIR | |
HTML | |||
Colour Postscript format | |||
Latex | |||
Superposed coordinates of all the members within a superfamily | Superposed Coordinates | ||
Alignment of query sequence with all the superfamily members and homologous sequence searches against the superfamily database | ALIGN | (I) MALIGN, (II) JOY | |
(I) PSI-BLAST, (II) PHI-BLAST | |||
Genome distribution for each superfamily against 72 genome databases is available | Genome Distribution | ||
Representation of structural motifs for each superfamily will be available | Structural Motifs | ||
Representation of functional motifs for each superfamily will also be available | Functional Motifs | ||
Evolutionary relationships between the superfamily members are represented by the phylogenetic tree | Phylogenetic Tree |
Table 2. Links in superfamily alignment database: links for all the members within a superfamily.
Superfamily member | Links to programs that generate information useful for COMPARER in deriving structure-based sequence alignment | ATM2SEQ, SEQ | Generate sequence file in PIR format from the PDB file |
SSTRUC, SST | Produce the secondary structure information from PDB | ||
PSA, PSA | Generate solvent accessible information from the PDB file | ||
HBOND, HBD | Produce hydrogen bonding patterns between mainchain–mainchain and mainchain–sidechain of the protein | ||
For each PDB entry within the superfamily, there are various links with other database entries | RCSB | Research Collaboratory for Structural Bioinformatics—PDB | |
EBI | European Bioinformatics Institute | ||
SCOP | Structural Classification of Proteins | ||
HOMSTRAD | Homologous Structure Alignment Database | ||
DDBASE | Dial-Derived Domain Database | ||
CATH | Class Architecture Topology and Homologous superfamily | ||
FSSP | Fold classification based on Structure–Structure alignment of Proteins | ||
DSDBASE | Disulphide Database | ||
PALI | Phylogeny and Alignment of homologous protein structures | ||
PRESAGE | Collaborative resource for structural genomics | ||
MODBASE | A database of annotated comparative protein structure models | ||
3Dee | A database of protein domain definitions | ||
PQS | Protein Quaternary Structure query form at the EBI | ||
STING | Tool for the simultaneous display of information about macromolecule structure and sequence | ||
GRASS | Graphical Representation and Analysis of Structure Server | ||
LPC/CSU | Ligand–Protein Contacts and Contacts of Structural Units | ||
LPFC | Library of Protein Family Cores | ||
MMDB | Molecular Modelling DataBase, NCBI | ||
GeneCensus | GeneCensus genome comparisons | ||
SYSTERS | Classification of sequences from SWISS-PROT and PIR into disjoint family clusters and hierarchically into superfamily and subfamily clusters | ||
CE | Databases and tools for 3D protein structure comparison and alignment | ||
PROMOTIF | PROMOTIF v.1.0 for analysing protein structural motifs | ||
PROCHECK | Checks the stereochemical quality of a protein structure, producing a number of PostScript plots analysing its overall and residue-by-residue geometry | ||
PDBSUM | Summaries and structural analyses of PDB data files | ||
Pfam | Protein families database of alignments and HMMs | ||
WHATCHECK | WHAT-CHECK: free WHAT IF verification subset |
SELECTION OF SUPERFAMILIES AND CHOICE OF SUPERFAMILY MEMBERS
The superfamilies are named after their codes in SCOP database: for example, 02.03.068 refers to Biotin-carboxylase N-terminal domain-like superfamily. All the protein domains under a superfamily are considered and one representative protein domain entry, of the best resolution and R-factor from each family, is chosen for a preliminary alignment. NMR structures are considered equivalent to a 3.2 Å resolution crystal structure in the present context. Protein structural co-ordinates have been obtained from the Protein Data Bank (2) and protein domain co-ordinates of the desired chain and domain boundaries considered in the superfamily are extracted using the CHAINRESALL (R. Sowdhamini, unpublished results) program. ATM2SEQ (33) is used to obtain the corresponding amino acid sequences and MALIGN (33) for a multiple-sequence alignment using a constant gap penalty of 40. MOTIFS (R. Sowdhamini, unpublished results) provides a percentage identity matrix which is examined, using PERL scripts, to derive a non-redundant representative set of protein domains for the superfamily such that, as far as possible, no two proteins are >27% identical by the MALIGN alignment (see Fig. 1 for a flow chart).
Figure 1.
Steps involved in the choice of representative members of a superfamily alignment (see text for details). Representative members are usually those with the best resolution and R-factor and are non-redundant, such that no two members have >27% sequence identity.
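The selection steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PERL scripts actually used: the `Domain` record, the greedy filtering order and the dictionary-based identity matrix are hypothetical, while the 3.2 Å NMR equivalence and the 27% identity cut-off are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

NMR_EQUIVALENT_RESOLUTION = 3.2  # Å; NMR entries are treated as 3.2 Å structures (from the text)
IDENTITY_CUTOFF = 27.0           # percent; no two representatives should exceed this (from the text)


@dataclass
class Domain:
    """Hypothetical record for one protein domain entry."""
    name: str
    family: str
    resolution: Optional[float]  # None is used here to mark an NMR structure
    r_factor: Optional[float]


def quality_key(domain):
    """Lower is better: resolution first, then R-factor."""
    resolution = (NMR_EQUIVALENT_RESOLUTION
                  if domain.resolution is None else domain.resolution)
    r_factor = float("inf") if domain.r_factor is None else domain.r_factor
    return (resolution, r_factor)


def best_per_family(domains):
    """Pick one entry per family with the best resolution and R-factor."""
    best = {}
    for domain in domains:
        current = best.get(domain.family)
        if current is None or quality_key(domain) < quality_key(current):
            best[domain.family] = domain
    return list(best.values())


def non_redundant(domains, identity):
    """Greedy filter (an assumption): visit entries from best to worst quality
    and keep one only if it is not more than IDENTITY_CUTOFF percent identical
    to every entry already kept. `identity` maps frozenset({a, b}) to percent."""
    kept = []
    for domain in sorted(domains, key=quality_key):
        if all(identity[frozenset((domain.name, k.name))] <= IDENTITY_CUTOFF
               for k in kept):
            kept.append(domain)
    return kept
```

A greedy pass is only one way to satisfy the "as far as possible" wording in the text; other strategies could retain a different, equally valid representative set.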
AUTOMATION OF STRUCTURE-BASED SEQUENCE ALIGNMENTS IN PASS2 AND ASSESSMENT OF ALIGNMENTS
The non-redundant set of superfamily representatives was chosen for a rigorous structure-based sequence alignment using COMPARER (34). The method requires pairs of superposed sets of co-ordinates of the proteins to be aligned. Superposition is achieved by the choice of ‘initial equivalences’, which serve as seeds for pairwise rigid-body superposition using PMNFC, a modified form of MNYFIT (35). In order to construct PASS2, we initially tested the quality of alignments obtained by employing non-gap positions of the MALIGN-derived alignment as initial equivalences and found that the resulting alignments were reliable for all except one of 10 randomly selected superfamilies. Alignment accuracy was examined manually for the absence of gaps in the middle of secondary structures and for the conservation of core secondary structures. The superposed structures corresponding to the alignment were examined on a graphics display to ensure the absence of insertions or deletions in the middle of α-helices and β-strands. As far as possible, we have also ensured that the functionally important residues in the members of superfamilies with highly similar functions are topologically equivalent. Initial equivalences were therefore chosen from the MALIGN results. Pairs of superposed co-ordinates and the derived equivalences are employed to extend alignments guided by the similarity in the structural environment of individual amino acids. Structural environments of amino acids are described by their backbone conformation (secondary structure and cis-peptide bond), pattern of exposure to solvent and patterns of hydrogen bonding and disulphide bond connectivity. These are stored in accessory structural data files for all the superfamily members. The final COMPARER-derived alignments are annotated for the structural information using JOY4.0 (7,9). The final alignments will be assessed for unusual average r.m.s.d. for individual members and for deletions of conserved secondary structures (R. Sowdhamini, unpublished results). Problematic alignments are being examined by resorting to a different structure comparison program, STAMP4.0 (36), which does not require initial equivalences, before alignment through COMPARER. Where STAMP is not appropriate, the simulated annealing option in COMPARER is being employed and the COMPARER runs are repeated (see Fig. 2 for a flow chart).
Figure 2.
Steps involved in the structure-based sequence alignment of superfamily representatives. COMPARER (19) is primarily employed for the alignment, where initial equivalences are obtained from MALIGN (33). Where the resulting alignment has problems due to misalignment, STAMP4.0 (36) is used to obtain the initial equivalences, or the simulated annealing option of COMPARER is employed. The final alignment is annotated for the structural features using JOY (7,9).
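One of the assessment criteria described above, the absence of gaps in the middle of secondary structures, lends itself to a simple automated check. The sketch below is an illustration only and is not the assessment program referred to in the text; it assumes a per-residue secondary structure string using the one-letter codes `H` (helix), `E` (strand) and `C` (other), which is a hypothetical encoding rather than the actual SSTRUC output format.

```python
def gaps_inside_secondary_structure(aligned, sstruc, gap="-"):
    """Return (first_column, last_column, state) for every gap run that is
    flanked on both sides by residues in the same helix or strand state.

    aligned -- one row of the alignment, with gap characters
    sstruc  -- one secondary structure code per residue of the ungapped
               sequence (assumed codes: H, E, C)
    """
    n_residues = sum(1 for ch in aligned if ch != gap)
    if n_residues != len(sstruc):
        raise ValueError("secondary structure string does not match sequence")

    problems = []
    residues_seen = 0
    col = 0
    while col < len(aligned):
        if aligned[col] != gap:
            residues_seen += 1
            col += 1
            continue
        start = col
        while col < len(aligned) and aligned[col] == gap:
            col += 1
        # Terminal gap runs have no residue on one side and are ignored.
        if 0 < residues_seen < len(sstruc):
            before = sstruc[residues_seen - 1]
            after = sstruc[residues_seen]
            if before == after and before in ("H", "E"):
                problems.append((start, col - 1, before))
    return problems
```

An alignment row that returns a non-empty list would be a candidate for the STAMP or simulated annealing route shown in Figure 2, although the decision in the actual pipeline also involved manual inspection.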
STRUCTURAL PHYLOGENY AND GENOME SEARCHES
Pairwise percentage identities of the final structure-based sequence alignments are presented in the form of a symmetric matrix. MNYFIT (35) is employed to obtain a rigid-body superposition of the superfamily members, without an update of equivalences, where the initial equivalences are chosen as non-gap positions corresponding to the final alignment. The r.m.s.d. of the structurally equivalent regions, i.e. the non-gap positions of the final alignment, is employed to construct a phylogeny of the superfamily members. Such r.m.s.d. values, though not an accurate representation of the structural relationships between representative members of a superfamily, are nevertheless useful in providing a quantitative estimate of the evolutionary divergence of various homologous families under the query superfamily.
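The r.m.s.d. matrix underlying such a phylogeny can be sketched as follows. This is a minimal illustration, not the MNYFIT procedure: it assumes the co-ordinates (one point per residue, for example Cα atoms) have already been superposed, and it treats equivalences pairwise, i.e. columns where both members of a pair have a residue. All function names are hypothetical.

```python
import math


def equivalent_pairs(row_a, row_b, gap="-"):
    """Residue index pairs for alignment columns where neither row has a gap."""
    pairs = []
    i = j = 0
    for char_a, char_b in zip(row_a, row_b):
        if char_a != gap and char_b != gap:
            pairs.append((i, j))
        if char_a != gap:
            i += 1
        if char_b != gap:
            j += 1
    return pairs


def rmsd(coords_a, coords_b, pairs):
    """Root mean square deviation over equivalent positions of two
    already-superposed co-ordinate lists."""
    if not pairs:
        raise ValueError("no structurally equivalent positions")
    total = sum(math.dist(coords_a[i], coords_b[j]) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))


def rmsd_matrix(alignment, coords):
    """Symmetric matrix (dict of dicts) of pairwise r.m.s.d. values.

    alignment -- maps member name to its aligned row
    coords    -- maps member name to a list of (x, y, z) tuples, one per residue
    """
    names = list(alignment)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for index, a in enumerate(names):
        for b in names[index + 1:]:
            pairs = equivalent_pairs(alignment[a], alignment[b])
            matrix[a][b] = matrix[b][a] = rmsd(coords[a], coords[b], pairs)
    return matrix
```

The resulting matrix could then be passed to any distance-based tree-building method; the text does not state which method was used.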
Association of genome sequences from around 60 sources with superfamilies in PASS2 was performed using PSI-BLAST search (37), which is sensitive in identifying distant relatives and convenient for automatic searches (38). Such searches were performed with 10 iterations and a liberal E-value of 0.01, using each of the representative superfamily members as a query against the genome databases. The genome sequences, which are either homologues or additional superfamily members, are aligned with the original structure-based alignment and re-annotated using JOY. Where possible, links to such structure-annotated alignments with genome sequence homologues of superfamily members are provided. Figure 3 shows one such structure-based sequence alignment.
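For readers wishing to set up a comparable search today, the parameters quoted above map onto the current NCBI BLAST+ `psiblast` program roughly as sketched below. This is an assumption-laden illustration: the original work predates BLAST+ and its exact command line is not given, the text does not say whether 0.01 was a reporting or an inclusion threshold (it is passed here as `-evalue`), and the file and database names are hypothetical. The code only builds the command lists and does not run them.

```python
def psiblast_command(query_fasta, genome_db, out_path,
                     iterations=10, evalue=0.01):
    """Build a BLAST+ psiblast argument list with the settings quoted in the
    text (10 iterations, E-value 0.01). Nothing is executed here."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", genome_db,
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),
        "-out", out_path,
    ]


def all_searches(representatives, genome_dbs):
    """One command per (representative member, genome database) pair.
    The naming scheme for query and output files is hypothetical."""
    commands = []
    for member in representatives:
        for db in genome_dbs:
            commands.append(
                psiblast_command(f"{member}.fasta", db, f"{member}_vs_{db}.out")
            )
    return commands
```

Each list can be passed to `subprocess.run` once BLAST+ is installed and the genome databases have been formatted.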
Figure 3.
Representative structure-based sequence alignment along with the related genome sequences. This example is for the N-terminal domain of biotin carboxylase-like superfamily and genome sequences have been identified from Zea mays using PSI-BLAST (37) (1mmc as a query sequence and an E-value cut-off of 0.01). The structure-based sequence alignment was obtained using COMPARER (19) and formatted using JOY (9). Solvent-inaccessible and solvent-accessible residues are shown in upper case and lower case, respectively. Residues in positive phi are indicated in italics; residues with disulfide bonds are indicated by the presence of a cedilla (e.g. ç). Hydrogen bonds formed to the main chain amides and main chain carbonyls of the other residues are indicated in boldface or are underlined, respectively. The secondary structures are coloured: α-helices in red, 3₁₀-helices in maroon and β-strands in blue.
SCOPE OF PASS2
PASS2 is a compendium of structure-based sequence alignments of distantly related proteins grouped at the superfamily level in direct correspondence with SCOP definitions. In addition, PASS2 acts as a ‘junction’ point to obtain links of representative superfamily members to genome, sequence and structural databases. Structural phylogenies of superfamily members provide a crude but quantitative estimate of evolutionary relationships where sequence-based comparison breaks down due to poor sequence identity. Structural phylogenies have been reported earlier, for example, for the distantly related hemoglobin family (39) and for the phosphate-binding superfamilies of the triose-phosphate isomerase TIM-barrel fold (40). Both CAMPASS (18) and PASS2 databases are unique in annotating the structural environment of individual residues on the sequence alignment using JOY (7,9). In addition, PASS2 makes these structural files of individual proteins downloadable across the Web and provides links to other superfamily members in genome databases.
Statistical analyses on SCOP (41) show that a vast majority of the protein domains under each of the hierarchical structural classifications, with respect to class, fold and superfamily, are single members. PASS2 deliberately includes single-member superfamilies in order to ensure the representation of such examples in fold libraries and profiles generated for fold prediction. We are currently employing superfamily alignments in PASS2 to analyse the retention of structural features and the deviation of structural parameters, which will be useful in modelling distantly related proteins. We are also performing sequence searches in genome databases with structural templates of superfamily alignments as additional constraints.
ACKNOWLEDGEMENTS
We are grateful to Professor Sir Tom Blundell for the first version of the database, and to his group for useful discussions. R.S. is a recipient of a Wellcome Trust Senior Research Fellowship, and V.M. is supported by the Wellcome Trust.
REFERENCES
- 1. Bairoch A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
- 2. Bernstein F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112, 535–542.
- 3. Gonnet G.H., Cohen,M.A. and Benner,S.A. (1992) Exhaustive matching of the entire protein sequence database. Science, 256, 1443–1445.
- 4. Sander C. and Schneider,R. (1993) The HSSP database of protein structure-sequence alignments. Nucleic Acids Res., 21, 3105–3109.
- 5. Rossmann M.G. and Argos,P. (1977) The taxonomy of protein structure. J. Mol. Biol., 109, 99–129.
- 6. Richardson J.S. (1981) The anatomy and taxonomy of protein structure. Adv. Prot. Chem., 34, 167–339.
- 7. Overington J.P., Johnson,M.S., Sali,A. and Blundell,T.L. (1990) Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc. R. Soc. (London), B241, 132–145.
- 8. Overington J.P., Zhu,Z.Y., Sali,A., Johnson,M.S., Sowdhamini,R., Louie,G.V. and Blundell,T.L. (1993) Molecular recognition in protein families: a database of aligned three-dimensional structures of related proteins. Biochem. Soc. Trans., 21, 597–604.
- 9. Mizuguchi K., Deane,C.M., Blundell,T.L., Johnson,M.S. and Overington,J.P. (1998) JOY: protein sequence-structure representation and analysis. Bioinformatics, 14, 617–623.
- 10. Blundell T.L., Bedarkar,S., Rinderknecht,E. and Humbel,R.E. (1978) Insulin-like growth factor 1. A model for tertiary structure accounting for immunoreactivity and receptor binding. Proc. Natl Acad. Sci. USA, 75, 180–184.
- 11. Lesk A.M. and Chothia,C. (1980) How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol., 136, 225–270.
- 12. Chothia C. (1984) Principles that determine the structures of proteins. Annu. Rev. Biochem., 53, 537–572.
- 13. Murthy M.R.N. (1984) A fast method of comparing protein structure. FEBS Lett., 168, 97–102.
- 14. Holm L., Ouzounis,C., Sander,C., Tuparev,G. and Vriend,G. (1992) A database of protein-structure families with common folding motifs. Protein Sci., 1, 1691–1698.
- 15. Russell R.B. and Barton,G.J. (1994) Structural features can be unconserved in proteins with similar folds. An analysis of side-chain to side-chain contacts, secondary structure and accessibility. J. Mol. Biol., 244, 332–350.
- 16. Orengo C.A., Jones,D.T. and Thornton,J.M. (1994) Protein superfamilies and domain superfolds. Nature, 372, 631–634.
- 17. Murzin A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
- 18. Sowdhamini R., Burke,D.F., Huang,J.F., Mizuguchi,K., Nagarajaram,H.A., Srinivasan,N., Steward,R.E. and Blundell,T.L. (1998) CAMPASS: a database of structurally aligned protein superfamilies. Structure, 6, 1087–1094.
- 19. Sali A. and Blundell,T.L. (1990) Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol., 212, 403–428.
- 20. Sowdhamini R. and Blundell,T.L. (1995) Automatic identification and analysis of domains in proteins of known crystal structure. Protein Sci., 4, 506–520.
- 21. Sayle R.A. and Milner-White,E.J. (1995) RASMOL: biomolecular graphics for all. Trends Biochem. Sci., 20, 374.
- 22. Orengo C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) CATH—a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
- 23. Pearl F.M.G., Lee,D., Bray,J.E., Sillitoe,I., Todd,A.E., Harrison,A.P., Thornton,J.M. and Orengo,C.A. (2000) Assigning genomic sequences to CATH. Nucleic Acids Res., 28, 277–282.
- 24. Holm L. and Sander,C. (1996) Mapping the protein universe. Science, 273, 595–602.
- 25. Sonnhammer E.L., Eddy,S.R. and Durbin,R. (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins, 28, 405–420.
- 26. Balaji S., Sujatha,S., Kumar,S.S. and Srinivasan,N. (2001) PALI—a database of Phylogeny and ALIgnment of homologous protein structures. Nucleic Acids Res., 29, 61–65.
- 27. Dengler U., Siddiqui,A.S. and Barton,G.J. (2001) Protein structural domains: analysis of the 3Dee domains database. Proteins, 42, 332–344.
- 28. Siddiqui A.S., Dengler,U. and Barton,G.J. (2001) 3Dee: a database of protein structural domains. Bioinformatics, 17, 200–201.
- 29. Hutchinson E.G. and Thornton,J.M. (1996) PROMOTIF—a program to identify and analyze structural motifs in proteins. Protein Sci., 5, 212–220.
- 30. Morris A.L., MacArthur,M.W., Hutchinson,E.G. and Thornton,J.M. (1992) Stereochemical quality of protein structure coordinates. Proteins, 12, 345–364.
- 31. Laskowski R.A., MacArthur,M.W., Moss,D.S. and Thornton,J.M. (1993) PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Crystallogr., 26, 283–291.
- 32. Krause A., Stoye,J. and Vingron,M. (2000) The SYSTERS protein sequence cluster set. Nucleic Acids Res., 28, 270–272.
- 33. Johnson M.S., Overington,J.P. and Blundell,T.L. (1993) Alignment and searching for common protein folds using a data bank of structural templates. J. Mol. Biol., 231, 735–752.
- 34. Sali A. and Blundell,T.L. (1990) Definition of general topological equivalence in protein structures—a procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol., 212, 403–428.
- 35. Sutcliffe M.J., Haneef,I., Carney,D. and Blundell,T.L. (1987) Knowledge based modelling of homologous proteins, Part I: Three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Eng., 1, 377–384.
- 36. Russell R.B. and Barton,G.J. (1992) Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins, 14, 309–323.
- 37. Altschul S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
- 38. Park J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol., 273, 349–354.
- 39. Johnson M.S., Sutcliffe,M.J. and Blundell,T.L. (1990) Molecular anatomy: phyletic relationships derived from three-dimensional protein structures. J. Mol. Evol., 30, 43–59.
- 40. Bork P., Gellerich,J., Groth,H., Hooft,R. and Martin,F. (1995) Divergent evolution of a β/α-barrel subclass: detection of numerous phosphate-binding sites by motif search. Protein Sci., 4, 268–274.
- 41. Brenner S.E., Chothia,C. and Hubbard,T.J. (1997) Population statistics of protein structures: lessons from structural classifications. Curr. Opin. Struct. Biol., 7, 369–376.