Abstract
SBASE 7.0 is the seventh release of the SBASE protein domain sequence library. It contains 237 937 annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections. The entries are clustered into 1811 groups and are provided with two WWW-based search facilities for on-line use. SBASE 7.0 is freely available by anonymous ‘ftp’ file transfer from ftp.icgeb.trieste.it. Automated searching of SBASE with BLAST can be carried out with the WWW servers http://www.icgeb.trieste.it/sbase/ and http://sbase.abc.hu/sbase/.
INTRODUCTION
Prediction of domains is usually based on pattern collections that contain consensus representations of domain types deduced from multiple alignments. Consensus representation of sequences (such as consensus sequences, regular expressions, sequence profiles, hidden Markov models, etc.) requires human expertise and careful judgement; hence pattern collections can hardly keep pace with the flow of new genome data. Another problem is the inevitable statistical bias of consensus representations: reliable multiple alignments require a good number of domain examples and, as a consequence, atypical domains for which there are too few known examples may be difficult to recognize. Finally, there are domain types for which it is not easy to develop consensus representations because the similarity between their members is weak.
SBASE is a collection of protein domain sequences designed to facilitate detection of domain homologies without the above problems (1,2). The method of domain recognition is database search rather than pattern search, so atypical and typical domains are equally well recognized (3). The central concept is the ‘similarity group’, i.e. a group of domain sequences that have BLAST similarity to each other. One can distinguish tight and loose groups depending on how many significant similarity connections exist, on average, between the members. Briefly, a new sequence is considered a member of a given group if its similarity parameters (3) are above the threshold levels automatically established for that group, and if it has no sequential overlap with any other domain group. Validated domain groups, i.e. the 1550 groups that satisfy these criteria, are deposited in SBASE-A; these are the well-known structural and functional domain types. SBASE-B contains 261 groups that are either (i) less well characterized than the groups of SBASE-A, or (ii) defined by composition (e.g. glycine-rich) or cellular location (e.g. transmembrane). These groups are sometimes defined in an overlapping manner, e.g. an extracellular domain (SBASE-B) may contain an EGF-module (SBASE-A).
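The two membership criteria described above can be illustrated with a minimal Python sketch. It is not the SBASE implementation: the parameter names (`score`, `neighbours`), the threshold values and the `Segment` type are hypothetical, chosen only to show how a threshold test and an overlap test combine. Treating a value equal to the threshold as passing is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A candidate domain segment on a protein (1-based, inclusive)."""
    start: int
    end: int


def overlaps(a: Segment, b: Segment) -> bool:
    """Return True if the two segments share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def is_group_member(params, thresholds, segment, other_group_segments):
    """Apply both criteria: thresholds met and no overlap with other groups.

    params, thresholds: hypothetical dicts keyed by similarity-parameter name.
    other_group_segments: segments of the same protein that are already
    assigned to other domain groups.
    """
    # Every group-specific threshold must be reached (>= is an assumption).
    above = all(
        params.get(name, float("-inf")) >= limit
        for name, limit in thresholds.items()
    )
    # The candidate must not overlap a segment assigned to another group.
    free = not any(overlaps(segment, other) for other in other_group_segments)
    return above and free
```

In this sketch a candidate that scores well but overlaps a segment already assigned to a different group is rejected, which mirrors the non-overlap requirement for validated groups.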
The current release 7.0 of SBASE contains over 230 000 annotated protein sequence segments consistently named by structure, function, biased composition, binding-specificity and/or similarity to other proteins.
The main developments with respect to the previous release (release 6.0) can be summarized as follows:
(i) Release 7.0 contains 237 937 sequence entries, 82% more than release 6.0 (Table 1).
Table 1. Increase of data in SBASE release 7.0.
(ii) The entries were grouped by standard names and further classified on the basis of the BLAST similarity scores. The list of all clusters having at least two members is deposited into a separate database, SBASE-CLUSTERS, which is now available through anonymous ftp as well as through links on the WWW-server. A total of 1811 domain groups were found, of which 1550 validated groups (1936 clusters) are in SBASE-A and 261 groups (382 clusters) are in SBASE-B. The clusters are identified by the standard name and by the (optional) subclass number included in the SC field. The CL and CE fields of previous releases are now abandoned.
DESCRIPTION OF THE DATA
Definition of protein domains
Domains included in SBASE are protein sequence segments with known structure and/or function. The main entry classes are summarized in Table 2. The boundaries of the domains are either as defined in the original publications or determined by homology to domains with known boundaries, such as those given in the PROT-FAM (4) and PFAM (5) databases.
Table 2. Examples of domains in SBASE 7.0.
Source and origin of data
SBASE data originate from three main sources: (i) from the SWISS-PROT protein sequence databank (6); (ii) from the Protein Sequence Database of the Protein Identification Resource (PIR International) (7); and (iii) from the literature. The sequences are either translated from nucleotide sequence databases (8,9) or directly keyed in at the protein level. From a total of 237 937 records in SBASE 7.0, 136 367 (57%), 53 307 (22.4%) and 38 083 (20.6%) are of eukaryotic, prokaryotic and viral origin, respectively. Domain sizes vary in length between 5 and 1000 amino acids.
Cross-references
SBASE 7.0 has cross-references to several protein and nucleic acid databanks, as well as to the PROSITE (10), PRINTS (11), ProDom (12), BLOCKS (13) and PFAM (5) domain databases, the Protein Structure Data Bank (14) and the database of human Mendelian inheritance (15) (Table 3). In each record, the DR-lines contain the cross-reference data.
Table 3. Cross-references to other databases in SBASE.
Record structure
The format of SBASE 7.0 follows that of the EMBL and SWISS-PROT databases and can be directly formatted under the GCG program package using (16).
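Because the records follow EMBL and SWISS-PROT conventions, the cross-references in the DR-lines can be read with a short script. The sketch below assumes the usual SWISS-PROT layout, in which a DR-line holds a database name followed by semicolon-separated identifiers and ends with a full stop. The exact SBASE field layout is not given here, and the sample record is invented purely for illustration.

```python
def parse_dr_lines(record_text):
    """Return (database, [identifiers]) tuples from the DR-lines of a record.

    Assumes SWISS-PROT-style lines such as 'DR   DATABASE; ID1; ID2.'.
    """
    refs = []
    for line in record_text.splitlines():
        if not line.startswith("DR "):
            continue  # skip every line type other than DR
        # Drop the two-letter line code and the trailing full stop.
        body = line[2:].strip().rstrip(".")
        fields = [f.strip() for f in body.split(";") if f.strip()]
        if fields:
            refs.append((fields[0], fields[1:]))
    return refs


# Hypothetical record fragment; identifiers are placeholders, not real entries.
SAMPLE_RECORD = """ID   EXAMPLE_DOMAIN
DR   SWISS-PROT; P00000; EXAMPLE_HUMAN.
DR   PROSITE; PS00000; EXAMPLE_PATTERN.
SQ   SEQUENCE"""

if __name__ == "__main__":
    for database, identifiers in parse_dr_lines(SAMPLE_RECORD):
        print(database, identifiers)
```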
DISTRIBUTION AND ACCESS
Distribution
SBASE 7.0 (6 October, 1999) is distributed by anonymous ‘ftp’ file transfer from ftp.icgeb.trieste.it. The complete database (including the records and the list of clusters) is 221 Mb; its compressed form is 16.3 Mb.
BLAST search by WWW-server
SBASE 7.0 can be searched by the BLAST program using the WWW-server http://www.icgeb.trieste.it/sbase. A related server was created in order to assign SBASE domain homologies on the basis of BLAST searches performed on the SWISS-PROT database and on the PIR International databases (7). This service (available at http://www.abc.hu/blast.html and at domain@abc.hu) returns the best potential domain homologies ranked according to BLAST score.
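Ranking potential domain homologies by BLAST score can be sketched as follows. This is an illustration only, not the server's code: the `DomainHit` type and its fields are hypothetical, and keeping only the best-scoring hit per domain group is an assumption made to keep the output short.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    """One BLAST hit mapped to a domain group (hypothetical structure)."""
    group: str
    score: float


def rank_domain_hits(hits, top_n=None):
    """Keep the best hit per group, then sort by descending BLAST score."""
    best = {}
    for hit in hits:
        current = best.get(hit.group)
        if current is None or hit.score > current.score:
            best[hit.group] = hit
    ranked = sorted(best.values(), key=lambda h: h.score, reverse=True)
    return ranked if top_n is None else ranked[:top_n]
```

Calling `rank_domain_hits(hits, top_n=5)` would return at most five groups, highest score first.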
Access by WWW-server
Record retrieval and the above services can also be accessed using the WWW-server at http://www.icgeb.trieste.it/sbase. At present, cross-references to SBASE-CLUSTERS, EMBL, MEDLINE, MIM, PRINTS, ProDom, PROSITE and SWISS-PROT can be directly accessed through the WWW-server.
Citation
Users of SBASE and of the WWW servers are asked to cite this article in their publications, e.g. in the following form: ‘The sequence homologies were analyzed by searching the SBASE protein domain sequence library release 7.0 via automated electronic mail (WWW) server’.
ACKNOWLEDGEMENTS
This work was supported in part by EMBnet, the European Molecular Biology Network in the framework of EU grant ERBBIO4-CT96-0030. SBASE was established in 1990 and is maintained collaboratively by the International Center for Genetic Engineering and Biotechnology, Trieste, Italy and the Agricultural Biotechnology Center, Gödöllö, Hungary. The help of Suzanne Kerbavcic with the manuscript is gratefully acknowledged.
REFERENCES
- 1. Pongor,S., Skerl,V., Cserzo,M., Hatsagi,Z., Simon,G. and Bevilacqua,V. (1993) Protein Eng., 6, 391–395.
- 2. Murvai,J., Vlahovicek,K., Barta,E., Szepesvári,C., Acatrinei,C. and Pongor,S. (1999) Nucleic Acids Res., 27, 257–259.
- 3. Murvai,J., Vlahovicek,K., Barta,E., Parthasarathy,S., Hegyi,H., Pfeiffer,F. and Pongor,S. (1999) Bioinformatics, 15, 343–344.
- 4. Mewes,H.W., Heumann,K., Kaps,A., Mayer,K., Pfeiffer,F., Stocker,S. and Frishman,D. (1999) Nucleic Acids Res., 27, 44–48. Updated article in this issue: Nucleic Acids Res. (2000), 28, 37–40.
- 5. Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Finn,R.D. and Sonnhammer,E.L. (1999) Nucleic Acids Res., 27, 260–262. Updated article in this issue: Nucleic Acids Res. (2000), 28, 263–266.
- 6. Bairoch,A. and Apweiler,R. (1999) Nucleic Acids Res., 27, 49–54. Updated article in this issue: Nucleic Acids Res. (2000), 28, 45–48.
- 7. Barker,W.C., Garavelli,J.S., McGarvey,P.B., Marzec,C.R., Orcutt,B.C., Srinivasarao,G.Y., Yeh,L.S., Ledley,R.S., Mewes,H.W., Pfeiffer,F., Tsugita,A. and Wu,C. (1999) Nucleic Acids Res., 27, 39–43.
- 8. Benson,D.A., Boguski,M.S., Lipman,D.J., Ostell,J., Ouellette,B.F., Rapp,B.A. and Wheeler,D.L. (1999) Nucleic Acids Res., 27, 12–17. Updated article in this issue: Nucleic Acids Res. (2000), 28, 15–18.
- 9. Stoesser,G., Tuli,M.A., Lopez,R. and Sterk,P. (1999) Nucleic Acids Res., 27, 18–24. Updated article in this issue: Nucleic Acids Res. (2000), 28, 19–23.
- 10. Hofmann,K., Bucher,P., Falquet,L. and Bairoch,A. (1999) Nucleic Acids Res., 27, 215–219.
- 11. Attwood,T.K., Flower,D.R., Lewis,A.P., Mabey,J.E., Morgan,S.R., Scordis,P., Selley,J.N. and Wright,W. (1999) Nucleic Acids Res., 27, 220–225. Updated article in this issue: Nucleic Acids Res. (2000), 28, 225–227.
- 12. Corpet,F., Gouzy,J. and Kahn,D. (1999) Nucleic Acids Res., 27, 263–267. Updated article in this issue: Nucleic Acids Res. (2000), 28, 267–269.
- 13. Henikoff,J.G., Henikoff,S. and Pietrokovski,S. (1999) Nucleic Acids Res., 27, 226–228. Updated article in this issue: Nucleic Acids Res. (2000), 28, 228–230.
- 14. Bernstein,F.C., Koetzle,T.F., Williams,G.J., Meyer,E.E.,Jr, Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535–542.
- 15. Pearson,P., Francomano,C., Foster,P., Bocchini,C., Li,P. and McKusick,V. (1994) Nucleic Acids Res., 22, 3470–3473.
- 16. Flybase-Consortium (1999) Nucleic Acids Res., 27, 85–88.
- 17. Rudd,K.E., Bouffard,G. and Miller,G. (1992) In Davies,K.E. and Tilghman,S.M. (eds), Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, pp. 1–38.
- 18. Roberts,R.J. and Macelis,D. (1999) Nucleic Acids Res., 27, 312–313. Updated article in this issue: Nucleic Acids Res. (2000), 28, 306–307.