Nucleic Acids Research. 2003 Jan 1;31(1):463–468. doi: 10.1093/nar/gkg117

Enlarged FAMSBASE: protein 3D structure models of genome sequences for 41 species

Akihiro Yamaguchi, Mitsuo Iwadate 1, Ei-ichiro Suzuki 1,a, Kei Yura a, Shigetsugu Kawakita 1, Hideaki Umeyama 1,*,a, Mitiko Go
PMCID: PMC165564  PMID: 12520053

Abstract

Enlarged FAMSBASE is a relational database of comparative protein structure models for the whole genomes of 41 species presented in the GTOP database. The models are calculated by the Full Automatic Modeling System (FAMS). Enlarged FAMSBASE provides a wide range of query keys, such as ORF (open reading frame) name, ORF keywords, Protein Data Bank (PDB) ID, PDB heterogen atoms and sequence similarity. Heterogen atoms in PDB include cofactors, ligands and other factors that interact with proteins, and are a good starting point for analyzing interactions between proteins and other molecules. The data may also serve as templates for drug design. FAMSBASE currently holds 183 805 protein 3D models for 51 430 ORFs, an average of about three models per ORF. FAMSBASE is available at http://famsbase.bio.nagoya-u.ac.jp/famsbase/.

INTRODUCTION

Genome sequencing projects have generated an enormous amount of protein sequence information (1). About half of the encoded amino acid sequences are for proteins of unknown function (2), and computational and experimental methods have been developed to obtain functional information on these proteins (3). Proteins function only when they fold correctly, and the three-dimensional (3D) structure of a protein is one of the most important pieces of information for predicting its function (4). Functional sites are dispersed along a protein's amino acid sequence, but upon folding are brought into close spatial proximity. In an enzyme, for instance, a ligand binds to a pocket on the surface of the protein, and the structure of the pocket largely determines which ligands can interact with the enzyme. Structural genomics projects have been started to assess the function of these unstudied proteins. However, one cannot determine every protein 3D structure within a reasonable time, and therefore homology modeling will play an important role in the coming era of structural genomics (5). Thus, assessing the proportion of ORFs whose protein 3D structures can be modeled by present homology modeling methods is important both for evaluating the methods and for choosing target sequences for structural genomics. Appropriate target selection for structural genomics will effectively increase the number of template structures available for homology modeling.

We developed enlarged FAMSBASE, a database of protein homology models for the whole genomes of 41 species, by expanding the former FAMSBASE, which covered the genomes of two species (6,7). The details of FAMSBASE will be published elsewhere (Umeyama et al., in preparation). In this report, we describe the features and statistics of enlarged FAMSBASE.

FEATURES OF FAMSBASE

FAMSBASE is a PostgreSQL-driven relational database. Homology modeling requires template searching, sequence alignment between template and target sequences, and model building. In FAMSBASE, template searching and sequence alignment are wholly based on the GTOP database (8). In the 2001 version of the GTOP database, the whole genome sequences of 41 species were processed through PSI-BLAST analysis (9) against the amino acid sequences of proteins in the Protein Data Bank (PDB) (10). ORFs with PSI-BLAST E-values of less than 0.001 were treated as ORFs having template structures. Every ORF with a corresponding 3D structure in PDB is automatically modeled by FAMS (Full Automatic Modeling System) (11), and the atomic coordinates of the models are stored in FAMSBASE. FAMS participated in CAFASP2, the second Critical Assessment of Fully Automated Structure Prediction, and outperformed other methods (12,13). Based on a template protein and a pair-wise alignment found by PSI-BLAST with a threshold E-value of 0.001, FAMS first builds a protein backbone by minimizing the conformational energy with a simulated annealing method, and then generates side chains for each residue. The main chain is then optimized with a constraint on all side chains. This procedure is applied iteratively. The details of the procedure will be explained elsewhere (Umeyama et al., in preparation). FAMS is now accessible at http://physchem.pharm.kitasato-u.ac.jp/. Model building for these ORFs was carried out on 1000 nodes of PC clusters. Details of the operating system will be published elsewhere (Umeyama et al., in preparation).
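As a rough illustration of the template-selection rule described above, the following Python sketch keeps, for each ORF, the PSI-BLAST hits with an E-value below 0.001 and retains at most five of them. The `Hit` record, the identifiers and the choice to rank hits by E-value are assumptions made for illustration; the text states only the threshold and, in the statistics section, that the top five hits are used.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 0.001   # threshold stated in the text
MAX_TEMPLATES = 5        # at most five models per ORF (statistics section)


@dataclass(frozen=True)
class Hit:
    """One PSI-BLAST hit of an ORF against a PDB sequence (hypothetical record)."""
    orf_id: str
    template_id: str
    e_value: float


def select_templates(hits, cutoff=E_VALUE_CUTOFF, limit=MAX_TEMPLATES):
    """Return, per ORF, up to `limit` hits with E-value below `cutoff`, best first."""
    by_orf = {}
    for hit in hits:
        if hit.e_value < cutoff:
            by_orf.setdefault(hit.orf_id, []).append(hit)
    # Ranking by ascending E-value is an assumption about what "top hits" means.
    return {
        orf_id: sorted(orf_hits, key=lambda h: h.e_value)[:limit]
        for orf_id, orf_hits in by_orf.items()
    }


if __name__ == "__main__":
    # Made-up identifiers and E-values, for illustration only.
    demo = [
        Hit("orf1", "tmplA", 1e-40),
        Hit("orf1", "tmplB", 5e-4),
        Hit("orf1", "tmplC", 0.02),   # rejected: E-value not below 0.001
        Hit("orf2", "tmplD", 0.5),    # rejected: orf2 ends up with no template
    ]
    for orf_id, chosen in select_templates(demo).items():
        print(orf_id, [h.template_id for h in chosen])
```

In this sketch an ORF with no hit below the cutoff simply does not appear in the result, which corresponds to an ORF without a template structure.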

Enlarged FAMSBASE is located at http://famsbase.bio.nagoya-u.ac.jp/famsbase/ and is freely accessible from academic sites. Access from companies is restricted. In enlarged FAMSBASE, one can find the protein 3D structure of a given ORF by gene name, PDB ID of the template, or keywords; alternatively, one can search the modeled structures using the FASTA sequence search tool (14) (Fig. 1). A search can also be performed using the names of PDB heterogen atoms. Protein 3D structures are often determined together with non-protein molecules, such as ATP, DNA and heme. When template structures for modeling include heterogen molecules, the modeled proteins may also bind similar molecules. Given the name of a heterogen molecule, one can find ORFs whose 3D structure templates contain that molecule, such as an ATP molecule on a transporter (Fig. 2). This information may suggest functionally important sites of the proteins encoded by the ORFs. Other analyses, such as checking for conserved amino acid residues at the putative heterogen-binding sites and calculating binding energy, should also be performed for rigorous binding site prediction.
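The heterogen search can be pictured as a join between models and the heterogen molecules recorded for their templates. The sketch below is only a guess at how such a query might look: the actual FAMSBASE schema is not described here, so the table names, column names and rows are all hypothetical, and Python's built-in `sqlite3` module stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical two-table layout; the real FAMSBASE schema is not given in the text.
SCHEMA = """
CREATE TABLE model (
    model_id    INTEGER PRIMARY KEY,
    orf_name    TEXT NOT NULL,
    template_id TEXT NOT NULL
);
CREATE TABLE template_heterogen (
    template_id TEXT NOT NULL,
    het_name    TEXT NOT NULL
);
"""


def orfs_with_heterogen(conn, het_name):
    """List (orf_name, template_id) pairs whose template was solved with `het_name`."""
    rows = conn.execute(
        """
        SELECT DISTINCT m.orf_name, m.template_id
        FROM model AS m
        JOIN template_heterogen AS h ON h.template_id = m.template_id
        WHERE h.het_name = ?
        ORDER BY m.orf_name, m.template_id
        """,
        (het_name,),
    )
    return rows.fetchall()


def build_demo_db():
    """Create an in-memory database with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO model (orf_name, template_id) VALUES (?, ?)",
        [("orfA", "tmpl1"), ("orfB", "tmpl2"), ("orfC", "tmpl1")],
    )
    conn.executemany(
        "INSERT INTO template_heterogen VALUES (?, ?)",
        [("tmpl1", "ATP"), ("tmpl2", "HEM")],
    )
    return conn


if __name__ == "__main__":
    print(orfs_with_heterogen(build_demo_db(), "ATP"))
```

A hit from such a query says only that the template was solved with the molecule; as noted above, it does not by itself establish that the modeled protein binds it.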

Figure 1.

The FAMSBASE website. The names of species whose genome sequences are available are listed at the top of the page. Search tools are listed at the bottom of the page.

Figure 2.

Predicted interaction between a modeled structure and ATP. Enlarged FAMSBASE can be searched by the names of heterogen molecules attached to template structures. When enlarged FAMSBASE is searched for ‘ATP’, ORFs whose template 3D structures were solved with ATP are listed. The 3D structure can be shown with the heterogen atoms. Note that the location of the heterogen atoms was not optimized against the modeled 3D structure. The model structure is shown in yellow and ATP is shown in colors that distinguish its atom types.

STATISTICS IN FAMSBASE

Enlarged FAMSBASE contains protein 3D structure models for the whole genomes of 41 species (Table 1). The number of ORFs with a 3D structure is now 51 430, which is about 42% of all ORFs of the 41 species (Table 1). The percentage of ORFs with 3D structures in the bacteriophage T4 genome is small compared with that of other genomes. This is due to the sequence diversity of proteins encoded by the bacteriophage genome, and may reflect the distinct evolution of this organism. In enlarged FAMSBASE, each ORF has at most five 3D structure models. The models were created based on the top five PSI-BLAST hits against PDB, as shown in GTOP. When there were fewer than five hits, all the hits were used as templates. The average number of models for each ORF was about three. A user can compare the models for a single ORF to judge which 3D structure is reliable. When the modeled structures are completely different from one another, even though the models are supposed to be of the same domain, the modeled structure is unreliable. The number of models in the current FAMSBASE is 183 805.

Table 1. Species and proportion of protein 3D structures in enlarged FAMSBASE.

Organism # model # ORF # modeled ORF (%)
Aeropyrum pernix (aero) 2177 2694 620 (23.0)
Archaeoglobus fulgidus (aful) 3530 2407 996 (41.4)
Halobacterium sp. (hbsp) 2845 2058 843 (41.0)
Methanococcus jannaschii (mjan) 2471 1715 698 (40.7)
Methanobacterium thermoautotrophicum (mthe) 2789 1869 798 (42.7)
Pyrococcus abyssi (pabys) 2818 1765 807 (45.7)
Pyrococcus horikoshii (pyro) 2524 2061 734 (35.6)
Thermoplasma acidophilum (tacid) 2469 1478 713 (48.2)
Aquifex aeolicus (aqua) 2574 2694 749 (27.8)
Borrelia burgdorferi (bbur) 1467 1255 451 (35.9)
Bacillus halodurans (bhal) 6314 4066 1768 (43.5)
Bacillus subtilis (bsub) 6346 4100 1794 (43.8)
Buchnera sp. APS (buch) 1208 574 372 (64.8)
Campylobacter jejuni (cjej) 2715 1634 793 (48.5)
Chlamydophila pneumoniae (cpneu) 1442 1052 434 (41.3)
Chlamydia trachomatis (ctra) 1378 894 405 (45.3)
Chlamydia muridarum (ctraM) 1406 909 422 (46.4)
Deinococcus radiodurans (drad) 4454 3102 1282 (41.3)
Escherichia coli (ecoli) 6922 4289 1997 (46.6)
Escherichia coli O157:H7 (ecoli_O157) 7557 5349 2228 (41.7)
Haemophilus influenzae (hinf) 2886 1709 860 (50.3)
Helicobacter pylori (hpyl) 2211 1566 643 (41.1)
Lactococcus lactis (llact) 3733 2266 1088 (48.0)
Mycoplasma genitalium (mgen) 834 480 252 (52.5)
Mycoplasma pneumoniae (mpneu) 944 688 283 (41.1)
Mycobacterium tuberculosis (mtub) 6033 4066 1722 (42.4)
Neisseria meningitidis (nmen) 2951 2025 853 (42.1)
Pseudomonas aeruginosa (paer) 9238 5565 2659 (47.8)
Pasteurella multocida (pmul) 3557 2014 1045 (51.9)
Rickettsia prowazekii (rpxx) 1389 834 413 (49.5)
Synechocystis sp. PCC6803 (syne) 4876 3167 1364 (43.1)
Thermotoga maritima (tmar) 2998 1864 866 (46.5)
Treponema pallidum (tpal) 1410 1031 402 (39.0)
Ureaplasma urealyticum (uure) 947 611 290 (47.5)
Vibrio cholerae (vcho) 5655 3828 1634 (42.7)
Xylella fastidiosa (xfas) 3328 2831 971 (34.3)
Caenorhabditis elegans (cele) 27 297 19 730 7408 (37.6)
Drosophila melanogaster (dmel) 25 011 14 335 6345 (44.3)
Homo sapiens (huge) 3760 1771 930 (52.5)
Saccharomyces cerevisiae (yst) 9230 6305 2453 (38.9)
Bacteriophage T4 (t4) 111 275 45 (16.4)
Total 183 805 122 926 51 430 (41.8)
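The percentage column of Table 1 follows directly from the ORF counts, and the average number of models per modeled ORF from the first and third columns. The short Python sketch below recomputes both for two rows and for the totals; the function names are ours, and the counts are copied from Table 1.

```python
def percent_modeled(n_orf, n_modeled_orf):
    """Percentage of ORFs that have at least one model, to one decimal place."""
    return round(100.0 * n_modeled_orf / n_orf, 1)


def models_per_modeled_orf(n_model, n_modeled_orf):
    """Mean number of models per ORF that has any model, to two decimal places."""
    return round(n_model / n_modeled_orf, 2)


# (number of models, number of ORFs, number of modeled ORFs), copied from Table 1.
ROWS = {
    "Escherichia coli": (6922, 4289, 1997),
    "Buchnera sp. APS": (1208, 574, 372),
    "Total": (183805, 122926, 51430),
}

if __name__ == "__main__":
    for name, (n_model, n_orf, n_modeled) in ROWS.items():
        print(name, percent_modeled(n_orf, n_modeled),
              models_per_modeled_orf(n_model, n_modeled))
```

For the totals this gives 41.8% of ORFs modeled and roughly 3.6 models per modeled ORF, in line with the "at most five" limit on models per ORF.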

When the 3D structure of each ORF is checked in detail, one finds that only a few ORFs are fully modeled. Most 3D structure models cover only parts of the ORFs, which are supposed to represent domains (Fig. 3). This situation, however, differs among superkingdoms. In archaeal and eubacterial genomes, more than 50% of all ORFs have 3D structures covering more than 80% of their length. On average, 71% of each ORF in archaea and 68% of each ORF in eubacteria are modeled. In eukaryotic genomes, however, less than 40% of ORFs have 3D structures covering more than 80% of their length, and on average only 39% of each ORF is modeled. This is a consequence of the multi-domain structure of proteins in eukaryotes (15). Furthermore, it indicates that our knowledge of eukaryotic proteins is not sufficient to understand the whole structure of single proteins in eukaryotes. Knowledge of domain–domain interactions within single ORFs in eukaryotic proteins will be required soon. Even with X-ray crystallography, structural determination of an entire eukaryotic protein is a difficult task because of its large mass.
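One way to make the notion of a "modeled portion" concrete is to count the residues of an ORF that fall inside at least one modeled segment. The Python sketch below does this under the assumption that coverage is the union of the modeled residue ranges; the text does not state how overlapping models were counted, so this rule and the function name are illustrative only.

```python
def covered_fraction(orf_length, segments):
    """Fraction of an ORF's residues covered by at least one modeled segment.

    `segments` holds 1-based inclusive (start, end) residue ranges; overlapping
    ranges are merged so that no residue is counted twice.
    """
    if orf_length <= 0:
        raise ValueError("orf_length must be positive")
    covered = 0
    last_end = 0  # highest residue index counted so far
    for start, end in sorted(segments):
        start = max(start, last_end + 1, 1)   # skip residues already counted
        end = min(end, orf_length)            # clip to the ORF
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / orf_length


if __name__ == "__main__":
    # A 300-residue ORF with two overlapping modeled regions (made-up numbers).
    fraction = covered_fraction(300, [(1, 100), (50, 150)])
    print(fraction, fraction > 0.8)
```

With such a function, an ORF would count toward the "more than 80%" class discussed above when the returned fraction exceeds 0.8.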

Figure 3.

Percentage of modeled portions of each ORF. Differences in the coverage of ORFs by 3D structure are shown in different colors, as explained on the right side of the figure. White means less than 10% of an amino acid sequence, light gray means more than 10% but less than 20%, dark gray means more than 20% but less than 30%, and so on. In archaeal and eubacterial genomes, more than half of the ORFs are modeled over more than 90% of their sequences. In eukaryotic genomes, however, less than 30% of the ORFs are modeled over more than 90% of their sequences. This is because eukaryotic proteins have long amino acid sequences and multi-domain organization (15).

The superfamilies of modeled structures differ among the three superkingdoms (Table 2). The structures are classified according to SCOP (16). The most common model in all three superkingdoms is a P-loop protein. After the P-loop, the most common superfamilies differ among the superkingdoms. In eukaryotes, protein kinase, homeodomain, EGF/Laminin and nuclear receptor models are included in the top ten entries, and all of these domains are known to diverge in eukaryotic genomes (17). This distribution is similar to that reported by Koonin et al. (18) based on whole-genome protein fold assignment. Enlarged FAMSBASE provides coordinates of each protein within a superfamily and so offers a chance to analyze the differences among proteins of the same superfamily.

Table 2. Top 10 most common 3D structures among the three superkingdoms of life.

No. of models Superfamily name
Archaea  
 2960 P-loop containing nucleotide triphosphate hydrolases
 913 4Fe–4S ferredoxins
 828 S-adenosyl-L-methionine-dependent methyltransferases
 699 PLP-dependent transferases
 566 NAD(P)-binding Rossmann-fold domains
 547 Metallo-hydrolase/oxidoreductase
 433 FAD/NAD-linked reductases, dimerisation (C-terminal) domain
 404 Class II aaRS and biotin synthetases
 343 Nucleotidylyl transferase
 339 Nucleotide-diphospho-sugar transferases
Eubacteria  
 11 387 P-loop containing nucleotide triphosphate hydrolases
 3718 NAD(P)-binding Rossmann-fold domains
 2837 CheY-like
 2805 S-adenosyl-L-methionine-dependent methyltransferases
 2699 PLP-dependent transferases
 2066 Periplasmic binding protein-like II
 1790 Thioredoxin-like
 1639 alpha/beta-Hydrolases
 1553 FAD/NAD-linked reductases, dimerisation (C-terminal) domain
 1414 Class II aaRS and biotin synthetases
Eukaryote  
 5203 P-loop containing nucleotide triphosphate hydrolases
 4601 Protein kinase-like (PK-like)
 2910 C2H2 and C2HC zinc fingers
 1842 EGF/Laminin
 1487 alpha/beta-Hydrolases
 1418 RNA-binding domain, RBD
 1365 NAD(P)-binding Rossmann-fold domains
 1312 Thioredoxin-like
 1303 Nuclear receptor ligand-binding domain
 1149 Homeodomain-like

ACCURACY OF THE MODELS

The accuracy of modeled structures is known to depend on the level of sequence identity between target and template proteins (19). The distribution of sequence identities in enlarged FAMSBASE is given in Figure 4. About a quarter of all models have more than 25% sequence identity. The reliability of the models is expressed by Hubbard plots (20) (Fig. 5). Since the current enlarged FAMSBASE was built, the 3D structures of some target proteins have been determined experimentally. Comparison of the models in enlarged FAMSBASE with these experimental 3D structures is, therefore, a good blind test. When sequence identity is more than 25%, the model is reasonably good, with a few exceptions. Of the 212 tested models, 181 (85%) have an RMSD (root mean square deviation) of less than 3.0 Å over at least 90% of the entire structure.
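The acceptance criterion used in this blind test (RMSD below 3.0 Å over at least 90% of the structure) can be sketched in a few lines of Python. The version below is a deliberate simplification: it assumes the per-residue deviations come from a single fixed superposition and simply drops the worst-fitting residues, whereas a real Hubbard plot (20) searches for the best superposition at each number of residues. The function names are ours.

```python
import math


def rms_coverage_curve(deviations):
    """Return [(n, rmsd_n)] for the n residues with the smallest deviations.

    `deviations` are per-residue distances in Å between model and experimental
    structure after one fixed superposition. A true Hubbard plot re-optimizes
    the superposition for every subset; this simplification does not.
    """
    curve = []
    sum_sq = 0.0
    for n, d in enumerate(sorted(deviations), start=1):
        sum_sq += d * d
        curve.append((n, math.sqrt(sum_sq / n)))
    return curve


def passes_criterion(deviations, max_rmsd=3.0, min_coverage=0.9):
    """True if the best-fitting min_coverage share of residues has RMSD < max_rmsd."""
    if not deviations:
        return False
    needed = math.ceil(min_coverage * len(deviations))
    # The curve never decreases with n, so checking it at `needed` is sufficient.
    return rms_coverage_curve(deviations)[needed - 1][1] < max_rmsd


if __name__ == "__main__":
    # Nine well-placed residues and one badly placed loop residue (made-up values).
    print(passes_criterion([1.0] * 9 + [20.0]))
    print(passes_criterion([5.0] * 10))
```

The first example passes even though its all-residue RMSD exceeds 3.0 Å, because the criterion tolerates up to 10% poorly modeled residues.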

Figure 4.

Distribution of sequence identity between target and template sequences in enlarged FAMSBASE. Sequence identities are shown by color, as explained on the right side of the figure. White means template and target sequences have less than 10% sequence identity, light gray means between 10 and 20%, and so on. Models with less than 20% sequence identity occupy about half of the database. Structural genomics projects are expected to provide better templates for genome-wide comparative modeling.

Figure 5.

Hubbard plots comparing 212 modeled 3D structures with the corresponding experimental structures, for sequence identity of more than 25%. The 3D structures of the 212 proteins were determined after enlarged FAMSBASE was built. The horizontal axis is the number of superimposed residues and the vertical axis is the best root mean square deviation obtained for that number of superimposed residues. A precise 3D model has a small RMSD for superimposition of many residues. An unreliable 3D model has a large RMSD even for superimposition of a few residues. See reference 20 for details.

About 75% of the models in enlarged FAMSBASE have less than 25% sequence identity. Even models based on low sequence identity can support appropriate analyses (19,21). In one case, a homology model based on an alignment of less than 18% sequence identity yielded a significant biological result (22). Hubbard plots comparing the modeled protein 3D structures in enlarged FAMSBASE with the experimental target 3D structures that were reported after FAMSBASE was built and show less than 25% sequence identity are given in Figure 6. Of 237 examined models, 73 (31%) have an RMSD of less than 3.0 Å over at least 90% of the entire structure. The blind test suggests that about 31% of the modeled structures with sequence identity less than 25%, that is 37 428 out of 120 737 modeled structures, are reasonably accurate.
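The figures quoted for the two blind tests, and the extrapolation to 37 428 models, can be reproduced with simple integer arithmetic. The Python sketch below does so; the function names are ours, and applying the rounded 31% to the 120 737 low-identity models (rather than the exact ratio 73/237) is our reading of how the figure in the text was obtained.

```python
def percent(part, whole):
    """Whole-number percentage, rounded to the nearest integer."""
    return round(100 * part / whole)


def extrapolate(percent_value, population):
    """Apply a whole-number percentage to a population, truncating to an integer."""
    return percent_value * population // 100


if __name__ == "__main__":
    high = percent(181, 212)   # blind test, more than 25% sequence identity
    low = percent(73, 237)     # blind test, less than 25% sequence identity
    print(high, low, extrapolate(low, 120737))
```

Because the extrapolation rests on 237 tested models, the resulting count is best read as an order-of-magnitude estimate.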

Figure 6.

Hubbard plots of modeled 3D structures and real structures with sequence identity less than 25%. The 3D structures of 237 proteins were determined after building enlarged FAMSBASE.

Even with FAMS, protein 3D structures could be modeled for only about 42% of ORFs. To generate protein 3D models for all the ORFs encoded in a genome, two efforts are underway. One is to let structural genomics projects solve protein structures that can be used as templates for a wide range of proteins. The other is to further improve homology modeling methods so that researchers can build highly reliable model structures from templates with less than 20% sequence identity. With both efforts, the information from genome sequences will begin to be used for biologically important issues, such as functional site analyses, ligand docking and protein–protein interactions.

FUTURE DIRECTIONS

FAMSBASE will be expanded by increasing the number of genomes with protein 3D structures.

ACKNOWLEDGEMENTS

This work was supported by a Grant-in-Aid for Scientific Research on Priority Area (C), Genome Information Science from the Ministry of Education, Culture, Sports, Science and Technology of Japan. The authors are grateful to Dr. Ken Nishikawa of NIG for the gift of genome sequence information of 41 species processed through PSI-BLAST analysis against amino acid sequences of proteins in PDB.

REFERENCES

1. Nierman,W.C., Eisen,J.A., Fleischmann,R.D. and Fraser,C.M. (2000) Genome data: what do we learn? Curr. Opin. Struct. Biol., 10, 343–348.
2. Kim,S.-H. (2000) Structural genomics of microbes: an objective. Curr. Opin. Struct. Biol., 10, 380–383.
3. Searls,D.B. (2000) Bioinformatics tools for whole genomes. Annu. Rev. Genomics Hum. Genet., 1, 251–279.
4. Domingues,F.S., Koppensteiner,W.A. and Sippl,M.J. (2000) The role of protein structure in genomics. FEBS Lett., 476, 98–102.
5. Brenner,S.E. (2001) A tour of structural genomics. Nature Rev. Genet., 2, 801–809.
6. Ebisawa,K., Iwadate,M., Takeda-Shitaka,M., Kurihara,Y., Ishii,T., Ota,M., Kawabata,T., Nishikawa,K., Mitsuhashi,M., Oyama,A., Asogawa,M., Yanagida,S., Okumura,C., Sugio,S., Matsuzaki,T., Takahashi,M., Suzuki,E., Tanimura,R., Aoki,T., Saito,S. and Umeyama,H. (2000) 3D–1D modeling of all proteins encoded in the genome. Research and Development for Accelerating the Construction of the Infrastructure for Biological Resource Information (Bio-informatics), April 1999 to March 2000. Japan Bio-industry Association Foundation, Tokyo, Japan, pp. 633–666.
7. Ebisawa,K., Iwadate,M., Takeda-Shitaka,M., Kurihara,Y., Ishii,T., Ota,M., Kawabata,T., Nishikawa,K., Mitsuhashi,M., Oyama,A., Asogawa,M., Yanagida,S., Okumura,C., Sugio,S., Matsuzaki,T., Takahashi,M., Suzuki,E., Tanimura,R., Aoki,T., Saito,S. and Umeyama,H. (2000) FAMSBASE: construction of model data base by homology analysis and full automatic comparative modeling for all ORFs of Escherichia coli and Bacillus subtilis. 28th Symposium on Structure–Activity Relationships, Pharmaceutical Society of Japan. October 26–27, Kyoto, Japan, pp. 222–225, 343.
8. Kawabata,T., Fukuchi,S., Homma,K., Ota,M., Araki,J., Ito,T., Ichiyoshi,N. and Nishikawa,K. (2002) GTOP: a database of protein structures predicted from genome sequences. Nucleic Acids Res., 30, 294–298.
9. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
10. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
11. Ogata,K. and Umeyama,H. (2000) An automatic homology modeling method consisting of database searches and simulated annealing. J. Mol. Graph. Model., 18, 258–272.
12. Iwadate,M., Ebisawa,K. and Umeyama,H. (2001) Comparative Modeling of CAFASP2 competition. Chem-Bio Informatics J., 1, 136–148.
13. Fischer,D., Elofsson,A., Rychlewski,L., Pazos,F., Valencia,A., Rost,B., Ortiz,A.R. and Dunbrack,R.L.Jr (2001) CAFASP2: The second critical assessment of fully automated structure prediction methods. Proteins, 5 (Suppl.), 171–183.
14. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
15. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
16. Lo Conte,L., Brenner,S.E., Hubbard,T.J., Chothia,C. and Murzin,A.G. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–267.
17. Ponting,C.P., Schultz,J., Copley,R.R., Andrade,M.A. and Bork,P. (2000) Evolution of domain families. Adv. Protein Chem., 54, 185–244.
18. Koonin,E.V., Wolf,Y.I. and Aravind,L. (2000) Protein Fold recognition using sequence profiles and its application in structural genomics. Adv. Protein Chem., 54, 245–275.
19. Baker,D. and Sali,A. (2001) Protein structure prediction and structural genomics. Science, 294, 93–96.
20. Hubbard,T.J. (1999) RMS/coverage graphs: a qualitative method for comparing three-dimensional protein structure predictions. Proteins: Struct. Func. Genet., 3 (Suppl.), 15–21.
21. Irving,J.A., Whisstock,J.C. and Lesk,A.M. (2001) Protein structural alignments and functional genomics. Proteins, 42, 378–382.
22. Smith,B.J., Lawrence,M.C. and Colman,P.M. (2002) Modelling the structure of the fusion protein from human respiratory syncytial virus. Protein Eng., 15, 365–371.
