Enlarged FAMSBASE: protein 3D structure models of genome sequences for 41 species

Akihiro Yamaguchi; Mitsuo Iwadate; Ei-ichiro Suzuki; Kei Yura; Shigetsugu Kawakita; Hideaki Umeyama; Mitiko Go

Nucleic Acids Res. 2003 Jan 1;31(1):463–468. doi: 10.1093/nar/gkg117

PMCID: PMC165564 PMID: 12520053

Abstract

Enlarged FAMSBASE is a relational database of comparative protein structure models for the whole genomes of 41 species presented in the GTOP database. The models are calculated by the Full Automatic Modeling System (FAMS). Enlarged FAMSBASE provides a wide range of query keys, such as the name of an ORF (open reading frame), ORF keywords, Protein Data Bank (PDB) ID, PDB heterogen atoms and sequence similarity. Heterogen atoms in PDB include cofactors, ligands and other factors that interact with proteins, and are a good starting point for analyzing interactions between proteins and other molecules. The data may also work as a template for drug design. FAMSBASE currently holds 183 805 protein 3D models for 51 430 ORFs, an average of about three models per ORF. FAMSBASE is available at http://famsbase.bio.nagoya-u.ac.jp/famsbase/.

INTRODUCTION

Genome sequencing projects have generated an enormous amount of protein sequence information (1). About half of the encoded amino acid sequences are for proteins of unknown function (2), and computational and experimental methods have been developed to obtain functional information on these proteins (3). Proteins function only when they fold correctly, and the three-dimensional (3D) structure of a protein is one of the most important pieces of information for predicting its function (4). Functional sites are dispersed in a protein's amino acid sequence, but upon folding are placed in close spatial relationship. In an enzyme, for instance, a ligand binds to a pocket on the surface of the protein, and the structure of the pocket largely determines which ligands can interact with the enzyme. In order to assess the function of these unstudied proteins, structural genomics projects have been started. However, one cannot determine every protein 3D structure within a reasonable time, and therefore homology modeling will play an important role in the coming era of structural genomics (5). Thus, assessing the ratio of ORFs whose protein 3D structures can be modeled by the present homology modeling methods is important both for evaluating the methods and for selecting target sequences for structural genomics. Appropriate target selection in structural genomics will effectively increase the number of template structures available for homology modeling.

We developed enlarged FAMSBASE, a database of protein homology models for the whole genomes of 41 species, by expanding the former FAMSBASE, which covered the genomes of two species (6,7). The details of FAMSBASE will be published elsewhere (Umeyama et al., in preparation). In this report, we describe the features and statistics of enlarged FAMSBASE.

FEATURES OF FAMSBASE

FAMSBASE is a PostgreSQL-driven relational database. Homology modeling requires template searching, sequence alignment between template and target sequences, and model building. In FAMSBASE, template searching and sequence alignment are wholly based on the GTOP database (8). In the 2001 version of the GTOP database, the whole genome sequences of 41 species were processed through PSI-BLAST analysis (9) against the amino acid sequences of proteins in the Protein Data Bank (PDB) (10). ORFs in genome sequences with PSI-BLAST E-values of less than 0.001 were treated as ORFs having template structures. Every ORF with a corresponding 3D structure in PDB is automatically modeled by FAMS (Full Automatic Modeling System) (11), and the atomic coordinates of such models are stored in FAMSBASE. FAMS participated in CAFASP2, the second Critical Assessment of Fully Automated Structure Prediction, and outperformed other methods (12,13). Based on a template protein and a pair-wise alignment found by PSI-BLAST with a threshold E-value of 0.001, FAMS first builds a protein backbone by minimizing the conformational energy with a simulated annealing method, and then generates side chains for each residue. The main chain is then optimized with a constraint on all side chains. This procedure is applied iteratively. The details of the procedure will be explained elsewhere (Umeyama et al., in preparation). FAMS is now accessible at http://physchem.pharm.kitasato-u.ac.jp/. Model building of those ORFs has been carried out on 1000 nodes of PC clusters. Details of the operating system will be published elsewhere (Umeyama et al., in preparation).

Enlarged FAMSBASE is located at http://famsbase.bio.nagoya-u.ac.jp/famsbase/ and is freely accessible from academic sites. Access from companies is restricted. In enlarged FAMSBASE, one can find a protein 3D structure of a certain ORF by gene name, PDB ID of the template, or keywords. Alternatively, one can search the modeled structures using the FASTA sequence search tool (14) (Fig. 1). In enlarged FAMSBASE, a search can also be performed using names of PDB heterogen atoms. Protein 3D structures are often determined with non-protein molecules, such as ATP, DNA and heme. When template structures for modeling include heterogen molecules, the modeled proteins may also bind similar molecules. In enlarged FAMSBASE, given the name of a heterogen molecule, one can find ORFs whose 3D structure templates have that heterogen molecule, such as an ATP molecule on a transporter (Fig. 2). This information may suggest functionally important sites of the proteins encoded by the ORFs. Other analyses, such as checking for conserved amino acid residues at the putative heterogen-binding sites and calculating binding energy, should also be performed for rigorous binding site prediction.
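The heterogen search described above can be pictured as a join between models and the heterogen molecules recorded for their templates. The following is a minimal sketch only: the two-table layout, the column names and the rows are hypothetical (the actual FAMSBASE schema is not given in this report), and Python's built-in sqlite3 stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical two-table layout; the real FAMSBASE schema is not described here.
SCHEMA = """
CREATE TABLE model (
    orf_id      TEXT,   -- ORF that was modeled
    species     TEXT,   -- species abbreviation
    template_id TEXT    -- template structure used for the model
);
CREATE TABLE template_heterogen (
    template_id TEXT,   -- template structure
    het_name    TEXT    -- heterogen molecule present in that template
);
"""


def build_demo_db():
    """Create an in-memory database filled with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Placeholder rows for illustration, not FAMSBASE records.
    conn.executemany("INSERT INTO model VALUES (?, ?, ?)", [
        ("orf_001", "ecoli", "tmplA"),
        ("orf_002", "ecoli", "tmplB"),
        ("orf_003", "bsub", "tmplA"),
    ])
    conn.executemany("INSERT INTO template_heterogen VALUES (?, ?)", [
        ("tmplA", "ATP"),
        ("tmplB", "HEM"),
    ])
    return conn


def orfs_by_heterogen(conn, het_name):
    """Return ORFs whose template structure contains the named heterogen."""
    rows = conn.execute(
        """
        SELECT DISTINCT m.orf_id
        FROM model AS m
        JOIN template_heterogen AS h ON h.template_id = m.template_id
        WHERE h.het_name = ?
        ORDER BY m.orf_id
        """,
        (het_name,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
atp_orfs = orfs_by_heterogen(conn, "ATP")
print(atp_orfs)
```

A hit from such a query only says that the template was solved with the molecule; as noted above, residue conservation and binding energy still need to be checked before a binding site is predicted.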

Figure 1. The FAMSBASE website. Species names whose genome sequences are available are listed on the top page. Search tools are listed at the bottom of the page.

Figure 2. Predicted interactions of modeled structure and ATP. The enlarged FAMSBASE can be searched by names of heterogen molecules attached to template structures. When enlarged FAMSBASE is searched by ‘ATP’, ORFs whose template 3D structures were solved with ATP are listed. The 3D structure can be shown with the heterogen atoms. Note that the location of heterogen atoms was not optimized using the modeled 3D structures. A model structure is shown in yellow and ATP is shown in colors that distinguish atom types.

STATISTICS IN FAMSBASE

Enlarged FAMSBASE contains protein 3D structure models for the whole genomes of 41 species (Table 1). The number of ORFs with a 3D structure is now 51 430, which constitutes about 42% of all ORFs of the 41 species (Table 1). The percentage of ORFs with 3D structures in the bacteriophage T4 genome is small compared with that of other genomes. This is due to the sequence diversity of proteins encoded by the bacteriophage genome, and may reflect distinct evolution of this organism. In enlarged FAMSBASE, each ORF has at most five 3D structure models. The models were created based on the top five hits using PSI-BLAST against PDB, as shown in GTOP. When the number of hits was less than five, all the hits were used as templates. The average number of models for each ORF was three. A user can compare the models for a single ORF to judge which 3D structure is reliable. When the modeled structures are completely different from one another, even though the models are supposed to be of the same domain, the modeled structure is unreliable. The number of models in the current FAMSBASE is 183 805.
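The template selection rule (PSI-BLAST E-value below 0.001, at most five templates per ORF) can be sketched as follows. This is an illustration under assumptions, not the GTOP or FAMS code: the hit tuples and the identifiers are hypothetical, and ranking the hits by ascending E-value is an assumption, since the text only says "top five hits".

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.001  # threshold stated in the text
MAX_TEMPLATES = 5       # at most five models per ORF


def select_templates(hits):
    """Pick up to five templates per ORF from PSI-BLAST-like hits.

    hits: iterable of (orf_id, template_id, e_value) tuples (hypothetical format).
    Returns {orf_id: [template_id, ...]} ordered by ascending E-value.
    ORFs with no hit below the cutoff are absent from the result.
    """
    by_orf = defaultdict(list)
    for orf_id, template_id, e_value in hits:
        if e_value < E_VALUE_CUTOFF:
            by_orf[orf_id].append((e_value, template_id))

    selected = {}
    for orf_id, scored in by_orf.items():
        scored.sort()  # assumption: "top" means smallest E-value first
        selected[orf_id] = [template for _, template in scored[:MAX_TEMPLATES]]
    return selected


# Made-up hits for illustration only.
demo_hits = [
    ("orfA", "t1", 1e-30), ("orfA", "t2", 1e-12), ("orfA", "t3", 1e-8),
    ("orfA", "t4", 1e-6), ("orfA", "t5", 1e-5), ("orfA", "t6", 1e-4),
    ("orfA", "t7", 0.5),
    ("orfB", "t8", 1e-10), ("orfB", "t9", 2e-4),
    ("orfC", "t10", 0.02),
]
selected = select_templates(demo_hits)
print(selected)
```

In this toy input, orfA keeps its five best templates, orfB keeps both of its hits, and orfC has no template and would therefore not be modeled.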

Table 1. Species and proportion of protein 3D structures in enlarged FAMSBASE.

Organism	No. of models	No. of ORFs	No. of modeled ORFs (%)
Aeropyrum pernix (aero)	2177	2694	620 (23.0)
Archaeoglobus fulgidus (aful)	3530	2407	996 (41.4)
Halobacterium sp. (hbsp)	2845	2058	843 (41.0)
Methanococcus jannaschii (mjan)	2471	1715	698 (40.7)
Methanobacterium thermoautotrophicum (mthe)	2789	1869	798 (42.7)
Pyrococcus abyssi (pabys)	2818	1765	807 (45.7)
Pyrococcus horikoshii (pyro)	2524	2061	734 (35.6)
Thermoplasma acidophilum (tacid)	2469	1478	713 (48.2)
Aquifex aeolicus (aqua)	2574	2694	749 (27.8)
Borrelia burgdorferi (bbur)	1467	1255	451 (35.9)
Bacillus halodurans (bhal)	6314	4066	1768 (43.5)
Bacillus subtilis (bsub)	6346	4100	1794 (43.8)
Buchnera sp. APS (buch)	1208	574	372 (64.8)
Campylobacter jejuni (cjej)	2715	1634	793 (48.5)
Chlamydophila pneumoniae (cpneu)	1442	1052	434 (41.3)
Chlamydia trachomatis (ctra)	1378	894	405 (45.3)
Chlamydia muridarum (ctraM)	1406	909	422 (46.4)
Deinococcus radiodurans (drad)	4454	3102	1282 (41.3)
Escherichia coli (ecoli)	6922	4289	1997 (46.6)
Escherichia coli O157:H7 (ecoli_O157)	7557	5349	2228 (41.7)
Haemophilus influenzae (hinf)	2886	1709	860 (50.3)
Helicobacter pylori (hpyl)	2211	1566	643 (41.1)
Lactococcus lactis (llact)	3733	2266	1088 (48.0)
Mycoplasma genitalium (mgen)	834	480	252 (52.5)
Mycoplasma pneumoniae (mpneu)	944	688	283 (41.1)
Mycobacterium tuberculosis (mtub)	6033	4066	1722 (42.4)
Neisseria meningitidis (nmen)	2951	2025	853 (42.1)
Pseudomonas aeruginosa (paer)	9238	5565	2659 (47.8)
Pasteurella multocida (pmul)	3557	2014	1045 (51.9)
Rickettsia prowazekii (rpxx)	1389	834	413 (49.5)
Synechocystis sp. PCC6803 (syne)	4876	3167	1364 (43.1)
Thermotoga maritima (tmar)	2998	1864	866 (46.5)
Treponema pallidum (tpal)	1410	1031	402 (39.0)
Ureaplasma urealyticum (uure)	947	611	290 (47.5)
Vibrio cholerae (vcho)	5655	3828	1634 (42.7)
Xylella fastidiosa (xfas)	3328	2831	971 (34.0)
Caenorhabditis elegans (cele)	27 297	19 730	7408 (37.6)
Drosophila melanogaster (dmel)	25 011	14 335	6345 (44.3)
Homo sapiens (huge)	3760	1771	930 (52.5)
Saccharomyces cerevisiae (yst)	9230	6305	2453 (38.9)
Bacteriophage T4 (t4)	111	275	45 (16.4)
Total	183 805	122 926	51 430 (41.8)
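As a reading aid for Table 1, the percentage column is the number of modeled ORFs divided by the number of ORFs, and the ratio of the first and last counts gives the number of models per modeled ORF. The short sketch below recomputes both for a few rows copied from the table; the function names are illustrative only.

```python
# (organism, models, ORFs, modeled ORFs) copied from Table 1.
rows = [
    ("Aeropyrum pernix", 2177, 2694, 620),
    ("Escherichia coli", 6922, 4289, 1997),
    ("Buchnera sp. APS", 1208, 574, 372),
    ("Total", 183805, 122926, 51430),
]


def percent_modeled(n_orf, n_modeled):
    """Percentage of ORFs with at least one model, to one decimal place."""
    return round(100.0 * n_modeled / n_orf, 1)


def models_per_modeled_orf(n_model, n_modeled):
    """Average number of models among ORFs that have a model."""
    return n_model / n_modeled


for organism, n_model, n_orf, n_modeled in rows:
    print(organism,
          percent_modeled(n_orf, n_modeled),
          round(models_per_modeled_orf(n_model, n_modeled), 2))
```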

When the 3D structures of individual ORFs are checked in detail, one finds that only a few ORFs are fully modeled. Most 3D structure models cover parts of the ORFs, which are supposed to represent domains (Fig. 3). This situation, however, differs among superkingdoms. In archaeal and eubacterial genomes, more than 50% of all ORFs have 3D structures covering more than 80% of their length. On average, 71% of each ORF in archaea and 68% of each ORF in eubacteria are modeled. In eukaryotic genomes, however, less than 40% of ORFs have 3D structures covering more than 80% of their length. On average, 39% of each ORF is modeled. This is a consequence of the multi-domain structure of proteins in eukaryotes (15). Furthermore, it indicates that our knowledge of eukaryotic proteins is not sufficient to understand the whole structure of single proteins in eukaryotes. Knowledge of domain–domain interactions within single ORFs in eukaryotic proteins will be required soon. Even with X-ray crystallography, structural determination of an entire eukaryotic protein is a difficult task because of its large mass.

Figure 3. Percentage of modeled portions of each ORF. Difference in coverage of ORFs by 3D structure is shown in different colors, as explained on the right side of the figure. White means less than 10% of an amino acid sequence, light gray means more than 10% but less than 20%, dark gray means more than 20% but less than 30% of an amino acid sequence, and so on. In archaeal and eubacterial genomes, more than half of the ORFs are modeled over more than 90% of their sequences. In eukaryotic genomes, however, less than 30% of the ORFs are modeled over more than 90% of their sequences. This is because eukaryotic proteins have long amino acid sequences and multi-domain organization (15).
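The coverage statistics in this paragraph can be illustrated with a small calculation of the fraction of an ORF covered by modeled segments. This is a sketch under stated assumptions: the segment ranges and ORF lengths are made up, and merging overlapping segments from different models into one union is an assumption about how coverage is counted, which the text does not specify.

```python
def covered_fraction(orf_length, segments):
    """Fraction of an ORF covered by the union of modeled segments.

    segments: list of (start, end) residue ranges, 1-based and inclusive.
    Overlapping ranges are counted once.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(segments):
        start = max(start, last_end + 1)  # skip residues already counted
        end = min(end, orf_length)        # clip to the ORF length
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / orf_length


def share_above(orfs, threshold):
    """Share of ORFs whose covered fraction exceeds the threshold."""
    fractions = [covered_fraction(length, segs) for length, segs in orfs]
    return sum(f > threshold for f in fractions) / len(fractions)


# Hypothetical ORFs: (length in residues, modeled segments).
demo_orfs = [
    (300, [(1, 290)]),               # almost fully modeled
    (500, [(20, 180)]),              # single domain of a longer protein
    (250, [(1, 120), (100, 240)]),   # two overlapping models
    (1200, [(400, 650)]),            # one domain of a large multi-domain protein
]
print(share_above(demo_orfs, 0.8))
```

Long multi-domain proteins such as the last entry pull the average coverage down even when the modeled domain itself is well covered, which is the effect described for eukaryotic genomes.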

The superfamilies of modeled structures differ among the three superkingdoms (Table 2). The structures are classified according to SCOP (16). The most common model in all three superkingdoms is a P-loop protein. After the P-loop, the most common folds differ among the superkingdoms. In eukaryotes, protein kinase, homeodomain, EGF/Laminin and nuclear receptor models are included in the top ten entries, and all of these domains are known to diverge in eukaryotic genomes (17). This distribution is similar to that reported based on the whole genome protein fold assignment by Koonin et al. (18). Enlarged FAMSBASE provides coordinates of each protein within the superfamily and provides a chance to analyze the differences among proteins of the same superfamily.

Table 2. Top 10 most common 3D structures among the three superkingdoms of life.

No. of models	Superfamily name
Archaea
2960	P-loop containing nucleotide triphosphate hydrolases
913	4Fe–4S ferredoxins
828	S-adenosyl-L-methionine-dependent methyltransferases
699	PLP-dependent transferases
566	NAD(P)-binding Rossmann-fold domains
547	Metallo-hydrolase/oxidoreductase
433	FAD/NAD-linked reductases, dimerisation (C-terminal) domain
404	Class II aaRS and biotin synthetases
343	Nucleotidylyl transferase
339	Nucleotide-diphospho-sugar transferases
Eubacteria
11 387	P-loop containing nucleotide triphosphate hydrolases
3718	NAD(P)-binding Rossmann-fold domains
2837	CheY-like
2805	S-adenosyl-L-methionine-dependent methyltransferases
2699	PLP-dependent transferases
2066	Periplasmic binding protein-like II
1790	Thioredoxin-like
1639	alpha/beta-Hydrolases
1553	FAD/NAD-linked reductases, dimerisation (C-terminal) domain
1414	Class II aaRS and biotin synthetases
Eukaryotes
5203	P-loop containing nucleotide triphosphate hydrolases
4601	Protein kinase-like (PK-like)
2910	C2H2 and C2HC zinc fingers
1842	EGF/Laminin
1487	alpha/beta-Hydrolases
1418	RNA-binding domain, RBD
1365	NAD(P)-binding Rossmann-fold domains
1312	Thioredoxin-like
1303	Nuclear receptor ligand-binding domain
1149	Homeodomain-like

ACCURACY OF THE MODELS

The accuracy of modeled structures is known to depend on the level of sequence identity between target and template proteins (19). The distribution of sequence identities in enlarged FAMSBASE is given in Figure 4. About a quarter of all models have more than 25% sequence identity. The reliability of the models is expressed by Hubbard plots (20) (Fig. 5). Since building the current enlarged FAMSBASE, the 3D structures of some target proteins have been determined. Comparison of the models in enlarged FAMSBASE with the real 3D structures is, therefore, a good blind test. When sequence identity is more than 25%, the model is reasonably good, with the exception of a few cases. Of the 212 tested models, 181 (85%) have an RMSD (root mean square deviation) of less than 3.0 Å through at least 90% of the entire structure.

Figure 4. Identity distribution between target and template sequences in enlarged FAMSBASE. Sequence identities are shown by color, as explained on the right side of the figure. White means template and target sequences have less than 10% sequence identity, light gray means between 10 and 20%, and so on. Models with less than 20% sequence identity occupy about half of the database. Structural genomics projects are expected to provide better templates for genome-wide comparative modeling.

Figure 5. Hubbard plots of 212 modeled 3D structures and real structures with sequence identity of more than 25%. The 3D structures of 212 proteins were determined after building enlarged FAMSBASE. The horizontal axis is the number of superimposed residues and the vertical axis is the best root mean square deviation given by the number of superimposed residues. A precise 3D model has a small RMSD for superimposition of many residues. An unreliable 3D model has a large RMSD for superimposition of a few residues. See reference 20 for details.
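To make the axes of such a plot concrete, the sketch below builds a simplified RMSD-versus-coverage curve and applies the criterion used in the text (RMSD below 3.0 Å over at least 90% of the structure). It is a simplification and not the procedure of reference 20: it assumes that per-residue deviations from a single fixed superposition are already available, whereas a full RMS/coverage analysis would look for the best superposition at each coverage level. The deviation values are made up.

```python
import math


def rms_coverage_curve(deviations):
    """Return [(n, rmsd)] where rmsd is computed over the n best-fitting residues.

    deviations: per-residue distances in angstroms between model and
    experimental structure after one fixed superposition (an assumption).
    """
    curve = []
    sum_sq = 0.0
    for n, d in enumerate(sorted(deviations), start=1):
        sum_sq += d * d
        curve.append((n, math.sqrt(sum_sq / n)))
    return curve


def passes_criterion(deviations, rmsd_cutoff=3.0, min_coverage=0.9):
    """True if the best min_coverage share of residues has RMSD below the cutoff."""
    if not deviations:
        raise ValueError("no residues to compare")
    n_required = max(1, math.ceil(min_coverage * len(deviations)))
    _, rmsd = rms_coverage_curve(deviations)[n_required - 1]
    return rmsd < rmsd_cutoff


# Made-up deviations: nine well-placed residues and one badly placed loop residue.
good_model = [0.5, 1.0, 1.5, 2.0, 1.0, 0.8, 1.2, 0.9, 1.1, 12.0]
poor_model = [5.0] * 10
curve = rms_coverage_curve(good_model)
print(passes_criterion(good_model), passes_criterion(poor_model))
```

In the first example a single outlier raises the all-residue RMSD above 3.0 Å, yet the model still passes because the criterion only requires 90% of the residues to superimpose well.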

About 75% of the models in enlarged FAMSBASE have less than 25% sequence identity. Even with models based on low sequence identity, appropriate analysis can be performed (19,21). In one case, a homology model based on an alignment of less than 18% sequence identity yielded a significant biological result (22). Hubbard plots between the modeled protein 3D structures in enlarged FAMSBASE and the real target 3D structures, reported after building FAMSBASE and showing less than 25% sequence identity, are shown in Figure 6. Of 237 examined models, 73 (31%) have RMSD less than 3.0 Å through at least 90% of the entire structure. The blind test suggests that at least 31% of the modeled structures with sequence identity less than 25%, that is 37 428 out of 120 737 modeled structures, were reasonably accurate.
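The extrapolation in this paragraph can be retraced with a few lines of arithmetic. The counts are taken from the text; the variable names are illustrative, and applying the rounded 31% rate (rather than the exact ratio 73/237) is an assumption that reproduces the quoted figure of 37 428.

```python
# Blind-test counts quoted in the text.
tested_high, accurate_high = 212, 181   # sequence identity above 25%
tested_low, accurate_low = 237, 73      # sequence identity below 25%
low_identity_models = 120737            # models with less than 25% identity


def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


rate_high = percent(accurate_high, tested_high)
rate_low = percent(accurate_low, tested_low)

# Extrapolation using the rounded rate, as the text appears to do.
estimated_accurate = int(low_identity_models * rate_low / 100)

# Same extrapolation with the unrounded ratio, for comparison.
estimated_unrounded = int(low_identity_models * accurate_low / tested_low)

print(rate_high, rate_low, estimated_accurate, estimated_unrounded)
```

The two estimates differ by a few hundred models, which is small compared with the uncertainty of extrapolating from 237 test cases to the whole low-identity set.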

Figure 6. Hubbard plots of modeled 3D structures and real structures with sequence identity less than 25%. The 3D structures of 237 proteins were determined after building enlarged FAMSBASE.

Even with FAMS, protein 3D structures could be modeled for only about 42% of ORFs. To generate protein 3D models for all the ORFs encoded in a genome, two efforts are underway. One is to let structural genomics projects solve protein structures that can be used as templates for a wide range of proteins. The other is to further improve the method of homology modeling to enable researchers to build highly reliable model structures based on a template of less than 20% sequence identity. With both efforts, the information from genome sequences will begin to be used for biologically important issues, such as functional site analyses, ligand docking and protein–protein interactions.

FUTURE DIRECTIONS

FAMSBASE will be expanded by increasing the number of genomes with protein 3D structures.

ACKNOWLEDGEMENTS

This work was supported by a Grant-in-Aid for Scientific Research on Priority Area (C), Genome Information Science from the Ministry of Education, Culture, Sports, Science and Technology of Japan. The authors are grateful to Dr. Ken Nishikawa of NIG for the gift of genome sequence information of 41 species processed through PSI-BLAST analysis against amino acid sequences of proteins in PDB.

REFERENCES

1. Nierman,W.C., Eisen,J.A., Fleischmann,R.D. and Fraser,C.M. (2000) Genome data: what do we learn? Curr. Opin. Struct. Biol., 10, 343–348.
2. Kim,S.-H. (2000) Structural genomics of microbes: an objective. Curr. Opin. Struct. Biol., 10, 380–383.
3. Searls,D.B. (2000) Bioinformatics tools for whole genomes. Annu. Rev. Genomics Hum. Genet., 1, 251–279.
4. Domingues,F.S., Koppensteiner,W.A. and Sippl,M.J. (2000) The role of protein structure in genomics. FEBS Lett., 476, 98–102.
5. Brenner,S.E. (2001) A tour of structural genomics. Nature Rev. Genet., 2, 801–809.
6. Ebisawa,K., Iwadate,M., Takeda-Shitaka,M., Kurihara,Y., Ishii,T., Ota,M., Kawabata,T., Nishikawa,K., Mitsuhashi,M., Oyama,A., Asogawa,M., Yanagida,S., Okumura,C., Sugio,S., Matsuzaki,T., Takahashi,M., Suzuki,E., Tanimura,R., Aoki,T., Saito,S. and Umeyama,H. (2000) 3D–1D modeling of all proteins encoded in the genome. Research and Development for Accelerating the Construction of the Infrastructure for Biological Resource Information (Bio-informatics), April 1999 to March 2000. Japan Bio-industry Association Foundation, Tokyo, Japan, pp. 633–666.
7. Ebisawa,K., Iwadate,M., Takeda-Shitaka,M., Kurihara,Y., Ishii,T., Ota,M., Kawabata,T., Nishikawa,K., Mitsuhashi,M., Oyama,A., Asogawa,M., Yanagida,S., Okumura,C., Sugio,S., Matsuzaki,T., Takahashi,M., Suzuki,E., Tanimura,R., Aoki,T., Saito,S. and Umeyama,H. (2000) FAMSBASE: construction of model data base by homology analysis and full automatic comparative modeling for all ORFs of Escherichia coli and Bacillus subtilis. 28th Symposium on Structure–Activity Relationships, Pharmaceutical Society of Japan. October 26–27, Kyoto, Japan, pp. 222–225, 343.
8. Kawabata,T., Fukuchi,S., Homma,K., Ota,M., Araki,J., Ito,T., Ichiyoshi,N. and Nishikawa,K. (2002) GTOP: a database of protein structures predicted from genome sequences. Nucleic Acids Res., 30, 294–298.
9. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
10. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
11. Ogata,K. and Umeyama,H. (2000) An automatic homology modeling method consisting of database searches and simulated annealing. J. Mol. Graph. Model., 18, 258–272.
12. Iwadate,M., Ebisawa,K. and Umeyama,H. (2001) Comparative Modeling of CAFASP2 competition. Chem-Bio Informatics J., 1, 136–148.
13. Fischer,D., Elofsson,A., Rychlewski,L., Pazos,F., Valencia,A., Rost,B., Ortiz,A.R. and Dunbrack,R.L.Jr (2001) CAFASP2: The second critical assessment of fully automated structure prediction methods. Proteins, 5 (Suppl.), 171–183.
14. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
15. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
16. Lo Conte,L., Brenner,S.E., Hubbard,T.J., Chothia,C. and Murzin,A.G. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–267.
17. Ponting,C.P., Schultz,J., Copley,R.R., Andrade,M.A. and Bork,P. (2000) Evolution of domain families. Adv. Protein Chem., 54, 185–244.
18. Koonin,E.V., Wolf,Y.I. and Aravind,L. (2000) Protein Fold recognition using sequence profiles and its application in structural genomics. Adv. Protein Chem., 54, 245–275.
19. Baker,D. and Sali,A. (2001) Protein structure prediction and structural genomics. Science, 294, 93–96.
20. Hubbard,T.J. (1999) RMS/coverage graphs: a qualitative method for comparing three-dimensional protein structure predictions. Proteins: Struct. Func. Genet., 3 (Suppl.), 15–21.
21. Irving,J.A., Whisstock,J.C. and Lesk,A.M. (2001) Protein structural alignments and functional genomics. Proteins, 42, 378–382.
22. Smith,B.J., Lawrence,M.C. and Colman,P.M. (2002) Modelling the structure of the fusion protein from human respiratory syncytial virus. Protein Eng., 15, 365–371.
